Build NLP Pipelines with HuggingFace Datasets

Captions
Hi, welcome to this video. We're going to have a look at Hugging Face's Datasets library: some of the datasets I think are most useful, and how we can use the library to build what I think are very good data input pipelines for NLP. So let's get started.

The first thing we want to do is install the library, so we run pip install datasets, and that will install it for you. After this we'll want to go ahead and import datasets, and then we can start having a look at which datasets are available to us. There are two ways to look at all of the datasets. The first is the Hugging Face datasets viewer, which you can find on Google (just type in "datasets viewer"); it's an interactive app which lets you go through and have a look at the different datasets. I've already spoken about that a lot before and it's super easy to use, so we're not going to go through it. Instead we're going to look at how we can view everything in Python, which is the second option.

First, we can simply list all of the datasets. I'm going to write ds_list here, and from this we get, I think, something like 1,400 datasets. If we check the length of that list, yes, it's about 1,400, which is obviously a lot, and some of these are massive as well. For example, if we look at the OSCAR dataset, in ds_list we can write a list comprehension: dataset for dataset in ds_list if 'oscar' is in the dataset name (these are just dataset names). We get oscar pt (what is pt? I imagine it's probably Portuguese) and all of these other user-uploaded OSCAR datasets, and then there is the actual OSCAR dataset hosted by Hugging Face, which is huge: it contains more than 160 languages, and some of them, English obviously being one of the biggest, contain around 1.2 terabytes of data. So there's a lot of data in there, but it's just unstructured text.

What I want to have a look at is the SQuAD datasets. We're just going to use the original SQuAD in this video, but you can see that we have a few different ones here: Italian, Spanish, Korean, a Thai one, and also French at the bottom. So you have plenty of choice. Obviously you kind of need to know what sort of data you're looking for; I know I'm looking for a SQuAD dataset, so I've searched for "squad". There are other ones as well, and if I lowercase the names when searching we'll see those pop up too (one of them doesn't seem to work, but that's fine).

Now, to load one (obviously we're going to be using SQuAD) we write dataset = datasets.load_dataset and then in here we just write our dataset name, 'squad'. There are two ways to download your data. If we do it like this, the default method, we're going to download and cache the whole dataset. For SQuAD that's fine; it's not a huge dataset, so it's not really a problem. But if we wanted the English OSCAR dataset, that's massive, around 1.2 terabytes, and in those cases you probably don't want to download it all onto your machine.
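As a rough sketch of the steps so far (listing what's on the Hub, searching by name, and loading SQuAD with the default download-and-cache behaviour), something like the following should work; note that list_datasets has since been deprecated in more recent releases of the library:

    import datasets

    # list every dataset name available on the Hugging Face Hub
    ds_list = datasets.list_datasets()
    print(len(ds_list))                        # roughly 1,400 at the time of recording

    # search by name, e.g. all SQuAD-style datasets
    print([d for d in ds_list if 'squad' in d.lower()])

    # default behaviour: download and cache the full dataset locally
    dataset = datasets.load_dataset('squad')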
What you can do instead is set streaming equal to True. When streaming is True you do need to make some changes to your code, which I'll show you, and there are also some things, particularly filtering, which we'll cover later on, that we can't do with streaming. For now we're going to use streaming, and we'll switch over to non-streaming later on. This creates an iterable dataset object, and it means that whenever we call a specific record within that dataset, only that single record (or batch of records) is downloaded and held in memory at once. So we're not downloading the whole dataset; we're just processing it as we get it, which I think is very useful.

You can see here that we have two subsets within our data. If we want to select a specific subset, all we have to do is write load_dataset again, this time adding split, which in this case would be 'train' or 'validation'. If I just execute that (I'm not going to store it in our dataset variable here, because I don't want to use only the train set), we get a single iterable dataset object, so we're pulling in just that single subset. We can also view the subsets, train and validation, and if you want to see them in a clearer way you can use dictionary syntax, so dataset.keys(), which shows train and validation.

Now, at the moment we have our dataset but we don't really know anything about it. We have this train subset, and let's say I want to understand what's in there. To start, I can write dataset['train'] followed by dataset_size: how big is it? (I typed data_size at first; it's dataset_size, I don't know what I was doing there.) We see that it's about 90 megabytes, so reasonably big, but nothing huge, nothing crazy. We can also get a description to see what the data is. SQuAD, I didn't even mention it already, is the Stanford Question Answering Dataset; it's generally used for training or testing Q&A models, and you can pause and read the description if you want to.

Another thing that is pretty important is what features we have inside here. We could also just print out one of the samples, but it's useful to know, I think, and this also gives you the data types, which is kind of useful. So we have id, title, context, question and answers; all of them are strings, except answers, which shows up as a sequence. We can view it as a dictionary with a text attribute and also an answer_start attribute. So that's pretty useful to know.

And to view one of our samples: we have all the features here, but let's say we just want to see what one actually looks like. We could index dataset['train'] directly when streaming is set to False, but because we have streaming set to True we can't do that. Instead we just iterate through the dataset: for sample in the dataset, print a single sample, and then break, because I don't want to print any more. So we just print one of those samples.
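A minimal sketch of the streaming load and those inspection steps; I'm assuming the info attributes (dataset_size, description, features) are exposed directly on the streaming object, which can vary between versions of the library:

    import datasets

    # streaming=True: records are fetched lazily instead of downloaded up front
    dataset = datasets.load_dataset('squad', streaming=True)
    print(dataset.keys())                      # dict_keys(['train', 'validation'])

    # or pull just a single split
    train = datasets.load_dataset('squad', split='train', streaming=True)

    # dataset metadata
    print(dataset['train'].dataset_size)       # roughly 90 MB for SQuAD train
    print(dataset['train'].description)        # Stanford Question Answering Dataset ...
    print(dataset['train'].features)           # id, title, context, question, answers

    # a streaming dataset can't be indexed, so iterate and break after one record
    for sample in dataset['train']:
        print(sample)
        break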
We see that we have the id and the title. Each of these samples is being pulled from a different Wikipedia page, and the title is that page's title, so this one is from the University of Notre Dame Wikipedia page. We have answers: further down we're going to ask a question, and in answers we have the text, which is the text answer, and then the position, the character position where the answer starts within the context, which is what you can see here. Then we have the question which we're asking, and the Q&A model is going to extract the answer from our context there. Okay, so we're not going to be training a model in this video or anything like that; we're just experimenting with the Datasets library, so we don't need to worry so much about that.

The first thing I want to do is have a look at how we can modify some of the features in our data. With SQuAD, when we are training a model, one of the first things we would do is take the answer start and the text and use them to get the answer end position as well. So let's go ahead and do that. First I just want to have a look: for sample in the train dataset, I'm going to print out a few of the answers features, so sample['answers'], and I want to enumerate this so I can count how many times we're going through it, and say if i is greater than a few, break, to stop it printing answers forever. Here I'm just viewing the data so we can actually see what we have in there. We have text and we have answer_start, and we want to add answer_end. The way we do that is pretty straightforward: we just take the answer start and add the length of the text to it to get the answer end. Nothing complicated there.
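A quick sketch of that inspection step, with my own example values; note that in SQuAD the text and answer_start fields are lists, so the first answer sits at index 0:

    # peek at the first few answers to see what we're working with
    for i, sample in enumerate(dataset['train']):
        print(sample['answers'])
        # e.g. {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
        if i > 4:
            break

    # answer_end for one sample = start position + length of the answer text
    start = sample['answers']['answer_start'][0]
    end = start + len(sample['answers']['text'][0])
    print(start, end)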
So what we're going to do here is modify the answers feature, and the best way, or at least the most common way, of modifying features or adding new features is to use the map method. It outputs a new dataset, so we write dataset_train = dataset_train.map, and with map we use a lambda, so we write lambda x; in here we're building a lambda function. Now, this is one of the things that changes depending on whether you're using streaming or not. Let me do it for streaming=False initially: when streaming is False, we would just write 'answers' and the modification to that feature. In this case we take the current answers, x['answers'], and merge them with a new dictionary item, answer_end, which is equal to x['answers']['answer_start'] plus the length of x['answers']['text']. This is a little bit messy, I know, but it's just how it is: within answers we take the answer start position and we add the length of the answer text to get our answer end. That's all we would have to write if we were using streaming=False, but we're not; with streaming=True we need to add every other feature in there as well. I'm not sure why this is the case, but it is. All those extra entries are is a direct mapping from the old version to the new dataset, so we don't really need to do anything; we just add id mapped to id and do the same for the other features. So we also have context, which is x['context'], question, which is x['question'], answers is already done of course, and title, so add title in there as well. That gives us id, context, question, answers and title.

With that we should be ready to go, so let's map it. What we'll find is that when we're using streaming=True, the transformation we just built is lazily loaded. We haven't actually done anything yet; all we've done is pass in this instruction to transform the dataset in this way, but it hasn't actually transformed anything. It only performs the transformation when we call the dataset. So if we iterate over the dataset again, that forces the code to run this instruction, this transformation. Let's run that, and you'll see we actually get an error here. Why is that? Let me come down: answer_start plus the length of the answers text, what's wrong with that? Ah, okay, if we look up here, these items are inside a list, so we actually need to access the first item. But that's good, because we saw that when we first executed this code nothing happened, and it only hit the error when we called the dataset, because that's when the transformation is actually performed.

Now, because we've already added this instruction to our dataset's transformation pipeline, we need to re-initialize our dataset. So we come back up here, load the dataset again to reset all of the instructions we've added, and then we can rerun this, and now it should work, hopefully. Let's see: there we go. Now if we have a look at this (and this is something I probably should have done but completely forgot: I should have added answer_end as a list rather than just a number, but it's fine, because we're only playing around with the Datasets library; if you come across this and need it, you may want to add that in), you can see that we have added answer_end in there now, which is what we wanted to do. Also importantly, if I copy the sample loop and bring it down here, we'll notice that we do still have all of our data: iterating and breaking straight away, you see the whole sample, and we still have the id, the text, the context, everything in there.

Now I'm just going to show you why this breaks, or what happens if I remove those extra features. Let me rerun the loading cell and the map without them; this should look the same, but then if I print a sample, where before it had all the features, now we only have the single feature that we specified in the map function, the answers. So that's why, when streaming is set to True, you need to add every single feature in there; otherwise the map operation is just going to remove them.
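Putting that together, a sketch of the streaming-mode map: in the library version used here a streaming map only keeps the keys you return, so every feature is passed through explicitly, and answer_start and text are indexed at 0 because they are lists:

    import datasets

    dataset_train = datasets.load_dataset('squad', split='train', streaming=True)

    # streaming=True: return every feature, not just the one being modified
    dataset_train = dataset_train.map(lambda x: {
        'id': x['id'],
        'title': x['title'],
        'context': x['context'],
        'question': x['question'],
        'answers': {
            **x['answers'],
            # answer_end = start position + length of the (first) answer text
            'answer_end': x['answers']['answer_start'][0] + len(x['answers']['text'][0]),
        },
    })

    # nothing runs until the dataset is actually consumed (lazy evaluation)
    for sample in dataset_train:
        print(sample['answers'])
        break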
But that's only the case when streaming is actually set to True (I keep saying shuffle; I mean streaming). So let me bring this down here and also copy our initial loading code, because we're going to need to reload our dataset anyway; we just removed all the features from it. What I'm going to do now is set streaming to False, and I'm going to rerun the same map code where we still don't include the ids or anything like that, and we'll see what happens. We'll also notice we get a loading bar here, and it's going to take a little bit of time to process, although with this dataset it will probably be quite fast. You can see it's taking a little while because it's now going through the whole dataset: we haven't called the dataset, but we have used the map function, and when streaming is set to False the dataset isn't lazily loaded, so the map operation is performed as soon as you call it. That's a slightly different behaviour. The other behaviour that's different is that we only needed to specify the answers feature here: when we have streaming set to False we don't need to include every feature in the map operation, only the feature that we are modifying or creating. Which, you know, is weird; I don't know why there's a behaviour difference between streaming True and False, but it's there. If I now take the sample print again, come down here and run it, we see that we have all of our features again. Before, when streaming was True, this code would have left only our answers; the id, title, context and question would all have been removed. But with streaming equal to False they're still there. So it's a weird behaviour, but it's how it is and we obviously just need to deal with it.
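For comparison, a minimal sketch of the non-streaming version of the same map; at least in the library version used in the video, only the feature being modified needs to be returned and the remaining columns are kept automatically:

    import datasets

    # reload without streaming: downloads and caches SQuAD locally
    dataset = datasets.load_dataset('squad')

    # non-streaming map: returning just 'answers' leaves the other columns intact
    dataset['train'] = dataset['train'].map(lambda x: {
        'answers': {
            **x['answers'],
            'answer_end': x['answers']['answer_start'][0] + len(x['answers']['text'][0]),
        },
    })
    print(dataset['train'][0].keys())   # id, title, context, question, answers still present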
The next thing I want to show you is how we can add batching to our mapping process. Typically, for pretty much any NLP task I can think of, we're going to want to tokenize our text, so we're going to go ahead and do that for Q&A. We import from transformers a tokenizer, let's say BertTokenizer, and initialize it the way we typically do: tokenizer = BertTokenizer.from_pretrained('bert-base-uncased'). Then I want to tokenize my question and context in the format that SQuAD-style Q&A models usually expect, and I want to do that using the map function. You can do this in both streaming and non-streaming mode, by the way. So we write, same as before, dataset_train = dataset_train.map with a lambda function, lambda x, and in here we just call the tokenizer. Usually when you write this you would wrap the output in a dictionary, but the output from the tokenizer is already in dictionary format, so we don't need to in this case. For Q&A you pass two text inputs to your tokenizer: your question and then your context. As usual we set a max length, usually 512, set padding equal to 'max_length', and also enable truncation. So it's a very typical tokenization process; there's nothing different going on here, this is what we normally do when we tokenize text going into a transformer model. Then we add batched equals True, which allows us to perform the operation in batches, and we can also specify our batch size, so batch_size equals, let's say, 32. Now when we run this, the map function is going to tokenize our question and context in batches of 32, so let's go ahead and do that. You can see it processing there; that's really all we need to do, so I think that's it for the map method, and I'll fast forward and then we'll continue with a few other methods I think are quite useful as well.

Okay, that's just finishing up now, so we can have a look at what we've actually produced. Coming down here to dataset_train, what do we have? We have answers like we did before, but now we also have attention_mask, input_ids and token_type_ids, which are the three tensors we usually get out of the tokenizer, so we now have those in there as well. Another thing: rather than looping through our dataset (because we're no longer using streaming=True, we're using streaming=False) we can now index into it directly and see the attention_mask. It's not going to show me everything because it's quite large, so I'll just delete that, but you can see that we have the attention mask in there.

Now, say I want to be quite pedantic and I don't like the fact that we have one feature called title; maybe I want it to be called topic, because it's the topic of the context and the question. If I really want to modify that, I can use dataset_train.rename_column. To be honest, you can use it for this of course, but you're more likely to use it when you need to rename a column so it aligns with the expected inputs of a transformer model, for example; that's where you would normally use it, but I'm just using this as an example. So I'm going to rename the column title to topic, and if we print out dataset_train again, where we had title before, we now have topic. Like I said, rename_column can be useful; not really in this case, but generally it is.

What I may also want to do is remove certain records from the dataset. So far we've been printing out samples where the topic (formerly title) is University of Notre Dame; maybe, for whatever reason, we don't want to include those topics. Very similar to before, we write dataset_train = dataset_train.filter; we're going to filter out records that we don't want, and again the syntax is very similar to the map function: a lambda, and in here we specify the samples that we do want to keep. In this case that's wherever the topic is not equal to University of Notre Dame. So we'll run this and have a look at what we produce. Looking at dataset_train, we have the number of rows here, which is almost 88,000, and we should get a lower number now.
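Sketching those three steps together (batched tokenization, the rename, and the filter); I'm assuming here that the SQuAD title strings use underscores, e.g. 'University_of_Notre_Dame', so match whatever value actually appears in your data:

    from transformers import BertTokenizer

    # continuing from the non-streaming dataset above
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # tokenize question/context pairs in batches of 32
    dataset['train'] = dataset['train'].map(
        lambda x: tokenizer(
            x['question'], x['context'],
            max_length=512, padding='max_length', truncation=True,
        ),
        batched=True, batch_size=32,
    )

    # rename the title column to topic (returns a new dataset object)
    dataset['train'] = dataset['train'].rename_column('title', 'topic')

    # keep only the rows whose topic is not Notre Dame
    dataset['train'] = dataset['train'].filter(
        lambda x: x['topic'] != 'University_of_Notre_Dame'
    )
    print(dataset['train'].num_rows)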
This will also run through the whole dataset and then perform the filtering operation, because we have streaming set to False this time. While we're waiting for that, I'll just fast forward to where it finishes. Okay, so now it's finished: before we had about 88,000 rows, and now we have about 87,300. If I take dataset_train's topic column and look at, say, the first five values, they're all Beyoncé now, rather than University of Notre Dame as before.

What we may want to do now, say, for example, if we're performing Q&A inference with a transformer model, is drop the features we don't really need. We'd only need the attention_mask, the input_ids and the token_type_ids, so we can remove the other columns. As always, we write dataset_train = dataset_train.remove_columns and remove everything other than the ones we want, so answers, context, id, question and topic go. Then let's have a look at what we have left, and that's it: we have those final features, and these are the ones we would feed into a transformer model.

There's nothing else I really want to cover; I think that's pretty much all you need to know about Hugging Face Datasets to get started and build, I think, pretty good input pipelines using some of the datasets that are available. So we'll leave it there. Thank you very much for watching, and I will see you again in the next one. Bye!
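And a last sketch of that clean-up: checking the remaining topics and dropping everything except the tensors the model actually consumes:

    # first five topics after filtering
    print(dataset['train']['topic'][:5])

    # remove every column except the tokenizer outputs
    dataset['train'] = dataset['train'].remove_columns(
        ['answers', 'context', 'id', 'question', 'topic']
    )
    print(dataset['train'])   # attention_mask, input_ids, token_type_ids remain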
Info
Channel: James Briggs
Views: 380
Rating: 5 out of 5
Keywords: python, machine learning, data science, artificial intelligence, natural language processing, bert, nlp, nlproc, Huggingface, Tensorflow, pytorch, torch, programming, tutorials, tutorial, education, learning, code, coding
Id: r-zQQ16wTCA
Length: 33min 50sec (2030 seconds)
Published: Thu Sep 23 2021