[Code] PyTorch sentiment classifier from scratch with Huggingface NLP Library (Full Tutorial)

Video Statistics and Information

Captions
Howdy diddly doo! So Hugging Face just released this NLP library, and this is pretty cool because it gives you access to about a hundred NLP datasets and ten evaluation metrics, all prepackaged. Knowing Hugging Face, this is going to be a breeze to work with. So what I thought we would do is try to use it. I have not used it yet, and it's been a while since I've used any Hugging Face stuff, so the plan is: use this library to load up the IMDB dataset and then build a sentiment classifier on top of a BERT model, using PyTorch, and specifically PyTorch Lightning. All of that combined, from scratch. Basically, if I can do it, then so can you. We're going to make some mistakes and have to look at the documentation a bit, and so on, but that's the process.

First of all, if you like content like this, let me know. If you're not subscribed, subscribe. Let me know in the comments if you have any sort of criticism or tips; I'm always happy for tips, honestly.

So I have a pretty empty git repo here; I have a .gitignore, but that's about it. We'll just dive right in, start up Vim, and make a file. First, some boilerplate code. I'm terrible at talking and coding at the same time, but you know. I like to use this absl library, and as you can see I'm using the TabNine completion engine with CoC in Neovim, which is absolutely great. We'll need app, flags and logging from absl, that sounds good. We'll need torch, probably, and we'll need PyTorch Lightning, imported as pl. We'll need the NLP library of course, since we're going to use that, and we need the transformers library. Now, Hugging Face has this tokenizers library too, but there are some tokenizers in the transformers library already, and we'll just keep it like this. Maybe numpy, maybe not, let's see. So we'll have this FLAGS object here, maybe we'll define some flags later, and a main function; let's just log "hello" and run main. So this is our boilerplate, and let's quickly try it out to see whether it works. Here we are: "hello", that's fine.

So where do we go from here? In PyTorch Lightning, what you have to do is build this kind of model class. We'll build an IMDBSentimentClassifier, and that's going to extend the LightningModule of PyTorch Lightning. You need a few different things in a PyTorch Lightning module. First of all you need the __init__, and we'll just do a very basic __init__ that calls super, and that's about it. You need a forward method, since this is a module; in the forward method you're going to get a batch and you have to do something with it. What we also need is a training_step method, which gets a batch and the batch index and has to output some kind of loss, some kind of training procedure. Then we'll need a train_dataloader. All of this you can look up in the documentation of PyTorch Lightning: basically you implement these methods and it does the rest for you. It runs the whole training loop, it handles GPUs and whatnot, the looping over epochs; all of that is basically taken care of when you use PyTorch Lightning. The last thing we need is maybe a prepare_data method; let's put that up here. That method is optional, but it gets called at the beginning, and that's going to be pretty useful for us.
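(For reference, here is a minimal sketch of roughly what the file looks like at this point. The class and method names mirror what is said in the video; treat it as a sketch of the skeleton rather than the exact code from the stream.)

```python
# Boilerplate skeleton: absl flags/logging plus an empty pytorch_lightning module.
from absl import app, flags, logging

import torch
import pytorch_lightning as pl
import nlp            # Hugging Face NLP datasets library (used later)
import transformers   # BERT model and tokenizers (used later)

FLAGS = flags.FLAGS   # flags get defined as we go


class IMDBSentimentClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()

    def prepare_data(self):
        # optional, called once before training; the IMDB dataset gets loaded here
        pass

    def forward(self, input_ids):
        pass

    def training_step(self, batch, batch_idx):
        pass

    def train_dataloader(self):
        pass


def main(_):
    logging.info('hello')


if __name__ == '__main__':
    app.run(main)
```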
I have already downloaded the weights of a BERT model and the dataset, so we don't need to do that on camera. Maybe I've forgotten something, so let's look at the PyTorch Lightning examples; maybe the domain examples, the ImageNet one sounds good. They have way more methods than we need, but down here, basically, what you do is instantiate your model — we won't have these hyperparameters here, those will be our flags — then you create this Trainer and call fit on the model. So let's copy that down here: we instantiate our model, this is our IMDB sentiment classifier, and the Trainer; the root dir, let's call that "logs", we'll give it one GPU if CUDA is available, else 0, and we'll make a flag for the epochs. We don't need the rest of this, and at the end we call fit on the model. So if we had a classifier, this would already run.

Now, what I like to do is import this module called sh, which gives you easy shell commands, and at the beginning of each run, whenever the file loads, I just remove the logs folder and make it again, so I always have a clean logs folder. If we run this right now, it's probably going to give us an error. Right, we don't have an epochs flag yet, so we need to define one, define_integer, and we'll go with ten epochs for now. Very cool.

Okay, we haven't configured our optimizers. In PyTorch Lightning you need some sort of optimizer configuration, and we'll just copy that from an example, going full Siraj here, people. So we need configure_optimizers, and I kind of like SGD for this; SGD tends to work well in neural networks. We don't need the scheduler, we don't need any of that, so let's just return the SGD optimizer with the parameters, and we'll make a flag for the learning rate and a flag for the momentum; we don't need any weight decay here. These will be float flags; for the learning rate, maybe start off with something like this. And I never put help strings if the description is already clear; only losers need help. Don't be kidding yourself, if you put the help string, you need help; that's how it works. I just don't like that this library forces you to put the help string, because it makes me feel bad; it's very opinionated, it basically says, well, you should put something there.
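(Roughly, this is the entry point and optimizer configuration the video arrives at here. It is a sketch: the learning rate and momentum defaults are placeholders, and the Trainer argument names follow the 2020-era PyTorch Lightning API used in the video, which may differ in other versions.)

```python
# Entry point with a clean logs folder, a Trainer, and plain SGD as the optimizer.
import sh
import torch
import pytorch_lightning as pl
from absl import app, flags

flags.DEFINE_integer('epochs', 10, '')
flags.DEFINE_float('lr', 1e-2, '')        # placeholder value
flags.DEFINE_float('momentum', 0.9, '')   # placeholder value
FLAGS = flags.FLAGS


class IMDBSentimentClassifier(pl.LightningModule):
    # ... data and model code from the rest of the video ...

    def configure_optimizers(self):
        return torch.optim.SGD(
            self.parameters(),
            lr=FLAGS.lr,
            momentum=FLAGS.momentum,
        )


def main(_):
    # wipe the logs folder so every run starts from a clean slate
    sh.rm('-r', '-f', 'logs')
    sh.mkdir('logs')

    model = IMDBSentimentClassifier()
    trainer = pl.Trainer(
        default_root_dir='logs',
        gpus=(1 if torch.cuda.is_available() else 0),
        max_epochs=FLAGS.epochs,
    )
    trainer.fit(model)


if __name__ == '__main__':
    app.run(main)
```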
So we have all that, but when we run it now there's nothing to optimize yet; first of all we need the model. Actually, do we need to prepare the data first? Let's check. I have a short snippet here that embeds an IPython shell, and I can just plug it in anywhere to see whether I reach that point. I do reach prepare_data, so let's take care of the dataset first; that's why we're here.

So, the NLP library. There's the usage right here: you can load a dataset, I think even with the appropriate split, and it will just give it back to you; if you don't have it locally, it downloads it. Pretty cool. So we'll just load the dataset; I've already checked out what they have, and they have the IMDB dataset. In the split argument we can say give me the train split, and as a string you can say give me just the first five percent of that train split. Since this is just my laptop, we won't be able to train a super high-grade model, so five percent of the train split will do. This is our train dataset; if you hadn't downloaded it before, it would download here. Looking at the train dataset, it says it is a Dataset, it has one thousand two hundred and fifty rows, and each entry has a text and a label. You can just index it like a normal dataset, and there's the first sample. The label is 1 here, meaning we should predict that this is a good sentiment; it's either 1 or 0, I think, so good sentiment or bad sentiment.

Our first task is basically to get this into a form that BERT can consume. How do we do this with the NLP library? That's the pretty cool part. Right now you see this as text, and we need to map this text into token IDs; we need to tokenize and map to IDs, and Hugging Face of course has very nice libraries for that, called tokenizers. We'll use one of these tokenizers from the transformers library; I think it's called BertTokenizer, which the BERT models can use. Let's check the documentation. Okay, BertTokenizer, there we go, and there's a BertTokenizerFast. We'll take the fast one; come on, be risky. And I think we can just instantiate it with this from_pretrained method and put in the model name. I want to make the model name a flag, so that I'm not bound to a particular model; so this is our model flag, bert-base-uncased, and we have a tokenizer now.

So what we can do is tokenize these things, every entry in the dataset. In a classic setting we'd have to write a loop for that, but with this NLP library it's pretty cool that we can map a tokenize function across the whole training dataset. How do we do that? We have this tokenizer, and I'm pretty sure it has a tokenize or an encode method. Here's forward — no, that's the BERT model; where's the BertTokenizer... right here. It has this encode method; where is its definition, can we click on it? Cool. So encode takes the text and a bunch of other arguments, such as whether or not to add the special tokens, the max length, which is going to be pretty important, and pad_to_max_length, since we want everything to be the same length. If you apply this encode function to the text of one of these samples, say the first sample's text entry, what you get is a list of IDs, which is exactly what we want. The 101 here is the CLS token that BERT takes at the start, and then it's just the word pieces. You could also call tokenize instead of encode, which just gives you the word pieces, not the IDs yet; so those are the word pieces, the tokenized text, and the encode function does that tokenization and then maps the pieces to IDs, so that BERT can consume them.
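(Here is a small standalone sketch of the dataset loading and tokenization being explored at this point, using the nlp library — since renamed to datasets — and the transformers tokenizer API as they existed when the video was made.)

```python
# Load 5% of the IMDB train split and look at the BERT tokenizer's output.
import nlp
import transformers

# the split size goes directly into the split string
train_ds = nlp.load_dataset('imdb', split='train[:5%]')
print(train_ds)        # Dataset with 'text' and 'label' columns, 1250 rows
print(train_ds[0])     # first sample, e.g. {'text': ..., 'label': 1}

tokenizer = transformers.BertTokenizerFast.from_pretrained('bert-base-uncased')

# word pieces only
print(tokenizer.tokenize(train_ds[0]['text'])[:10])

# token ids; the 101 at the front is the [CLS] token
ids = tokenizer.encode(
    train_ds[0]['text'],
    add_special_tokens=True,
    max_length=32,
    pad_to_max_length=True,
)
print(ids)
```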
For this, the NLP library has a convenient map function on its datasets. What we'll do first is define a tokenize function that takes a single sample, runs the tokenizer's encode function on its text entry, and, as we've seen, sets add_special_tokens to true, max_length to a new flag — seq_length or something — and pad_to_max_length to true, so that every sample ends up the same size. Now there are several things you can return from this function; one way is to return the original sample with a new attribute set on it. Let's format this a bit nicer: the tokenize function takes a sample, takes the text, tokenizes and encodes it, puts the result in a new input_ids attribute, and returns the sample. Now we can map this function across the training dataset, like so. This goes over the training dataset and does this for each entry, so hopefully after this operation we'll have a dataset where each sample has not only a text and a label but also an input_ids attribute. We don't have the seq_length flag yet, so let's add it; let's just go with 32, since this is just my laptop, so 32 tokens should be fine.

Hmm, it says "can't pickle tokenizer objects". What it tries to do here is parallelize this thing; if we look at the NLP documentation — datasets, naming, splits, builder, arrow dataset, map, right here — I think this function will try to multiprocess, and therefore it needs to pickle everything that goes into the function, which means this tokenizer needs to be pickled. Maybe there's a keep_in_memory or load argument, maybe there's a way around this; one thing we can try is another tokenizer, maybe that one can be pickled. The dill library is pretty good, but it can't pickle everything. Yes, so the other, non-fast tokenizer can actually be pickled. I'm not entirely sure what you'd have to do with the fast one, honestly, because I don't know the library, but what you could do is make a thread- or process-local variable of it, basically make it a singleton in each process, and then in here call a function that returns the already-instantiated object, and so on, if you really wanted to multiprocess all of this.

Anyway, we have this train dataset now, and you can see the schema has been extended: there is now text, there is label, and there is input_ids, which is a list of int64. That's pretty cool.
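(A sketch of this map-based tokenization step. Following the video, the plain BertTokenizer is used here because the fast tokenizer could not be pickled for map; in the full script the model name and sequence length come from absl flags rather than the constants used below.)

```python
# Tokenize every IMDB sample by mapping an encode function over the dataset.
import nlp
import transformers

MODEL = 'bert-base-uncased'   # FLAGS.model in the video
SEQ_LENGTH = 32               # FLAGS.seq_length in the video

tokenizer = transformers.BertTokenizer.from_pretrained(MODEL)


def _tokenize(x):
    # attach a new 'input_ids' column to each sample
    x['input_ids'] = tokenizer.encode(
        x['text'],
        add_special_tokens=True,
        max_length=SEQ_LENGTH,
        pad_to_max_length=True,
    )
    return x


train_ds = nlp.load_dataset('imdb', split='train[:5%]')
train_ds = train_ds.map(_tokenize)  # applies _tokenize to every sample
print(train_ds)                     # schema now also contains input_ids (list of int64)
```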
Now, since input_ids is still a Python list — I know the tokenizer can already output PyTorch tensors directly, but that's kind of cheating, we want to use this library — the train dataset has a method called set_format, where you say type equals torch, and I think you also need to say which columns you want; you can select which of the columns the samples should expose. Let's check it out. But first, as long as we're just debugging here, I like to add a debug flag; this is usually one of the first flags I define, define_boolean debug, and whenever it's active I try to make everything as fast as possible. The PyTorch Lightning Trainer actually has this fast_dev_run argument which does something similar, but I can push it a bit harder with my own debug flag. So here, let's load only batch_size samples if we're in debug mode, else five percent. We don't actually have a batch_size flag yet, and we're surely going to need one at some point, so let's go with a batch size of eight, just because we can. Ah, and this split argument needs to be a string, chugga-boom. Cool, so it does the fast_dev_run, and in debug mode it loads only very few data points, so this map function doesn't take forever; maybe there's a way to stream it, I don't know, but for now this is pretty good.

If we look at the train dataset again, it still shows input_ids as a list of int64, but if you index it now, if you go to the zeroth data point, it crashes, because it tries to convert everything to PyTorch tensors and it can't convert the text string. So we'll say we just want the columns input_ids and label. Let's try it again: now what we get out is actually a PyTorch tensor for each of these, and not Python lists anymore. So this is now, with duck typing, maybe even by subclassing, effectively a PyTorch dataset, which we can load into a DataLoader; it's a perfectly fine dataset. So we can say self.train_ds is this train dataset.

Now we want to do the same for the test split, but I'm not really in the mood to write all of this code again, so we'll just wrap it: we create a function, prepare_dataset, that takes a split name, uses that split name for loading, and returns the dataset — chicka-boom — and then the train dataset and the test dataset are prepare_dataset of 'train' and 'test'. Excellent, so now we have a training dataset and a testing dataset.

Here in train_dataloader we need to take the training dataset and construct a DataLoader from it. This is super easy; we'll do it in one line. The DataLoader needs a dataset — prepare_data is called at the beginning, so we have this dataset available — and I think we can give it a batch size, and we already have a flag for that. There is also a drop_last argument, which we'll set to true, since we only want full batches during training, and we'll also shuffle. The same goes for the validation data loader for our validation set; in PyTorch Lightning you have train, validation and test, and test is really only for the final, final evaluation, so the test split we have here would be called the validation dataset in PyTorch Lightning terms. Here drop_last is false and we don't want to shuffle, particularly. Okay, so we have a training data loader and a validation data loader.
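(Putting these pieces together, this is roughly what the data side of the module looks like at this point: a prepare_data that loads, tokenizes, and formats both splits, plus the two data loaders. The flags referenced here — debug, batch_size, model, seq_length — are assumed to be defined elsewhere in the script as described above; the split string later grows a percent flag.)

```python
# Data preparation and data loaders of the LightningModule (sketch).
import torch
import nlp
import transformers
import pytorch_lightning as pl
from absl import flags

FLAGS = flags.FLAGS  # debug, batch_size, model, seq_length defined elsewhere


class IMDBSentimentClassifier(pl.LightningModule):
    def prepare_data(self):
        tokenizer = transformers.BertTokenizer.from_pretrained(FLAGS.model)

        def _tokenize(x):
            x['input_ids'] = tokenizer.encode(
                x['text'],
                add_special_tokens=True,
                max_length=FLAGS.seq_length,
                pad_to_max_length=True,
            )
            return x

        def _prepare_ds(split):
            # tiny slice in debug mode, 5% of the data otherwise
            ds = nlp.load_dataset(
                'imdb',
                split=f'{split}[:{FLAGS.batch_size if FLAGS.debug else "5%"}]',
            )
            ds = ds.map(_tokenize)
            # emit torch tensors and keep only the columns the model needs
            ds.set_format(type='torch', columns=['input_ids', 'label'])
            return ds

        self.train_ds, self.test_ds = map(_prepare_ds, ('train', 'test'))

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_ds,
            batch_size=FLAGS.batch_size,
            drop_last=True,
            shuffle=True,
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.test_ds,
            batch_size=FLAGS.batch_size,
            drop_last=False,
            shuffle=False,
        )
```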
Now what do we need? We have the optimizer, very good. All that's left is to actually pass our data through the BERT model. This forward method we've just left unimplemented; maybe we can implement it now. We do need a model, as you can see: this batch — let's follow it right here; if you sometimes don't know what to do, you just go to where you should be. It complains about parameters; we don't have any parameters yet. So what do we do? We go up here and we make a model. We need to actually instantiate the BERT model; from transformers we could use BertModel, but they have a lot of BERT variants, so let's go back to the BERT documentation, because, as you know, BERT is just an encoder, so we'd need to build a classifier on top of it. But they have already done this: there are a bunch of different BERT configurations, and the one we're looking for is BertForSequenceClassification — "BERT model transformer with a sequence classification or regression head on top" — which is exactly what we need, a classifier on top of BERT. And I think we can also load this with from_pretrained and just put in the same model name, so we use BertForSequenceClassification and load up the same checkpoint we used for the tokenizer. That's our model, easy as that.

So what do we do with this BERT model when we put in data? Let's quickly go back to the docs. In the forward method we can pass the input_ids, which is a batch-size by sequence-length tensor; we can pass the attention mask, which basically tells it where there's padding and where there isn't: "mask to avoid performing attention on padding tokens, values selected in 0, 1: 1 for tokens that are not masked, 0 for tokens that are masked". Then we could pass the token_type_ids, which we don't have here, since we just have one sentence; usually in BERT you have the ability to input two different segments, like a question and a paragraph, or a first and a second sentence. Position IDs are optional, and so on; none of that. We could also pass the labels — those are optional too, and it would already compute a loss for us, but that's almost cheating, so let's focus on just putting in the input_ids. I think that's going to be enough, since we truncate our long texts to 32 tokens, so we don't really need to worry about masking here; otherwise you would input a mask — actually, we can do it, we can do it. What you'd input a mask for is basically wherever your tokens are not pad tokens, and the pad token in BERT is 0, so your mask should just be whatever is nonzero. But maybe the model also learns to ignore the pad tokens by itself; I might be wrong that it does that automatically.

So what does the forward pass actually do? Let's go to the training step and put something there so you can see it. If you didn't have BERT downloaded, it would download BERT right here, but since I have it — you can see this is the smaller BERT model. PyTorch Lightning — I don't have enough space in my console right here — would give you a nice overview of your model: how many parameters it has, what kinds of layers, and so on. We also need a validation_step, since we have a validation data loader, and we need the validation_epoch_end function.
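(This is how the model gets created in __init__, following the video: a pretrained BERT with a sequence classification head from the transformers library. The loss object is included here already, although in the video it is only added a bit later when the validation step is written; reduction='none' is chosen there so the same loss works per-sample in validation and as a batch mean in training.)

```python
# Model construction inside the LightningModule's __init__ (sketch).
import torch.nn as nn
import transformers
import pytorch_lightning as pl
from absl import flags

FLAGS = flags.FLAGS  # FLAGS.model is e.g. 'bert-base-uncased'


class IMDBSentimentClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # BERT encoder plus a classification head, weights from the same checkpoint
        self.model = transformers.BertForSequenceClassification.from_pretrained(FLAGS.model)
        # un-reduced cross entropy: per-sample losses for validation, .mean() for training
        self.loss = nn.CrossEntropyLoss(reduction='none')
```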
In training you don't really care about epochs too much, because you just have minibatch after minibatch, but in validation what you want is one single metric across your entire validation dataset. So in the validation_step you output local, per-batch things, and then in the epoch-end function you aggregate them into one big number. We'll just drop an IPython shell into each of them. I'm pretty sure we're going to end up in the validation step first, because, especially with this debug run, Lightning tries to run a validation pass at the very start of training. So we can look at a batch right here. What's a batch? The batch seems to be a dictionary, and its keys are label and input_ids. Pretty cool. The input_ids entry is a tensor of shape 8, which is our batch size, by 32, which is our sequence length, and we should be able to pretty much put that straight into the BERT model that we created. Boom. And what do we get out? A tuple, and the first entry looks like logits. Let's check the shape: it's 8, our batch size, by 2, the logits, one for the negative class and one for the positive class. This is what we can feed into a cross entropy loss, given our labels. We also have our label here, and the labels are all ones — hmm, is the dataset maybe sorted into good and bad examples? Because that would be bad.

In any case, what do we have to do? In the forward method we get the input_ids — let's just say we only get the input_ids — and we run them through our model. We can actually construct a mask here, and the mask is going to be wherever the input_ids are not zero. What does it need to be? This attention mask should be a float tensor, so we'll cast it to float. Our logits are the model output; the output is a tuple with one entry, so the comma here is important, and we return the logits. That's our forward function.

So in the validation step (and later in the training step) the first thing we do is call this forward function with the input_ids from our batch; those are going to be our logits. Then in validation we want to compute our loss, which we have to construct up here in the __init__: the loss is going to be a CrossEntropyLoss — yes, that exists — with reduction set. I don't think there's still the deprecated reduce argument; there's a reduction argument where you can put 'mean' or something, and I like to not reduce the loss at first, because then I can use the same loss object for validation and training. So in the validation step I just compute my loss right here with self.loss, and we'll cheat a bit and look up the cross entropy loss: it takes first the input and then the targets, so first the logits and then the labels. "A criterion that combines log softmax and NLL loss in a single class", nice. Okay, so first logits, then labels; that's our loss. And if we check what this loss looks like, it should be a vector of size eight, because we have reduction set to 'none'.
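(The forward pass and the per-batch validation step described here, as methods of the same module; a sketch assuming the tuple-returning transformers API of the time. The attention mask is simply "token id is not the pad id 0", cast to float.)

```python
# forward and validation_step of the LightningModule (sketch).
import pytorch_lightning as pl


class IMDBSentimentClassifier(pl.LightningModule):
    # ... __init__, data methods, etc. from the earlier sketches ...

    def forward(self, input_ids):
        mask = (input_ids != 0).float()
        logits, = self.model(input_ids, attention_mask=mask)  # output is a 1-tuple
        return logits

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch['input_ids'])
        loss = self.loss(logits, batch['label'])              # per-sample, shape (batch,)
        acc = (logits.argmax(-1) == batch['label']).float()   # 1.0 where the prediction matches
        return {'loss': loss, 'acc': acc}
```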
And yes, the loss is a vector of size eight. Very nice. So we can basically just return this loss as is from the validation step, and then in validation_epoch_end the outputs argument is a list where every entry is the output of one of these validation steps, one per batch, and we can aggregate those. So the losses: we'll concatenate them, since they come in chunks of eight, along dimension zero, and then we calculate the mean. Oh no, we need more: we also want to know the accuracy. The accuracy is going to be whether or not the argmax of the logits is equal to the label, so the accuracy for each sample is 1 or 0, and we want that as a float. So from the validation step let's output a dictionary with loss and accuracy. Excellent. Then in the epoch end we can aggregate; I like to have a little construction here that collects the loss entry from each of these output dictionaries, concatenates them and takes the mean, and the accuracy is the same thing. So our output here is a dictionary, and I think in PyTorch Lightning, if you output the validation accuracy as val_acc, it selects the best model according to that, but I'm not sure. Also, in PyTorch Lightning I can output this here, but if you add a 'log' entry, it forwards that to the logger, which we can actually use and make a TensorBoard logger out of this.

So, what have we done? We've set up the validation step: PyTorch Lightning runs through the data loader, and for each batch it forwards it through the BERT model to get our logits, then we compute the cross entropy loss of the logits and the labels, and we also compute the accuracy by seeing how much the maximum logit agrees with the label, and then we aggregate all of this over the entire epoch and output it.

Now let's set up a logger. For the logger, I think we can pass it to the Trainer: pytorch_lightning.loggers dot... I think there is a TensorBoardLogger. Not sure; this is not the newest version of the docs — I hate these old docs — latest, come on; this was called logging, no, loggers, TensorBoardLogger, right here, nice. So our save_dir is going to be 'logs', and then we want the name 'imdb', and there's also this version argument: if you don't put version=0 it just makes a new folder each time, but I guess we delete the logs folder at the beginning anyway, so we don't have that problem. I generally like to overwrite my logs and not make new runs, but if you like something different, that's fine too.

Alright, let's run this again. Cool, that's the BERT configuration that we loaded, and then: "no attribute logger" — it's pytorch_lightning.loggers, with an s. Okay, again, loading the weights, very cool, blah blah blah, and we're in an IPython shell — do we still have an IPython shell anywhere? Only in the training step. So we're at the training step right now, and we can check: ah, now we have a lightning_logs folder and a logs folder; these appear to be our TensorBoard logs.
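(The epoch-level aggregation and the TensorBoard logger described here, sketched against the 2020-era PyTorch Lightning API the video uses, where validation_epoch_end returns a dict and a 'log' entry is forwarded to the logger.)

```python
# validation_epoch_end plus the TensorBoard logger (sketch).
import torch
import pytorch_lightning as pl


class IMDBSentimentClassifier(pl.LightningModule):
    # ... rest of the module ...

    def validation_epoch_end(self, outputs):
        # outputs is a list of the per-batch dicts returned by validation_step
        loss = torch.cat([o['loss'] for o in outputs], 0).mean()
        acc = torch.cat([o['acc'] for o in outputs], 0).mean()
        out = {'val_loss': loss, 'val_acc': acc}
        return {**out, 'log': out}  # 'log' entries go to the logger


# a tensorboard logger that always overwrites the same run folder; passed to the
# Trainer as pl.Trainer(logger=logger, ...)
logger = pl.loggers.TensorBoardLogger(save_dir='logs', name='imdb', version=0)
```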
So maybe we'll be able to run TensorBoard on this later. Let's try: logs... we don't have TensorBoard? Oh right, I uninstalled it because I was angry at it. Oh, come on, what's going on. I should have TensorBoard somewhere, it's like in local/bin or something... no, it's not in local bin. We'll figure out how to get a TensorBoard; maybe we need to install TensorFlow, and that's going to take a while.

Okay, back to the training step. In the training step we basically need to do the same as in the validation step: we forward our batch through the model, but here we don't need to compute an accuracy; we do need to compute an actual batch loss that we can backpropagate on. Now, in the training step you can either specify how to backpropagate yourself, or you can just output this 'loss' attribute and PyTorch Lightning will do the backpropagation for you. Do we have TensorBoard now? Please... alright, there we go, and we can point it at the logs directory. So this is running, and if everything goes correctly, badaboom, we have a TensorBoard.

So we need to write our training step and calculate a loss for the batch. For the loss we do the same thing as before, but here we call mean on it, so this is the mean loss of this batch, and we can return that loss right here. In the training step you can also output a log dictionary, and we'll output the loss again in there; this is going to be our training loss, let's call it train_loss, and it will also go into TensorBoard. If we run this now, without an IPython shell: simply by outputting this loss attribute we already instruct PyTorch Lightning to run backprop on this loss, using the optimizer we defined, and by outputting the log we instruct it to put this into TensorBoard. So now we have a scalar entry, and you can see it — it only contains two values... no, it contains everything. Very, very cool. So let's remove the debug flag and just see what happens.
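(The training step as just described: the same forward pass, the per-sample losses averaged into one scalar for backprop, and a train_loss entry in the log dict. Again a sketch of the dict-returning API used in the video.)

```python
# training_step of the LightningModule (sketch).
import pytorch_lightning as pl


class IMDBSentimentClassifier(pl.LightningModule):
    # ... rest of the module ...

    def training_step(self, batch, batch_idx):
        logits = self.forward(batch['input_ids'])
        loss = self.loss(logits, batch['label']).mean()        # scalar for backprop
        return {'loss': loss, 'log': {'train_loss': loss}}     # 'log' goes to tensorboard
```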
So, to recap while this trains — epoch one, epoch two, go, go, go. What we've done is set up this PyTorch Lightning module. It needs a bunch of functions, but in the __init__ we've basically just set up our BERT model from the Hugging Face transformers library; we've loaded a pretrained BERT model that we're going to fine-tune. The main things the PyTorch Lightning module needs are a training_step function, where you define what to do with the data, and the data loader functions. In the data loader function we load up a dataset and basically specify the batch size; this is very easy. Where does the dataset come from? We build it in prepare_data, a special function in PyTorch Lightning that's called after the __init__ but before anything else runs. Here we load the dataset from the NLP library, and this is kind of the magic part: we specify the split and the size that we want inside the string, and you can do this in percent or in an absolute number of samples; I'm fairly sure you can do more things, but I haven't explored that. Then we run map on the dataset in order to tokenize it, that's right here, and we use a tokenizer from the transformers library and just run this encode function. It's crazy how complicated this was just a year ago. Then we call set_format, which tells the dataset how it should output its samples: we tell it to output torch tensors and only these columns. We make a train and a test dataset from the train and test splits accordingly. These go into data loaders, and PyTorch Lightning takes the data loaders and runs training on them using the training_step function. In the training_step we get a batch with the two columns we specified earlier, input_ids and label. The input_ids we put through the forward function of the model itself; that's the forward function, where I construct a mask (we wouldn't actually need the mask) and run it through the model, and we get back the logits of the classification. Then we run those through a cross entropy loss, take the mean over the batch, and there we go. The validation step does the same thing, but it also calculates the accuracy, and it doesn't take the mean per batch; only at the end do we concatenate everything and calculate the mean.

If we've done everything correctly... you see right here that our train loss goes down, down, down until it is almost zero, and the validation accuracy is super high. Is this because all the labels are equal? Like, for real? Okay, so we'll do something else: we'll make an integer flag, percent — this was 5, since we loaded five percent of the dataset — but let's load some more; this might take longer, but let's load fifty percent of the dataset and just see what happens. "No precent" — I spelled it precent, very good. So we'll load up fifty percent of the dataset, do the same thing, and track in real time what happens in TensorBoard. "Unrecognized instruction format" — okay, can we make a format string inside a format string? This is nasty; does it work? Please work... we can make a format string in a format string, absolutely bonkers. So it takes a little bit longer now; you could actually speed up this mapping of the dataset, maybe you can stream it, and I'm pretty sure you can do batched processing of it, but for our case right here this is enough. It was about 1,250 samples at five percent, so now it should be something like 12,500.

Let's continue with the recap of what we did. We have the train and the validation dataset, so we have everything like this, and in configure_optimizers you put an optimizer; you could also put a learning rate scheduler if you want to. Then in the main function we instantiate this PyTorch Lightning module and we specify a Trainer — we tell it the max epochs and so on, and we set up the logger — and we just call fit on the model. This runs epochs of the model, and after each epoch it does a validation epoch and minimizes our loss. Very cool, very effective.

So now, if you please... here we go, this is my laptop training BERT. We don't seem to make too much progress; let's check TensorBoard. The training loss goes down... the training loss goes to zero. I have the sneaking suspicion that this dataset is not entirely shuffled. Is there a shuffle argument somewhere? Because this seems a bit fishy, this IMDB dataset; it just seems like we could use a bit of shuffling, because the training loss instantly goes to zero. So can we shuffle here? Let's look at the load_dataset function: there's batched, keep_in_memory... no, none of that.
Okay, this doesn't seem to continue right here; let's look at the nlp datasets source: builder, features, load_dataset... we should find load_dataset somewhere; split... can we find a shuffle anywhere? I'll search for shuffle in the builder. Okay, generate_examples: this function preprocesses the examples, the keys will be hashed... so we are not able to shuffle this just like that. But I hope this at least gives you an impression; I guess if you were to take the full dataset and map it, then it would actually work. We'll just try again with ten percent of the data, just to see it go down. TensorBoard... this is fine, because we always delete the logs folder, so we don't have any remnant old TensorBoard logs. Alright, come on. So ten percent should be about this much... okay, train loss looking good, looking good so far. And look at these models — how large is bert-base-uncased? Hugging Face pretrained models: bert-base-uncased, that's the one we have, 12 layers, 110 million parameters. Easy, easy... oh no, it's too large; the training loss goes to zero again. So we've determined that this dataset very probably isn't entirely shuffled; it might just have all the good labels first and all the bad labels last.

Just to confirm, let's go with one hundred percent, but let's put an IPython shell just before we map the dataset, so we don't have to go through the whole mapping procedure; actually that would be here, right, yes. Can we not map this asynchronously? I might be doing something really wrong with this library, but I think that's how it should go. So, map: right here we can pass batched, we could do batched=True, and then I think Hugging Face has a function to encode whole batches. Batch encode... let's go through the tokenizer: build_inputs, create_token_type_ids, get_special_tokens_mask, encode, right here... is there a batch_encode? No... this might be it: batch_encode_plus. Yes, there is a batch_encode_plus, where you hand in batches of these things. And what if we look at index negative one? See, here's a label of 0, so I'm pretty sure the dataset is sorted. Okay, batched=True, let's do that, and in our function here we'll say batch encode; let's see how fast this is with one hundred percent. "Tokenizer has no attribute batch_encode" — but we just saw it... oh, it's batch_encode_plus, which also takes text pairs; okay, we need this batch_encode_plus, but that gives us a dictionary, right, a dictionary with the field input_ids, so we take that, like this, how about that. We might still want to limit the actual dataset once we have mapped it, because we need to train on it as well, but I just want to see how fast this batch encoding is. Okay, it's reasonably fast, but it still takes like three minutes, so we won't go on here.
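(A sketch of the batched tokenization variant tried at the end: map with batched=True hands the function whole batches, and batch_encode_plus returns a dict whose 'input_ids' entry covers the full batch. The sequence length of 32 is carried over from earlier; in the full script it would again come from a flag.)

```python
# Batched tokenization over the full IMDB train split (sketch).
import nlp
import transformers

tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')


def _tokenize(batch):
    # batch['text'] is a list of strings; batch_encode_plus returns a dict of lists
    batch['input_ids'] = tokenizer.batch_encode_plus(
        batch['text'],
        add_special_tokens=True,
        max_length=32,
        pad_to_max_length=True,
    )['input_ids']
    return batch


train_ds = nlp.load_dataset('imdb', split='train')   # the full train split
train_ds = train_ds.map(_tokenize, batched=True)
```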
I will put this as is on GitHub, and I hope you can profit from it in any way you want. The Hugging Face site has a tutorial on SQuAD where they also use the metrics; they have these predefined metrics like BLEU or ROUGE, I think, and you can just use them the same way you use the datasets, so it's very, very convenient to work with these things. NLP has come a long way, and I absolutely invite you to check out the transformers, tokenizers and nlp repos. And with that, that's it for me. I hope you enjoyed this; again, let me know if you see improvements, or whether I should maybe edit this a bit more. I thought the entire process of just going through and making mistakes would be entertaining to some. Alright, bye-bye.
Info
Channel: Yannic Kilcher
Views: 27,917
Rating: 4.9445729 out of 5
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, code, pytorch, bert, pretrained, lightning, live, tutorial, pip, nlp, transformers, tokenizers, sequence, sentiment, imdb, dataset, full, github
Id: G3pOvrKkFuk
Length: 62min 41sec (3761 seconds)
Published: Mon May 18 2020