Fine-Tuning BERT with HuggingFace and PyTorch Lightning for Multilabel Text Classification | Train

Captions
Hey guys, in this video we're going to continue with the example in which we are fine-tuning BERT for toxic comment classification. I did a bit more preprocessing, because the training process took a bit too long for my liking. The preprocessing is based on EDA, or exploratory data analysis, done by this guy right here, Shishir Kumar, who made a very good notebook on toxic comment classification. The most important thing to note there is that the dataset is very imbalanced: the vast majority of the comments are actually clean. Using this observation, and something else as well, namely that the mean text length of each comment is below 128, we get a way to choose the number of tokens we're going to use when tokenizing the comment text.

Using both of those data points, I did something a bit different. After loading the CSV and looking at it, the first thing we did was to check the count for each label, and you can clearly see that the training DataFrame contains a lot fewer tagged examples if you combine all the counts. We have around 151k comments in total, but the combined tag counts are far lower than that; since this is multi-label classification, most comments don't have any tag at all. So one thing we can do is select all the comments that contain toxic text: I take all the rows in the dataset that contain at least one tag, computing the sum of the tag columns per row. You can see the result right here: rows whose tags sum to zero are obviously clean, and rows with at least one tag are toxic in some way. Taking every example with at least one tag gives us the toxic comments, and we have only about 15k of them, roughly 10 times fewer than the total. Similarly, I can get the clean comments: if the sum of the tags is equal to 0, the comment is of course clean. Comparing the shapes of both, we have about 136k clean comments, which vastly outnumbers the toxic ones.

So one thing I can do is create a new training DataFrame by taking all the toxic examples and only a sample of 15k examples from the clean comments. Now the ratio between the two is going to be quite different: we'll have roughly the same number of toxic and clean comments. I also want to show you that the sample stays the same across runs; it's a bit hard to see everything, but we'll at least check the first number. This is a double-check I usually do when preprocessing, because most of the time I get it wrong the first time, and if you screw up the data processing steps, your model will either produce gibberish or, even worse, fool you into thinking it performs much better than it actually does. Now that we've verified this, I'll delete the demo sampling cell, and the rest stays the same. Once again, this balancing is done because the BERT model was taking quite a while to train on all of those clean examples. One thing you should notice is that we are keeping the validation set as unchanged as possible: we just take a random sample of all the comments for it.
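The balancing steps just described can be sketched with pandas. This is a minimal sketch based on the video, not the author's exact code; the column names and the 15k sample size are taken from the discussion above, and `balance_dataset` is a hypothetical helper name.

```python
import pandas as pd

# Tag columns as in the Jigsaw toxic comment dataset
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def balance_dataset(df: pd.DataFrame, n_clean: int, random_state: int = 42) -> pd.DataFrame:
    """Keep every toxic row, but only a sample of the clean rows."""
    tag_counts = df[LABEL_COLUMNS].sum(axis=1)
    toxic = df[tag_counts > 0]     # at least one tag set
    clean = df[tag_counts == 0]    # no tags at all
    sample = clean.sample(n=min(n_clean, len(clean)), random_state=random_state)
    return pd.concat([toxic, sample]).reset_index(drop=True)
```

A fixed `random_state` is what makes the "first number stays the same" double-check from the video possible.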
We don't do any post-processing on the validation set, which is great because it keeps the original distribution of the comments, so we're going to evaluate our model on a realistic distribution. Keep in mind that you should keep this distribution as original as possible, as we're doing here, no matter what you do to the training set. Now that this is complete, I'm going to restart the runtime and run everything; this will redo the preprocessing steps and give us a good starting point to continue with our example.

All right, now that everything has run properly, I'm going to get rid of this cell and define the number of epochs for which we're going to fine-tune the BERT model, which will be 10 (you might try different numbers, of course), and the batch size, which is going to be 32. I'll pass the batch size to the toxic comment data module, which sets it up with a batch size of 32, and I'll also remove the leftover demo cell.

Next, we're going to enter some modeling territory. We are going to evaluate the model we're fine-tuning using binary cross-entropy loss. Basically, this loss tells us how close the prediction for each possible tag is to the correct value: given that your model makes a perfect prediction for each tag, whether it should be applied or not, you should get a number that is zero, or as close to zero as possible. We will of course try to minimize that loss, and the way to do that is to use BCELoss, the binary cross-entropy loss provided by the torch.nn module. There's quite an extensive explanation of what it does, and one thing you should notice right here is that your targets, the labels in our case, need to be numbers between 0 and 1.
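The knobs set up above can be summed up in a short sketch; only the epoch and batch-size values come from the video, while the training-set row count is a placeholder standing in for `len(train_df)` after balancing.

```python
N_EPOCHS = 10
BATCH_SIZE = 32

# Placeholder: the real value is len(train_df) after balancing
n_train_rows = 30_000

# These two numbers feed the linear warm-up scheduler later on
steps_per_epoch = n_train_rows // BATCH_SIZE
total_training_steps = steps_per_epoch * N_EPOCHS
```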
The second thing here is the example in the documentation: it uses a sigmoid together with BCELoss, generating two tensors with some random numbers and passing one through the loss. You should notice that the input is passed through a sigmoid function first, so essentially we are giving BCELoss a probability estimate for each of the possible labels we have; at least for me, this wasn't immediately clear.

Another thing you should look at is BCEWithLogitsLoss. What it does is combine the sigmoid function with the binary cross-entropy loss. We're not going to use it here, because we will output the result of the sigmoid function directly as the prediction of our model, so we don't have to apply it after the loss is computed. You can try modeling this with BCEWithLogitsLoss, but you should know that this is what it does. And it's not exactly equivalent, because the documentation states that this version is more numerically stable than a plain sigmoid followed by BCELoss; in some cases your training, or maybe even prediction, might blow up because the numbers going into the sigmoid and BCELoss are not numerically stable. So even though that might be the better way to do it, I've gone with BCELoss and the explicit sigmoid, because it shows what's happening under the hood, and at least while you're learning, I think that's a great thing.

So let's look at an example of how BCELoss is going to be used. I'll instantiate an instance of BCELoss, and we'll imagine that we have some prediction from our model: a FloatTensor with some made-up numbers, to which we are not applying any sigmoid yet (I'll show you the result of applying it in a second). Then we have some labels, which would of course come from our dataset; note that we're using float values here as well, and the numbers are between zero and one; actually, they are only zeros and ones in our case. If I apply the sigmoid function to the prediction, you can see that each value now falls between 0 and 1 as well, which is great. So basically we pass the result of applying the sigmoid to the prediction, together with the labels, into the criterion, and the output is a loss which is not as close to zero as it might be. If I fiddle with one of the prediction values, the loss goes lower; if I change this other number as well, the loss goes lower still. What will happen during training is that BERT, and our model in general, will learn to adjust those numbers so that, after the sigmoid, they match the labels.

One thing the Kaggle competition uses for evaluating the performance of your model is the ROC AUC: the area under the curve below the ROC, the receiver operating characteristic. To understand what this is, there's a very good article on Wikipedia, which I'm going to link in the notebook, that explains what the ROC curve actually is.
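The loss computation in the worked example can also be written out by hand, without PyTorch, to see exactly what BCELoss computes. This is a minimal sketch: the logit values are made up, like the ones in the video, and the epsilon clamp is an assumption to guard against `log(0)`.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(probs, targets) -> float:
    """Binary cross-entropy averaged over elements, matching
    torch.nn.BCELoss's default 'mean' reduction."""
    eps = 1e-12  # clamp so log() never sees exactly 0 or 1
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

# Made-up raw model outputs (no sigmoid applied yet) and their labels
prediction = [0.02, -1.5, 3.0]
labels = [0.0, 0.0, 1.0]

probs = [sigmoid(x) for x in prediction]
loss = bce_loss(probs, labels)
```

Pushing a logit further toward its label drives the loss toward zero, which is exactly the fiddling shown in the video: `bce_loss([sigmoid(5.0)], [1.0])` is lower than `bce_loss([sigmoid(1.0)], [1.0])`.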
An even better explanation, I think, is the machine learning crash course by Google, and the scikit-learn page on the same subject as well. But what is ROC, and why should we care about it? ROC is basically a curve that describes how well a binary classifier does, and we're interested in binary classification here because we want each possible tag to have a predicted value as close as possible to its label. So it will tell us how well each of our labels is predicted across our examples: the toxic tag, and likewise severe_toxic, obscene, threat, insult, and identity_hate; each of those will be evaluated individually using the ROC curve.

One important thing about the ROC curve is that it is just a plot defined by two numbers: the true positive rate and the false positive rate, which trace out a curve that can of course be plotted. The ROC AUC, the area under that curve, is a single number that goes from 0 to 1, though most of the time it falls between 0.5 and 1: a value of 1 is the best possible classifier, and the closer you get to 0.5, the closer the classifier is to a random number generator. So you want this curve to push the area as close to 1 as possible.

If you're not familiar with these terms: the true positive rate starts from the number of true positive predictions, meaning the prediction is positive and correct. So if we have an example in which the toxic tag is present in a given text and the model actually predicts it, that's a true positive. In the denominator, we add to the true positives the number of false negatives, cases where the example is positive but our model is unable to catch it. Next we have the false positive rate; a false positive is when, say, the toxic tag is not present in an example but our model predicts that it is. That goes in the numerator, and in the denominator it's added to the true negatives: a true negative would be when the toxic tag is not present in a comment and our model also says it is not present. If we compute those two rates and plot them, we get a curve similar to this one, and the area below this chart is the area under the curve, a number between 0 and 1 that you should try to maximize. We are of course going to use a ROC AUC score to do that, and PyTorch Lightning has an implementation for it, so we're going to use it. So this is pretty much how we're going to evaluate our model: we'll use BCELoss for training, and then the ROC AUC score to see how well the model is doing.

Now that we have a plan for how to evaluate our model, I'm going to show you a way to fine-tune BERT, or even some other model, using a learning rate scheduler. For our example we'll use something that's pretty standard for BERT, I guess: the linear scheduler with warm-up. We set some number that is going to be the learning rate, and what this scheduler does is go from zero up to the learning rate we've specified over the number of warm-up steps we pass as a parameter; then, over the total number of training steps, it goes linearly from the learning rate back down to zero. That's the scheduler we're going to use; of course, you can try some other schedulers, and there are quite a lot of examples in the Hugging Face Transformers library that you might want to have a look at.
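For intuition, the AUC described above can be computed with a toy rank-based implementation; this is not the Lightning `auroc` function used later in the video, just a sketch of the underlying math (the rank-sum form of the Mann-Whitney statistic).

```python
def roc_auc(labels, scores) -> float:
    """Area under the ROC curve via the rank-sum formula: the probability
    that a randomly chosen positive example is scored higher than a
    randomly chosen negative one. Tied scores get averaged ranks."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to runs of tied scores
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A perfect classifier scores 1.0, and a classifier that cannot separate the classes at all scores 0.5, exactly as described above.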
With all that information in mind, we're going to continue by creating the model itself, and I'll do that with a class we're going to call ToxicCommentTagger, or classifier, however you want to name it. This will extend LightningModule, which is the PyTorch Lightning alternative to PyTorch's nn.Module, and it's a bit different in that we're going to define training_step, validation_step, test_step, and training_epoch_end methods, which are not standard for plain PyTorch training.

But let's start with the constructor. It takes a number of classes, the steps per epoch, which is required for our linear scheduler, and the number of epochs we're going to train for, which again is required for the linear schedule with warm-up. Here I call the super constructor and initialize the BERT model, which I've already loaded, saving it in an instance attribute of our class. Then I create the classifier head: a linear layer that serves to take the outputs of the BERT model and convert them into the number of classes we want to predict. Its input size is the hidden size of the BERT model; let's check how many units BERT base has: 768 hidden units, so that's the input, and the output is the number of classes. I also save steps_per_epoch and the number of epochs, and finally I initialize the criterion, which is of course the one we discussed: the binary cross-entropy loss.

Next we define a forward method, which takes the input IDs, the attention mask, and the labels. It might seem a bit strange to you that we're computing the loss right here in the forward method, but this is pretty much standard in all the tutorials you'll see around the net on fine-tuning BERT: the labels will be None when you're doing inference, and some value when you're training, in which case you need to compute the loss right here. So I pass the input IDs and the attention mask to the BERT model; then I run the linear layer, not over all the outputs, but over the pooled output from BERT, and apply the sigmoid function, just like what we did in the loss example. I define the loss as 0, and if we have labels, I compute the loss itself, again just like before. Then I return the results: first the loss, then the output, the prediction of our model.

Next we define another method, as I've already mentioned: the training step. It gets the batch and the batch index from our dataset, and from the batch I take the input IDs, the attention mask, and the labels; this is basically what a data loader returns in our case. Next I pass all of this through our model, calling it directly, which calls the forward method for us with the input IDs, the attention mask, and the labels, and I take out the loss and the outputs. Okay, so we're almost done with this. Now I do something that is specific to PyTorch Lightning: I call the self.log method, pass in "train_loss" together with the loss, and say that I want to observe this value in the progress bar and the logger. When we're actually fitting the model, you'll see that this value gets printed and updated as we train. The result of the training step is a dictionary containing the loss, the predictions (which are the outputs), and the labels; you'll see why we need those in a second.

All right, so this is one training step, and we need to do pretty much the same thing for the validation and test steps, just fiddling with the names and what we return. Oh, sorry, I had a typo here: this should be validation_step, and here I want to log the validation loss and just return the loss; yeah, that should do it. And exactly the same thing, with just minor adjustments, goes for the test step, where I log the test loss. Of course, you can extract the common code and make it look a bit nicer, but for us this will do for now.

Now that we have the training, validation, and test steps, we define the final method, which is a bit non-standard, I would say, and a bit specific to our example: the training_epoch_end method, in which we're going to compute the ROC AUC score. Let's define it. It takes the outputs of the training steps, each with exactly the dictionary form we returned, and I want to collect the labels and the predictions. I iterate over all the outputs, which are batched, so you need to expand or flatten the results; I detach them and move them to the CPU, because the training itself runs on CUDA, the GPU, and I append the labels, then do the same thing for the predictions. Now that we have the labels and the predictions, I convert them into flattened torch tensors using torch.stack, so we end up with two tensors: one with the labels and one with the predictions.
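The gathering logic in training_epoch_end can be illustrated without tensors; here plain lists stand in for the detached, CPU-side tensors, and `gather_epoch_outputs` is an illustrative name, not the method from the video.

```python
def gather_epoch_outputs(outputs):
    """Flatten the per-batch dictionaries returned by training_step into
    one flat list of per-example labels and one of per-example predictions."""
    labels, predictions = [], []
    for output in outputs:
        for example_labels in output["labels"]:
            labels.append(example_labels)
        for example_preds in output["predictions"]:
            predictions.append(example_preds)
    return labels, predictions

# Two hypothetical batches of two examples each, six tags per example
batches = [
    {"loss": 0.4,
     "predictions": [[0.9, 0.1, 0.8, 0.0, 0.7, 0.1], [0.1, 0.0, 0.2, 0.0, 0.1, 0.0]],
     "labels": [[1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]},
    {"loss": 0.3,
     "predictions": [[0.2, 0.1, 0.1, 0.0, 0.3, 0.0], [0.8, 0.6, 0.9, 0.1, 0.9, 0.2]],
     "labels": [[0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0]]},
]
labels, predictions = gather_epoch_outputs(batches)
```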
To compute the score, I iterate over all the label columns, taking the ROC AUC for each individual tag in our labels. I call auroc, the area-under-ROC function, and let me show you where it's defined: "compute area under the receiver operating characteristic"; this is part of PyTorch Lightning, which even links the source code if you're interested, and I believe they have an example of how to use it; it's very similar to what scikit-learn offers. So this runs over the predictions at the column index of the tag we're currently interested in, and I pass in the labels for that tag. Then I log the result to TensorBoard under the name of the tag, and this will be for the training set only for now, so this is the ROC AUC score, and we pass in the current epoch. What this does is evaluate the training set and tell us the ROC AUC score for each individual tag.

Last but not least, we need to define one more method: configure_optimizers, which configures the optimizer as well as the learning rate scheduler; this is where we use all the scheduler parameters we passed into the constructor. First I start with the optimizer, which is AdamW, Adam with weight decay fix; we pass in the parameters of the model, pretty standard, and the learning rate is going to be this number right here, taken straight from the fine-tuning recommendations in the original BERT paper. Next I calculate the warm-up steps required for the linear schedule with warm-up, and I'm going to use roughly one third of the steps we do during an epoch for warm-up. So for the first third of an epoch, the learning rate goes from zero up to the peak, something like this point here, and from that point on it goes linearly back down to zero. I pass in the warm-up steps, which are all the steps we do per epoch divided by three, and the total number of steps we're going to complete, which is steps_per_epoch multiplied by the number of epochs. So the warm-up won't actually be a third of the whole training; it's a much smaller fraction, because the total is multiplied by the number of epochs, but this should be fine for our example. The scheduler is then the result of get_linear_schedule_with_warmup, to which I pass the warm-up steps and the total steps. The return syntax here might look a bit funky: we need to return a list of optimizers and a list of schedulers, which in our case just contain our one optimizer and one scheduler.

If we run through this, it gives us a complete model. It's quite a large number of lines, so the model is big; I know, that's what she said, but yeah, this model really is big. So let's create an instance of our model: ToxicCommentTagger with 6 classes, because we have six tags; steps per epoch equal to the length of the training DataFrame divided by our batch size; and, the last parameter, the number of epochs, which is the constant 10 that we already defined. Instantiating this will go ahead and download BERT if it isn't already downloaded; in our case it is. Now I'll show you what an untrained model returns as a result: what we need is the sample item that we prepared in the previous video, and we pass its input IDs and attention mask to the model.
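The shape of the warm-up schedule configured above can be sketched as a pure function. This mirrors what `get_linear_schedule_with_warmup` computes as a multiplier on the base learning rate; it's a sketch for intuition, not the Transformers implementation itself, and the step counts below are illustrative.

```python
def linear_schedule_with_warmup(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: rises linearly from 0 to 1 over the
    warm-up steps, then decays linearly back to 0 at total_steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Illustrative numbers: 937 steps per epoch, 10 epochs, a third of one
# epoch as warm-up, peak learning rate from the BERT paper's recommendations
steps_per_epoch = 937
warmup_steps = steps_per_epoch // 3   # 312
total_steps = steps_per_epoch * 10    # 9370
peak_lr = 2e-5
lr_at = lambda step: peak_lr * linear_schedule_with_warmup(step, warmup_steps, total_steps)
```

Plotting `lr_at` over the steps gives exactly the triangle described in the video: up from zero for the first third of epoch one, then a long linear decay back to zero.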
The model returns a loss and predictions; note that we're not passing the labels here, so this gives us the predictions for each possible tag. You can see we get a tensor with six values, and those are essentially random at the moment, because we haven't done any fine-tuning whatsoever. But that's going to change right now, because we're going to train the model, and here comes the power of PyTorch Lightning: you basically don't have to write any more training loops. You just define the training, validation, and test steps, and most of the other stuff, like iterating over the batches and the data loaders, or taking care of moving the data and the model to some device (the GPU, or whatever you're training on), is handled for you. To simplify all this, Lightning defines the Trainer, which we're going to instantiate right now. It takes a number of max epochs, where we pass in the number of epochs we defined; the number of GPUs, and we have only one; and how often the training output is refreshed. You need to set the refresh rate because Google Colab basically won't work as expected if it's lower than 20, I believe. If we run this, it tells us that we have one GPU available; it shows it as device [0], which is the only GPU we have, and it's going to use it because we said gpus=1. Next I start the training by calling trainer.fit, which accepts two parameters that we care about: the first is the model, and the next is the data module, which defines how the data should be loaded and used during training. If you run this, we should hopefully see a summary of the model right here, and then the actual training will start.

Of course, you can use many more helpers; what we have here is quite good, but Lightning lets you specify, say, an early stopping callback: if you're monitoring the validation loss and it doesn't go lower for some time, you might stop the training. You can do a lot of things with PyTorch Lightning, but I'm not going to focus on that in this video; if you want to see more about how to use it, let me know in the comments below, and I might do a deeper dive into all the possible callbacks, checkpoints, logging, and TPUs and multiple GPUs. Of course you can use multiple GPUs right here; it's as easy as changing this number if your machine contains multiple GPUs. And you can see that this is already training: you probably can't see all of it because I'm using a dark theme, but it's printing the loss, the validation loss, and the train loss. I'm going to stop the training here and show you the results of my pre-trained model instead.

All right, let me show you what TensorBoard gives us. You can load the TensorBoard extension and run TensorBoard with the lightning_logs folder as a parameter. When the training starts, it creates this folder right here; in it you have your first experiment (you can change the name when defining the trainer), and inside you'll find the TensorBoard events and the checkpoints that PyTorch Lightning creates for you. This is a pretty much identical model to the one I've shown you how to build, and you can see that I've trained it for 10 epochs. It logs a lot of metrics, but most importantly we have the ROC AUC for each individual tag, and you can see that at the beginning the score was low, for example for identity_hate, but during training those numbers got better and better. Of course, you might want to train this model for longer; that might give you better results. For insult, for example, you can again see a huge jump in performance.
As for the training loss, you can see that it goes from a rather large value to something very, very reasonable, and the validation loss, which is what we're more interested in, goes from 0.715 to 0.097, which I believe is quite good; you can observe that it stays quite low at the end of the training. You can also see that I'm using trainer.save_checkpoint to save the model. So now that I've shown you how to train your model, I'm going to show you how you can use it to get some predictions, and to do that I'll go ahead and download the pre-trained checkpoint. This takes the toxic comment classifier checkpoint from my Google Drive folder; I'm not sure I'm going to keep it up, because as you can see it's 1.3 gigabytes, so it's again a very large model. I create an instance of it by calling ToxicCommentTagger.load_from_checkpoint, passing in the checkpoint, which is now available here in our local files, and I have to specify the number of classes that we have. After it's instantiated, I call trained_model.freeze(), which prepares our model for inference, or faster inference. After this is done, I start with a test comment; let's say we have something very toxic, like "I really hate you". Next we use the tokenizer to encode the text: I take the encoding call from before, pass in the test comment, change the maximum length to 128, and run it, so this is now the encoded test comment. After that, I take the prediction from the trained model by passing in the input IDs and the attention mask, and looking at the prediction, you can see there are numbers quite close to one; those are the tags our model thinks should most probably be applied to this comment.
Some of the other tags have very low probability. So one post-processing step I'm going to do is convert the test prediction to a NumPy array, and you'll see why in a second: I flatten it, because as you can see we have a tensor of tensors (the model works with batches, of course), and call numpy() on it. This gives us a flat array of floats with the same values. Next I convert those values to the labels we have for each tag, and I do that by iterating over the predicted probabilities together with the possible labels and applying a threshold: we enumerate over all possible tags, take each label's probability, and if that probability is larger than some threshold, 0.5 in our case, I say that the model thinks this label should be applied to this text, this comment. So for our example, "I really hate you", our model believes the labels that should be applied are toxic, obscene, and insult, which is very good, I guess.

Finally, I'm going to wrap all of this into a simple function you can call, named classify, which I showed you at the start of the previous part. It takes a few more parameters than I showed at the beginning of the previous video: the comment text, the model, the tokenizer, the label names, and a threshold with a default of 0.5; you might want to set it a bit higher, but that's up to you and your business case, I guess. I start by encoding the text; after the encoding is complete, I get the prediction from it by calling the model we pass into this function; next I apply the same operations, flattening it and converting it to a NumPy array; and then I extract the predicted labels, applying the threshold parameter, looking at the label names parameter, and taking each label's probability from the prediction. The resulting labels are basically the return value of our function.

Now that we have this, I'm going to try it out with the example text that we started the series with, classifying it with the trained model, the tokenizer, and the label columns, keeping the threshold as it is. Running this, you can see we get an error; let me check it... yeah, I forgot the enumerate here. Okay, so for this text we get toxic, obscene, and insult, which is very good, I believe. One interesting thing you might want to try here is passing in clean text, a text that doesn't contain any toxicity or bad words. For this example I'm going to use a phrase I say when I'm really, really happy, from one of my favorite TV shows, about making it and being good at supplier relations. So this is a text that is supposedly clean; it doesn't contain toxicity, and I basically call the same function with the same parameters. It should return an empty array, meaning there shouldn't be any toxicity, and you can see that this is the case. So I believe this model has been fine-tuned very well. You can try fiddling with the parameters of the linear rate scheduler, you can even go ahead and train it for longer, you can try other BERT models, or ALBERT, or something completely different, I guess, and you might get even better results. But I am continuously blown away by the performance of these BERT and Transformer models in general on various NLP tasks; of course, the power comes from the pre-trained, let's say, context in them, and I'm really glad that these models are open source.
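The thresholding step at the heart of classify can be sketched as a small pure-Python helper; the probabilities below are made up, and `labels_above_threshold` is an illustrative name, not the function from the video.

```python
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def labels_above_threshold(probabilities, label_names, threshold: float = 0.5):
    """Return the names of the labels whose predicted probability
    exceeds the threshold."""
    return [
        name
        for name, probability in zip(label_names, probabilities)
        if probability > threshold
    ]

# Made-up flattened prediction for a comment like "I really hate you"
prediction = [0.98, 0.02, 0.81, 0.01, 0.92, 0.05]
predicted_tags = labels_above_threshold(prediction, LABEL_COLUMNS)
```

Raising the threshold trades recall for precision, which is the business-case decision mentioned above.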
The Hugging Face Transformers library is available for free, and you can basically do some very cool stuff with it. So thanks, guys, for watching; I hope you really enjoyed this series, and again I want to say thanks for hitting 5k subscribers. I'm going to share a notebook with everything I've shown you in the description below, and I'm going to create a blog post detailing everything we've done so far; of course, the blog post might contain even more, or even better, information, and I'll reference all the stuff I've talked about here. Thanks for watching; please like, share, and subscribe. I'll see you next time. Bye.
Info
Channel: Venelin Valkov
Views: 6,703
Rating: 5 out of 5
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning, PyTorch, BERT, HuggingFace, Transformers, NLP
Id: UJGxCsZgalA
Length: 56min 52sec (3412 seconds)
Published: Mon Dec 14 2020