Text Classification using BERT | Google Colab

Video Statistics and Information

Captions
It's a Saturday evening, my girlfriend is out, my dog is asleep, so let's do a BERT tutorial. I'm going to try to do this as fast as possible, because there's quite a lot to cover, and I've left links all over the notebook, so you can check those things separately after the video or during it. By the end of the video, what I want you to do is go to Twitter, copy a random tweet, preferably from Trump, come back here, paste that tweet, and get an answer on whether it's positive or negative. In this case it's a negative one: "Sleepy Joe Biden is proposing the biggest tax hike in our country's history" does seem to be negative. If you agree with that, slap a like; if you don't, just argue with me in the comments down below.

What we're using here is Google Colab. We used it for both the logistic regression and the simple neural network tutorials I uploaded over the last two weeks; if you want to watch those, and I highly suggest that you do, I'll leave a playlist link somewhere around here. Colab itself is a Jupyter notebook server on the Google Cloud which you can access through your Google Drive at any time, from anywhere, as long as you have internet. It's a great way to start learning machine learning, since you don't need to set anything up: you just come in, run your code, and see what it does. If you want to know more about Google Colab, there's a video on that as well.

Our notebook has the imports, the data for training, the model part where we set up the model, and then essentially the test part, which I just showed you. For the imports, we start by installing a couple of packages. Since we're using BERT from TensorFlow Hub, so an already trained model which we're going to fine-tune today, we need to install tensorflow-hub, and since we're using Google's BERT model from TensorFlow, we need to install the TensorFlow models package as well. Having said that, our imports are vastly different from the ones we used for the simple neural network tutorial and for logistic regression, just because of those additional packages. At the end of the imports I added some code, not mine, taken from the internet, that checks whether a GPU is available to you. If you see here that a GPU is not available, it means you need to change your runtime to a GPU one. As I said, if this is your first time using Google Colab, go watch the Google Colab video; it explains this in a bit more detail.

After the imports we want to read our data and prep it for training. We're going to use the same data set we used for the last two tutorials: Sentiment140 from Kaggle, with 1.6 million tweets already annotated as either positive or negative in the target variable. What I want you to do is download the data set, then open up my notebook, which I'll leave in the description down below along with all the other notebooks: the logistic regression one, the simple neural network one, this one for BERT, and a modified BERT one as well. After downloading the data you need to upload it to your Google Drive to use it in the notebook, and just change this path right here to your own path from your own Google Drive. Then read the data and print out the first couple of lines; we do that just to check that the data was read correctly, what the variables are, and how the data looks. We see a couple of variables here, namely the ones we're interested in: target and text. Target is the sentiment: if a tweet is positive it's a 4, and if it's negative it's a 0. Text is the text of the tweet, which we're going to use to predict that same sentiment.
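Here is a minimal sketch of that setup step, assuming the Sentiment140 CSV has been uploaded to Google Drive; the package versions, file path, and column names are my assumptions, not the notebook's exact code:

```python
# Packages needed on top of a standard Colab runtime: BERT from TF Hub plus
# the official models package for its tokenizer. Versions are not pinned here.
!pip install tensorflow-hub tf-models-official

import pandas as pd
import tensorflow as tf

# Check that a GPU runtime is active (Runtime -> Change runtime type -> GPU).
print("GPU available:", tf.config.list_physical_devices("GPU"))

from google.colab import drive
drive.mount("/content/drive")

# Sentiment140 ships without a header row; these column names follow the
# Kaggle description of the data set.
columns = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(
    "/content/drive/MyDrive/sentiment140.csv",  # hypothetical path, use your own
    encoding="latin-1",
    names=columns,
)
print(df.head())
```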
If you check the number of classes in the sentiment variable, you'll see we have a 0 and a 4. I get it: when you have a 0 and a 4 you'd expect something in between, like a 2 for neutral, but you don't, so we're just going to stick with that. Before any classification you want to check whether the classes are equally distributed, and you can see that yes, negative and positive tweets are distributed equally within the data set. Why do we want equal classes? Just imagine a situation where 95 percent of the tweets are negative and only five percent are positive: at the end you'd get a model which is depressed and says that every tweet you give it is negative, and we really don't want that.

To start modeling we need to do a couple of things: split the data into training and testing subsamples, encode our labels (so our classes), and tokenize the text of the tweets. We start with splitting the data, and for training we're going to use only five percent of all of our data. Why? For one, we have 1.6 million tweets, so that's quite a lot, and it would take quite a lot of time to train on the whole data set. Second, we don't really need that much data: five percent should be quite enough to train our model for a couple of epochs and get at least the same result we had with the simple neural network, where we used 10 percent, and logistic regression, where we used all of the data. So we split the data set into training and testing; we need the testing part to check whether the model is actually learning the patterns within the data and not just memorizing the data itself, and without a testing sample that wouldn't be possible.

After the split we need to encode our labels, so our sentiment. This just checks what the unique classes within the sentiment are and then transforms the variable into indexes of the classes instead of the class values themselves. We use one-hot encoding here just because I figured you might want to reuse this notebook for multi-class classification, where there are three or more classes; with binary classification you could do it in a simpler way, but I want this notebook to be reusable for you, so we're sticking with one-hot encoding. I'm saving the encoder here because we're going to use it later on, when we fine-tune the model a bit more after the initial training. And since we already used it, on the testing side we read the encoder back and load its classes variable to see what the indexes of the classes are.

Having said that, let's move on to tokenization. Differently from the simple neural network we did last week, BERT needs not one input but three. One of those inputs is pretty much the same as for the simple neural network: tokenized tweets, where you get arrays of token IDs instead of the tweets themselves. The second input is the input mask, where we show the model where the actual inputs are. The third input is the input type IDs; we don't really have multiple types here, so segments like one sentence and another, so instead we just show the model where the actual inputs are as opposed to special tokens like classification or separation.
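A hedged sketch of the split and label-encoding steps, reusing the data frame from the previous snippet; the 5% sample size matches the walkthrough, but the random_state, file name, and the use of scikit-learn's LabelEncoder are my assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Use only 5% of the 1.6M tweets for the first training session, stratified so
# both classes stay equally represented.
df_train, df_test = train_test_split(
    df, train_size=0.05, test_size=0.05, random_state=42, stratify=df["target"]
)

# One-hot encode the 0/4 labels so the notebook also works unchanged for
# problems with three or more classes.
encoder = LabelEncoder()
y_train = to_categorical(encoder.fit_transform(df_train["target"]))
y_test = to_categorical(encoder.transform(df_test["target"]))

# Save the class ordering so the exact same encoder can be reloaded later
# for the second fine-tuning session.
np.save("classes.npy", encoder.classes_)
```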
First, before tokenizing anything, we actually need to download the model itself. If you want to read about BERT models in general, I suggest you go to this link here, and if you want to read about the specific model we're using, this is the link for that. As you can see, we're using a model from TensorFlow Hub, and our model is the multilingual BERT, meaning it is pre-trained on more than one language, over a hundred if I remember correctly. It is cased, so it takes into account whether words are upper case or lower case. You can read more about the model in the link provided, and you can check all of the Hub models at this link. For those of you who don't really know what TensorFlow Hub is: it's a hub of already pre-trained models which you can take and fine-tune to solve problems you might have, such as text classification in our example. That's what we're going to do as well: we download the model and we set it as trainable, because, well, we're going to train it.

After downloading the model we need to read a couple of its variables: the vocabulary file and the lower-case flag. Since our model is cased, meaning it takes into account whether a letter is upper or lower case, you can check that flag after reading it by just printing it out, and you'll see that it's False; that's described right here in the text box after the cell. The other thing we read is the vocabulary file. We're not building a vocabulary like we did in the simple neural network tutorial; we're reading the one that is already there. Based on those two we set up our tokenizer, which we're going to use to turn our tweets into IDs. There are two additional tokens we're going to use, the classification token and the separation token, and you can check their IDs right here; essentially they tell the model that this is a classification task and where the separation occurs.

Now, this function right here tokenizes our tweets, and after tokenization a tweet looks something like this: a tensor of indexes, not of the words, but of the tokens. When I say "not words", what I mean is that in our simple neural network example, when we tokenized a tweet, each word corresponded to a specific index. In BERT's case we have a different kind of token: a token can be a letter, a character, or a combination of letters. One thing to note, if you want to check how the model actually tokenizes a tweet and understand it, there's a way of reading it. If a token doesn't have these hash marks in front of it (I'm not sure that's the proper name, so if you know, comment down below, I'd really appreciate it, but for now let's just call them hashtags), it means it's the start of a word. So here you see "i", then you see the second word starts with "f", which stands for "found" here, and everything in the middle of the word has hashtags on it; then another word starts with "o", again without hashtags, and so on. Some separate words share the same three-character piece, here and here, but essentially: no hashtags in front means the start of a word, hashtags in front mean something came before that character.

Cool, so we tokenized our words. How does that look? This is a visual representation of those tokens: on the x-axis you can see the length of our tokenized tweets, and on the y-axis are just the tweets themselves, the first one, second one, third one, fourth one, and so on. Different colors denote different token IDs.
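A minimal sketch of loading the model and building its tokenizer, assuming the multilingual cased BERT from TensorFlow Hub; the exact Hub URL and version are my guess (newer versions of these Hub models switch to a dictionary-based interface), and the example sentence is just for illustration:

```python
import tensorflow_hub as hub
from official.nlp.bert import tokenization  # comes with tf-models-official

# Multilingual, cased BERT base; trainable=True because we are fine-tuning it.
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2",
    trainable=True,
)

# Read the vocabulary file and the lower-case flag straight from the model.
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()  # False, cased model

tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

# The special classification and separation tokens, plus a sample tokenization;
# '##' marks a token that continues the previous word piece.
print(tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]"]))
print(tokenizer.tokenize("I found it outside"))
```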
The colors are picked based on the ID of the token. Here at the front you'll see that the first token is the same color in every row and looks empty, because it's the same color as the background; that's because we added a classification token in front of every tweet. All the other tokens differ in color because their IDs differ. All this shows us is that our tweets were tokenized.

After the tokenization we still need to add two more inputs: the mask and the input type. For the input mask we just mark everything that is not padding as input, as shown here: everything in yellow is input; again, the y-axis is the tweets themselves and the x-axis is the length of the tweet. For the input type IDs, the only thing different from the mask is this small line here, which represents the classification token, while the actual inputs, the actual words, let's say, are denoted as one segment of input. You can also see that little line by printing out the input type IDs: there's a zero before each array in the tensor.

Having said all that, you wouldn't usually do all of this this way. Since you want to iterate on your training, you should make a function that turns your actual tweets into the inputs the model needs: the tokens, the mask, and the type IDs. For that we need to pad our tweets to make them equal in size, just as we did with the simple neural network, and we need to figure out the length of our tokenized tweets. Since tokenized tweets don't really consist of words but rather of combinations of characters, the maximum length won't be the same as for the simple neural network. We can check the maximum sequence length by just going through the input word IDs and getting the maximum length, which in this case is 160; then, just to be safe, I usually make it 1.5 times longer, which amounts to 240. What does that 240 mean for us? It will be our fixed size both for how we prepare the inputs and for how we build the model, in the same way that for the simple neural network we said the input size was 180, I think, and any new tweet we wanted classified as positive or negative couldn't be longer than 180 words. Here it's the same, just in tokens.

Since we have our maximum sequence length, we wrap the things we did before into a function: encoding the text of the tweets and turning it into three separate inputs, so the input word IDs (the tokens), the input mask (which just masks the input), and the type IDs we discussed just now. It's not really that complicated; all it does is build an array for each of the inputs in the way described: we get the tokens from the tokenizer and add them here, we build the input mask from those tokenized inputs, which is here, and we get the segment IDs here. Having written our function, we just finish preprocessing our training and testing data and hand it to the model: we turn both the training and testing data into those three inputs, using the tweets themselves (denoted here as X_train), the tokenizer we built from the previously downloaded BERT model, and our maximum sequence length for padding, so all inputs end up equal in length. After that we're finished with data preprocessing and can move on to the modeling part.
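Here is a hedged sketch of that encoding function, reusing the tokenizer from the previous snippet; the function and variable names (bert_encode, max_seq_length) are illustrative, not necessarily the notebook's own:

```python
import numpy as np

max_seq_length = 240  # ~1.5x the longest tokenized tweet observed (about 160)

def bert_encode(texts, tokenizer, max_len=max_seq_length):
    """Turn raw tweets into the three arrays BERT expects."""
    all_ids, all_masks, all_segments = [], [], []
    for text in texts:
        tokens = tokenizer.tokenize(text)[: max_len - 2]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        pad_len = max_len - len(tokens)

        ids = tokenizer.convert_tokens_to_ids(tokens) + [0] * pad_len
        mask = [1] * len(tokens) + [0] * pad_len  # 1 wherever there is real input
        segments = [0] * max_len                  # one sentence, so a single segment

        all_ids.append(ids)
        all_masks.append(mask)
        all_segments.append(segments)
    return np.array(all_ids), np.array(all_masks), np.array(all_segments)

X_train = bert_encode(df_train["text"].values, tokenizer)
X_test = bert_encode(df_test["text"].values, tokenizer)
```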
The things we need to establish before modeling are pretty much the same as with the simple neural network: the number of classes, which we're going to use in our predict layer right here, and the maximum sequence length, which we're going to use to set the size of our inputs right here. We have three input layers, as I said before, and they correspond directly to the inputs we just made. Then, here, you can see that we use the BERT layer to get two different outputs: a pooled output and a sequence output. There is a difference between them, but to keep things simple for this case we use the pooled output, and you should use that one too. So we set our output to the pooled output and add a dropout layer; dropout helps with training, since, as we saw in the simple neural network case, the model tends to over-train on this specific data. Then we add another dense layer right here, our predict layer: we set the number of classes and use softmax as the activation.

Now, you can see here a graph representing our model. Is it really just this graph? Well, no: inside this Keras layer is the BERT model we downloaded, and that one is quite a bit bigger than one layer, but since we downloaded it and are using it as-is in the creation of our model, it is represented as just one block here. So, as you can see, we have the three input layers, then the downloaded BERT model, then a dropout layer against over-training, and then a dense layer for class selection. You might ask, can we add additional layers? Yes, we can, but for tutorial purposes we shouldn't go too far into that.

All that's left is to set up the training parameters and we'll be able to go and train our model. Here we set our epochs; if you remember, epochs are essentially how many times the model goes through all of the data it's given. Then the batch size, the increments in which the model receives the data; as mentioned here in the comment, select that based on your GPU resources. The bigger it is, the more resources it requires; at some point your GPU will be maxed out, you won't be able to run your model, and you will be out of memory. If that's the case, just lower the batch size and at some point you should be good to go. And again, in the imports, check that the printout says your GPU is available; if it's not, change the runtime.

Cool, so we set up our training variables and then we just compile our model. Here in the printout you'll see pretty much the same thing you saw in the graph before, but the important thing mentioned here is the total number of parameters, essentially the total number of weights we're going to use: 177.8 million. If you compare that with the simple neural network we used before, which had 11.7 million parameters, this is essentially a more than ten times bigger network used on the same data, and we'll see how much better the result can be. With the simple neural network we got, I think, 78.5 percent accuracy; what I'm expecting from this, after the initial training, is at least 80. Well, I'm pretending, I already ran everything, and it is above 80. But yes, what we're expecting is simply a much better result; that's why we're using a much more complex model.

So here we run the training. As you can see, one epoch of training takes around 2,200 seconds, which is more than half an hour, and that's why we use only five percent of our data: we want to test our models quickly and iterate by trying out better learning rates, different data, and so on.
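A hedged sketch of the model described here, reusing bert_layer, max_seq_length, and the encoded arrays from the earlier snippets; the dropout rate, learning rate, and the use of plain Adam (the notebook itself uses an optimizer with weight decay) are my assumptions:

```python
import tensorflow as tf

num_classes = 2  # positive / negative

input_word_ids = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="segment_ids")

# The whole downloaded BERT model shows up as this single layer in the graph.
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

x = tf.keras.layers.Dropout(0.2)(pooled_output)                        # fight over-training
output = tf.keras.layers.Dense(num_classes, activation="softmax")(x)   # predict layer

model = tf.keras.Model([input_word_ids, input_mask, segment_ids], output)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()  # roughly 178M parameters with the multilingual base model

history = model.fit(
    list(X_train), y_train,
    validation_data=(list(X_test), y_test),
    epochs=3,
    batch_size=32,  # lower this if your GPU runs out of memory
)
```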
After training our model for three epochs with five percent of the data, our testing accuracy is already almost 83 percent, which is great: it's higher than the simple neural network and higher than our logistic regression, and by quite a bit. Of course, as always, to visualize it better we make a plot: on the left side you can see the training and validation accuracy, and on the right side the loss. The interesting thing is that, the same as with the simple neural network, after the second epoch you see that the loss on the third epoch went up. What that means is that our model is over-training again, and the best model we had was at the second epoch. You might think that's fine because the accuracy from the second epoch to the third went up, for validation as well as for training, but the thing is you always need to check the loss as well; relying on accuracy alone when training a model isn't really a good approach, and this is why the loss is important.

Having said that, is this the end for our model? Is almost 83 percent accuracy the best we can achieve? No, not really. We only used five percent of the data; we still have the other 95 percent to train our model on. We can also change our learning rate, and we're going to do just that: add some additional data, change the learning rate, try to achieve a bit better accuracy, and see whether iterative training of our model actually works.

So what we do here is save our model for later use. We already trained the model for three epochs at a bit more than half an hour each, so around two hours, let's say, and we don't really want to train it from scratch again; we want to keep the one we already trained and fine-tune it a bit more later on, let's say on the next working day. What we do here is recompile our model: essentially we just change the optimizer to Adam, and we do that because with the custom optimizer we used before, the one with weight decay, we can't really save the model. I don't really know what that error is or how to fix it, so what I do, and what I suggest you do, is find a workaround and use that. Then we save the model: we pick the name we want to save it under and the directory to save it in, and we save it by just calling model.save with the path where it should go.

Then let's say a couple of days pass and we want to get back to our model and fine-tune it a bit. We jump into the imports and import all the packages we need, we jump into the data part and read our data set, and then we can just load our model back. We also validate that it isn't corrupted in any way; always validate a model right after saving it, because sometimes the save can go wrong and corrupt some files, and not only the model itself but also the tokenizer. Just to be sure: if this loads up and then tokenizes a tweet, you're fine, it will tokenize the tweet in the same way, and you don't really need to compare it against the tokenized tweets from before, because if something went wrong and the model didn't save correctly, this just won't run at all. After checking all of this, you're good to load your model again in a couple of days.
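A minimal sketch of that save-and-reload workaround, with hypothetical paths; swapping in plain Adam before saving mirrors what is described above, and passing custom_objects is my assumption about reloading a model that contains a hub.KerasLayer:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Recompile with a plain Adam optimizer, since saving failed with the
# custom weight-decay optimizer used during training.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.save("/content/drive/MyDrive/models/bert_sentiment140")  # hypothetical path

# ...a few days later, in a fresh session:
model = tf.keras.models.load_model(
    "/content/drive/MyDrive/models/bert_sentiment140",
    custom_objects={"KerasLayer": hub.KerasLayer},
)
model.summary()  # quick sanity check that the saved files are not corrupted
```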
So let's say those couple of days pass by: you want to train your model again, you check these graphs again and say, okay, at the second epoch things went sideways and the loss went up, I want to train it a bit more. You have your model, you decide that you'll add another five percent of your data set as training data, and you'll lower your learning rate, because your model is already partially trained.

For those of you who don't really know what the learning rate is: a neural network has a bunch of neurons, and those neurons have weights associated with them, and each of those weights impacts the loss of the model. So imagine a graph where on the y-axis we have the loss and on the x-axis we have the weight associated with a neuron. There's some weight out there, which we don't know, at least at this point, that is the best weight for the loss, the one that makes the loss minimal. The learning rate is the speed at which the weight changes. Each time we run our data through the model, during backpropagation it adjusts its weights based on whether the answer was correct or not; it always tries to move the weights toward the position where the loss is minimal, so here, toward the minimum loss. If our learning rate is 1, it might move the weight, let's say, from here all the way to here, and our loss didn't improve. At the start of training a high learning rate is actually quite good, because it gets your model going and adjusts the weights quickly, but once you've trained your model for a bit and you see the loss going up again or stagnating, you need to decrease the learning rate. Why? If the rate of 1 overshoots, making it 0.5 would move the weight from here to here, which is actually where the minimal value of the loss is. So, to keep it short: at the start of training use a somewhat higher learning rate, and then use a lower one on the second and third iterations, or however many iterations you do, to decrease the speed at which the weights change, so the model optimizes itself a bit more precisely.

That's exactly what we're going to do. We already read the data set, and what we want now is a different training and testing subsample than we had before, so we just change the random state and, bam, we have our new data. Just to be sure that our sample is random, we can check whether the plot of the class distribution looks the same as previously; it does, so everything's fine. For the label encoding, we already saved our encoder before, so all we need to do now is load it again and we have our label encoder back. For the inputs, we can build our tokenizer from the vocabulary file of the model we saved after training, and here it is; all we need to do is set lowercase to False, as it is for the model, since it's a cased model, and we're good to go, we have our tokenizer. So we don't really need to rerun any of the earlier code, just the imports and the data read. Then I just pasted the input preprocessing function here again, so we use this same function to prep our inputs once more, and then we can train a bit more.
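A hedged sketch of that second-session preparation, reusing df and bert_encode from the earlier snippets; the new random_state, the classes.npy file name, and the vocabulary path inside the saved model are illustrative assumptions:

```python
import numpy as np
from official.nlp.bert import tokenization
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# A fresh 5% sample: same sizes as before, different random_state.
df_train2, df_test2 = train_test_split(
    df, train_size=0.05, test_size=0.05, random_state=7, stratify=df["target"]
)

# Reload the label encoder saved after the first session.
encoder = LabelEncoder()
encoder.classes_ = np.load("classes.npy", allow_pickle=True)
y_train2 = to_categorical(encoder.transform(df_train2["target"]))

# Rebuild the tokenizer from the vocabulary file shipped with the saved model,
# keeping do_lower_case=False because the model is cased.
tokenizer = tokenization.FullTokenizer(
    "/content/drive/MyDrive/models/bert_sentiment140/assets/vocab.txt",  # hypothetical path
    do_lower_case=False,
)
X_train2 = bert_encode(df_train2["text"].values, tokenizer)
X_test2 = bert_encode(df_test2["text"].values, tokenizer)
```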
This is our model: we load it from the saved file, print it out, and see that it's the same model, with the same maximum sequence length and the same input size as before; the picture of the model is the same as before too. Then we can check how different the data is. We now have an X_test1 and an X_test2 (X_test1 being the original X_test and X_test2 the new one), and after evaluating the model on both data sets you'll see that it performs better on the new one, so the data sets are indeed a bit different. To be consistent when training the model for a second iteration, you want to keep using the same testing data set, so that the changes you see are the actual changes. What I'm saying here is: don't forget to save your testing data set.

Then we set our training parameters again: same number of epochs, same batch size; the only thing we change is the learning rate for our optimizer. Our optimizer actually uses weight decay, so the learning rate goes down over time by itself, but let's not get into that in this tutorial; I guess we'll keep it for later. We recompile our model (remember, we changed the optimizer to Adam when saving, and now we want to change it back to our new optimizer), and after this we can do another training session. As you can see, the training takes essentially the same amount of time; all we do is use new data here, while using the same data for validation.

If you plot everything out, what you'll see is some great news. At this point we had already trained for three epochs, so this right here is what we already had from training session one, and this here is training session two. What you saw earlier is that validation accuracy went up both from the first to the second epoch and from the second to the third, but the loss broke down at the second epoch. After training the model for another three epochs, we see that the validation accuracy increased in every successive epoch, while the training accuracy dropped significantly when we introduced the new training data and then climbed back up bit by bit. The good news is that the separation between training and validation loss is about the same as we had before, but this time the validation loss was fairly stagnant, meaning the model is not over-training as much as it did after the second epoch of the first training session. So that's a good thing: we have a better model. And by how much better? We can check the accuracy again; if we go to the evaluation (did we do that? no, we didn't, doesn't matter, we can check here, it gives the same thing), you can see that our model now predicts with 83.66 percent accuracy. What's more, I think that at this point we could add an additional five percent of data and train the model for another three epochs; we could also try lowering the learning rate again, but I think the additional data would help a lot as well. So, since the first training session we moved up by, what, around a percentage point, right? We had around 83 and now we're nearly at 84.
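A minimal sketch of that second training session, assuming the reloaded model and the arrays from the previous snippets; the lower learning rate and plain Adam stand in for the notebook's weight-decay optimizer, and the epoch and batch values mirror the first session:

```python
import tensorflow as tf

# Recompile with a lower learning rate than the first session; the notebook
# swaps back to its weight-decay optimizer here, Adam is used as a stand-in.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history2 = model.fit(
    list(X_train2), y_train2,
    validation_data=(list(X_test), y_test),  # keep the ORIGINAL test set for comparability
    epochs=3,
    batch_size=32,  # same as before, adjust to your GPU
)

# Evaluate on the same held-out set used after the first session.
loss, acc = model.evaluate(list(X_test), y_test)
print(f"Test accuracy after the second session: {acc:.4f}")
```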
By this point I think we could actually, via iterative training, get to around 85 percent accuracy with this model, and you can do exactly that: take this notebook, keep iterating the training, and you should get to something like 85. At this point the main question is whether that additional couple of percent of accuracy is worth it. Let's look at our models: logistic regression has about 0.80 accuracy, the simple neural network is pretty much the same, and our fine-tuned BERT after two training sessions has almost 84 percent accuracy (I ran it a couple of times, so it might be off a bit). The thing is, it all depends on you, and as far as this tutorial goes, you're good: you know how to fine-tune a BERT model. Either download the notebook or just go to the Colab link in my GitHub; you'll find the link in the description.

As for further investigation, how could we improve our accuracy on this model and this task, or what else could we do? There's a good question to ask ourselves at this point, as in any machine learning project: you need to think about a couple of things when thinking about deployment. In a lot of cases I would say just go with logistic regression, because it's already at eighty percent accuracy, we ran the same test with it, it works just fine, you won't have any issues, and its biggest advantage is that it's fast: it doesn't take much time to give an answer, you don't need a GPU, you don't need anything, you can implement it in a simple web page. The simple neural network? No, you already have a logistic regression with better accuracy than the simple neural network, so just go with that. Now, when comparing fine-tuned BERT with logistic regression, the main question is what's more important to you. If resources matter more, so being fast and cheap to run, then surely go with logistic regression. But if accuracy is key, go with the fine-tuned BERT. And if you think accuracy is key for your application, that the roughly four percent increase in accuracy carries much more weight than the resources you'd save using logistic regression instead of BERT, then what I'd suggest is a modified version: add additional variables. For example, here we have the date of the tweet; maybe the weekday or the time impacts the sentiment. You could add those additional variables to the BERT model itself: instead of the three inputs here, one, two, three, you could add an additional input right here and make it, let's say, time (it's good that I do machine learning tutorials and not drawing tutorials), and connect that to this layer here, or just append it and make your decisions from that; you could also add an additional dense layer here. This is what we're going to do next week: we'll add an additional input layer. We're not going to add an additional dense layer, but keep it in mind, in case I forget to mention it next week: since you're adding an additional input and just appending it to the pooled output from the BERT model, it could be quite a good idea to add an additional dense layer between the predict layer and the pooled output with the appended new variable. Having said that, that's all for today.
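A hedged sketch of that modified-BERT idea (the topic of the follow-up video): one extra numeric input, here a hypothetical tweet-time feature, concatenated with the pooled output and passed through one additional dense layer before the predict layer. All names and sizes are illustrative assumptions, reusing bert_layer, max_seq_length, and num_classes from the earlier snippets:

```python
import tensorflow as tf

input_word_ids = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input((max_seq_length,), dtype=tf.int32, name="segment_ids")
tweet_time = tf.keras.layers.Input((1,), dtype=tf.float32, name="tweet_time")  # hypothetical extra feature

pooled_output, _ = bert_layer([input_word_ids, input_mask, segment_ids])

# Append the extra variable to BERT's pooled output, then add one dense layer
# between the concatenation and the predict layer, as suggested above.
x = tf.keras.layers.Concatenate()([pooled_output, tweet_time])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)
output = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

modified_model = tf.keras.Model(
    [input_word_ids, input_mask, segment_ids, tweet_time], output
)
modified_model.summary()
```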
Now you should be able to fine-tune a BERT model. I hope that most of what I said made sense. As always, some of the code is mine and most of the code is from the internet, but I guess you can use it freely. If you make something fun out of it, or even if you have problems making something, leave a comment or write to me on Twitter, Instagram, Discord, whatever, and I'll try to do my best to help you. Next week we'll do a modified BERT model with additional variables, and with that I say goodbye and see you really, really soon. Bye.
Info
Channel: adam0ling
Views: 13,720
Rating: 4.8566308 out of 5
Keywords: machine learning, deep learning, text classification, neural network, tutorial, natural language processing, logistic regression, google colaboratory, logistic regression example, nlp tutorial, data science, twitter analysis python, south park, artificial intelligence, machine learning tutorial, machine learning python, dropout layer deep learning, text classification tensorflow, BERT, python, jupyter notebook tutorial, tf hub, tensorflow tutorial, gpt 3, sentiment analysis
Id: E9nGPt4iMM8
Length: 42min 57sec (2577 seconds)
Published: Thu Nov 05 2020