Multi-Class Language Classification With BERT in TensorFlow

Captions
Hi, and welcome to this video on multi-class classification using transformers and TensorFlow. I've done a very similar video before, but I missed a few of the final steps at the end, which were saving and loading models and actually making predictions with those models. A lot of you were asking how to do that, so I've made this video to cover those points, along with all the other steps that lead up to them. If this is the first of these videos you've seen, I'll take you through everything: sourcing the data from Kaggle, pre-processing that data (tokenization and encoding the label data), setting up a TensorFlow input pipeline, building and training the model, and then the few extra things I missed before, which are saving and loading the model and making predictions. I've left chapters in the timeline so you can skip ahead to whatever you need to see. So let's jump straight into the code.

First we're going to download the data. We'll be using the Sentiment Analysis on Movie Reviews dataset, which you can find on Kaggle: just click on train.tsv.zip, download it, and unzip it. I'm going to do it through the Kaggle API here and then unzip it with zipfile, which is a little easier. I do have a video on the Kaggle API, so I'll put a link to that in the description.

Okay, so we've got our data, which you can see on the left here. We'll import pandas and read the file with read_csv; it's a tab-delimited file, so we need to pass the tab separator. Looking at what we have, there's the sentiment and the phrase, and that's all we really need. From the phrase column we're going to tokenize the text to create two input tensors: the input IDs and the attention mask. For the moment we'll hold these two tensors in two NumPy arrays, each of dimensions (length of the dataframe) by 512, where 512 is the sequence length of our tokenized sequences when we feed them into BERT. When tokenizing, we iterate through the samples one by one and assign each tokenized example to its own row in the respective NumPy array, so we first initialize both arrays as empty zero arrays. We import NumPy, set the sequence length to 512, and set the number of samples equal to the length of the dataframe. With those we can initialize the arrays: Xids, which will hold our token IDs, is created with np.zeros, passing the size as number of samples (the length of the dataframe) by the sequence length of 512, and we do the same for Xmask, our attention mask. Checking the shapes confirms we have about 156,000 samples with 512 tokens in each sample.

Now that we've initialized those two arrays, we can begin populating them with the actual tokenized text. We'll be using the transformers library for that, and since we're using BERT we import BertTokenizer and initialize the tokenizer with from_pretrained, loading bert-base-cased.
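A minimal sketch of that setup, assuming the unzipped file is named train.tsv with a Phrase column, as in the Kaggle dataset:

```python
import numpy as np
import pandas as pd
from transformers import BertTokenizer

# tab-delimited file from the Kaggle download step
df = pd.read_csv('train.tsv', sep='\t')

seq_len = 512              # BERT sequence length
num_samples = len(df)      # ~156,000 rows

# empty arrays to be filled row by row with token IDs and attention masks
Xids = np.zeros((num_samples, seq_len))
Xmask = np.zeros((num_samples, seq_len))

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```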
Then we loop through every phrase in the Phrase column, using enumerate so that we get the row number in i and the actual text in the phrase variable. For each phrase we pull out our tokens with the tokenizer's encode_plus method, passing the phrase and a max_length equal to our sequence length of 512. We set truncation to True, so any text longer than 512 tokens is simply cut off at that point, because we can't feed tensors of different shapes into our model; everything needs to be consistent. Likewise, anything shorter we want to pad up to 512 tokens, and for that we set padding to 'max_length', which pads up to whatever number we pass as max_length. We also want to add the special tokens. In BERT there are a few of these: [CLS], which marks the start of a sequence; [SEP], which either separates different sequences or marks the end of a sequence; and [PAD], the padding token, which we'll be seeing because we set padding to 'max_length'. So if we have an input that is, say, 410 tokens in length, [CLS] and [SEP] will be added to push it up to 412, and then 100 padding tokens will be added onto the end to push it up to 512. We obviously do want the special tokens, otherwise BERT won't understand what it's reading. Finally, because we're using TensorFlow here, we set return_tensors to 'tf'.

encode_plus returns a dictionary with different keys, including our input_ids and attention_mask, and we want to add each of those to a row in the corresponding zero array. This is why we enumerated: i tells us which row to put it in. So we index Xids at row i, select the whole row, and set it equal to tokens['input_ids']; we do the same for Xmask, but this time with tokens['attention_mask'].

Let's have a quick look at what's in Xids now. Before running the loop it's all zeros; after rerunning it, we have all these other values. The first value, 101, is the [CLS] token I mentioned before, the start-of-sequence token, and the zeros at the end are all padding tokens. At the end we almost always have padding, unless the text is long enough to reach 512 tokens or has been truncated. We also see some duplication across the first few rows, with the same few token IDs repeated after the [CLS] token, and if we look at our data we can see why: the first few examples are segments of the same full sequence. So that all looks as expected.
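As a sketch, the loop described above might look like this, building on the arrays and tokenizer initialized earlier:

```python
for i, phrase in enumerate(df['Phrase']):
    tokens = tokenizer.encode_plus(
        phrase,
        max_length=seq_len,
        truncation=True,          # cut off anything longer than 512 tokens
        padding='max_length',     # pad anything shorter up to 512 tokens
        add_special_tokens=True,  # adds [CLS], [SEP], and [PAD]
        return_tensors='tf'
    )
    # write each tokenized example into its own row of the zero arrays
    Xids[i, :] = tokens['input_ids']
    Xmask[i, :] = tokens['attention_mask']
```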
Now let's have a quick look at Xmask. Here we see something different, just ones and zeros. The attention mask is essentially a control for the attention layers within BERT: wherever there's a one, attention will be calculated for that token; wherever there's a zero, it will not. This is to stop BERT from making any kind of connection between the padding tokens and the actual words, because those padding tokens aren't really there. We want BERT to ignore them completely, and we do that by marking every padding token with a zero in the attention mask array.

As well as that, we also want to encode our label data. At the moment we have the sentiment column, a set of values from 0 to 4 representing each of the sentiment classes: 0 is very negative, 1 somewhat negative, 2 neutral, 3 somewhat positive, and 4 positive. We're keeping all of those classes, but we're going to one-hot encode them. First we extract the data into an array with df['Sentiment'].values, which gives us an array of values from 0 up to 4. Then we initialize another zero array with np.zeros. Its first dimension is the number of samples, the length of the dataframe again, and its second is the array's max value plus one. This works because our array contains the values 0, 1, 2, 3, and 4: the max is 4, and there are five distinct values, so 4 + 1 = 5 gives us the total number of classes we need. That becomes the number of columns in the array, one column for each class. Checking labels.shape confirms we have the length of the dataframe, our number of samples, by our five classes.

If we print out labels now, we just have zeros. What we do next is some fancy indexing to set, in each row, the column given by the corresponding sentiment value equal to one. So for an example with sentiment 1 we set column index 1 to one, for sentiment 2 we select column index 2, for sentiment 3 column index 3, and so on. To do that we need to specify both the rows and the columns. We go one row at a time, so for the rows all we need is a range of numbers covering zero all the way down to 156,060, our number of samples, which we get with np.arange(num_samples). For the columns we already have exactly what we need: the sentiment array itself. So we index with both and set those positions equal to one. Rerunning the cell, we can now see the ones in the correct places. That's our one-hot encoding done.
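Roughly, that label encoding looks like this, assuming the column is named Sentiment as in the Kaggle dataset:

```python
arr = df['Sentiment'].values            # integer labels 0-4

# one column per class: max value 4 plus 1 gives 5 classes
labels = np.zeros((num_samples, arr.max() + 1))

# fancy indexing: for each row, set the column given by its label to 1
labels[np.arange(num_samples), arr] = 1
```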
Now what we want to do is take our data and put it into a format that TensorFlow can read. We import TensorFlow first and use the TensorFlow Dataset object, which lets us transform, shuffle, and batch our data very easily; it just makes our lives a lot easier. We create the dataset with from_tensor_slices, which takes arrays, and we pass in a tuple of Xids, Xmask, and labels. To actually view what's inside the dataset we have to use specific dataset methods, because we can't just print it out and view everything; it behaves like a generator. What we can do is call take(1), which gives us the very top sample (or, after batching, the very top batch), and print that out. We see the dataset's element shapes as a tuple: this is one sample, and inside it we have a tensor of shape 512, which is our very first Xids row, the same again for the Xmask row at index 1, and our labels as well.

That's all good, but what we need now is to merge both of our input tensors into a single dictionary. The reason is that when TensorFlow reads data during training, it expects a tuple with the inputs at index 0 and the output or target labels at index 1; it doesn't expect a tuple with three values. To deal with that, we merge both inputs into a dictionary that will be contained within index 0 of the tuple. First we create a map function, which simply applies whatever is inside it to every element of the dataset, reformatting everything into the format we define. It takes input_ids, masks, and labels, and returns input_ids and masks together in a dictionary, giving them key names so that we can map the correct tensor to the correct input later on in the model: 'input_ids' for the IDs and 'attention_mask' for the masks. That dictionary is the first part of the tuple, and the labels are the second part. That's all we need for the map function, and, like I said before, the Dataset object makes things very easy for us: to apply the mapping we just call dataset.map(map_func). Taking another quick look, the format has changed: everything is now within a two-item tuple, where index 0 is a dictionary with 'input_ids' mapping to our input ID tensor and 'attention_mask' mapping to our attention mask tensor, and index 1 is the labels.
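A sketch of that pipeline step; the function and key names follow what the video describes:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

def map_func(input_ids, masks, labels):
    # merge the two inputs into a dict so TensorFlow sees an
    # (inputs, labels) two-tuple during training
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

dataset = dataset.map(map_func)
```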
Now what we want to do is actually batch this data. I'm going to use a batch size of 16; you might want to increase or decrease this, mostly depending on your GPU. I have a very good GPU, so I'd say this is at the upper end of the size you want to be using. We call dataset.shuffle to shuffle the dataset; for the buffer value you pass in, I tend to go for this sort of value for a dataset of this size, and if you notice your data is not being shuffled well, just increase it. Then we call batch with the batch size, so we split into batches of 16. We shuffle first and then batch; otherwise we'd end up shuffling already-formed batches rather than the data within them, which is not really what we want. We also set drop_remainder to True. With a batch size of 16, if we had for example 33 samples we'd get two batches of 16 and one sample left over, which wouldn't fit cleanly into a batch, so we just drop it and get rid of it. Looking at the dataset again, the shapes have changed: we still have the tuple structure with the inputs here and the labels here, but now every tensor has 16 samples, which is exactly what we want.

So we've got our full training dataset, and what we might want to do is split it into a training and validation set. We do that by setting a split of 90/10: 90 percent training data, 10 percent validation data (you can change this). We need to calculate the size of that split, meaning the size of the training data. We could take Xids.shape[0], but that's the same as the num_samples value we defined earlier, so let's just go with that. We divide it by the batch size, which gives us the new number of samples in the dataset now that it's batched, and, since we only want 90 percent of those, we multiply by the split. When we ask for this number of samples from the dataset in a moment, we can't give it a float, because we can't have .375 of a batch; it doesn't work. So we cast the result to an integer to round it off. Now that we have our size, we can split the dataset: the training dataset uses the take method, as we did earlier with take(1), but now taking the roughly 8,700 batches we calculated; for the validation dataset we want to do the opposite, so we write skip with the same size, skipping the first 8,700 or so and taking only the final ten percent. We're not going to use the original dataset object anymore, and it's quite a large object, so we can remove it from memory.
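As a sketch of the shuffle, batch, and split steps; the shuffle buffer size here is an assumption, since the transcript doesn't state the exact value:

```python
batch_size = 16

# shuffle before batching; buffer size assumed, increase it if the
# data doesn't look well shuffled
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

split = 0.9
size = int((num_samples / batch_size) * split)   # number of training batches

train_ds = dataset.take(size)   # first ~90% of batches
val_ds = dataset.skip(size)     # final ~10% of batches

del dataset   # free the large unsplit dataset from memory
```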
Okay, so now we're on to actually building and training our model. We'll use transformers again to load in a pre-trained BERT model, this time with TFAutoModel, the TensorFlow version (if we got rid of the TF prefix we'd be using the PyTorch version). We set bert equal to TFAutoModel.from_pretrained with bert-base-cased. Just as we would with any other TensorFlow model, we can use the summary method to print out what we have, and we see just this one layer. That's because everything within that layer is a lot more complex than a single layer, but it's all embedded inside it, so we can't actually see inside just by printing the summary.

So now we have BERT, and that's great, but it's only the core, the engine of our model; we need to build a frame around it based on our use case. The first thing we need is our two input layers, one for the input IDs and one for the attention mask. We already imported TensorFlow earlier for the dataset, so we don't need to do that again. We define input_ids as a tf.keras.layers.Input layer whose shape is equal to the sequence length, with a trailing comma to make it a tuple; that's the same shape we were seeing earlier. We also need to set the name, because, as we saw, our dataset provides a dictionary and we need to know which input each of those tensors goes into: we map 'input_ids' to this layer through its name, and we'll do the same for the attention mask in a moment. We set the dtype to int32, because these are token IDs and we expect integer values. Then the same for the mask: a tf.keras.layers.Input with shape equal to the sequence length again, the name 'attention_mask' to map that tensor across, and the same dtype of int32.

So those are our input layers. After the inputs, what do we have? We want to feed them into BERT, creating our embeddings, and to do that we need to access the transformer within the bert object, which we do by writing bert.bert. Into that we pass our input IDs and, as the attention mask, the mask input layer, so those two input layers. There are two options here: we can pull out the raw activation tensors from the final layer of the transformer, which are three-dimensional, or we can use what is also included, the pooled activations, which are those 3D tensors pooled into 2D tensors. Since we're going to be using dense layers, we expect 2D tensors, and therefore we want the pooled output. We could also pool it ourselves, but the pooling has already been done, so why do it again?

Now, for the final part of the model, we need to convert these embeddings into our label predictions. For that we want two more layers, both densely connected. For the first I'm going to use 1024 units, or neurons, with ReLU activation, and we pass the embeddings into it. Our final layer is our labels: a Dense layer again, but this time with the number of labels we calculated before, the array max plus one, which is five. So we have five output labels, and what we want is the probability across all five, so we use a softmax activation function, and we'll call this the outputs layer, because it is our outputs.
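A sketch of those layers; index 1 of what bert.bert returns is taken here to be the pooled output, as described above:

```python
from transformers import TFAutoModel

bert = TFAutoModel.from_pretrained('bert-base-cased')

input_ids = tf.keras.layers.Input(shape=(seq_len,), name='input_ids', dtype='int32')
mask = tf.keras.layers.Input(shape=(seq_len,), name='attention_mask', dtype='int32')

# pooled activations: the 3D transformer output pooled into a 2D tensor
embeddings = bert.bert(input_ids, attention_mask=mask)[1]

x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)
y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)
```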
Now, that is our model architecturally; those are all of our layers, but we haven't actually put them into a model yet. They're all kind of just floating there by themselves. They do have these connections between them, but they aren't initialized into a model object, so we need to do that with tf.keras.Model. Here we set the input layers, of which we have two, so we pass them as a list containing input_ids and the mask, and we set the outputs, of which we have just one, our y. We're just setting up the boundaries of the model: the inputs and the outputs they lead to. Everything in between is already handled, because y consumes x, x consumes the embeddings, and the embeddings consume input_ids and the mask, so those connections are already set up. Checking the summary, I first get an error because I forgot to add the x connection; with that fixed, what do we have? It's a little bit messy, but we have our input_ids with its shape here and the attention mask, so those are our two input layers; they lead into our BERT layer; then we have our pre-classifying layer, the densely connected network with 1024 units; and then our outputs, the softmax, of which we have five.

Now, if you'd like to, and particularly if you don't have enough compute to train the BERT layer as well, you can also write model.layers[2].trainable = False. We select number 2 because the layers are numbered 0, 1, and 2, and BERT is number 2. This freezes the parameters within the BERT layer and trains only the other two. I will be keeping BERT trainable, although you don't need to; it will probably give you a small performance increase, but not a huge amount in most cases.

Next we set up the model training parameters. We need an optimizer, and for this we're using Adam with a pretty small learning rate of 1e-5, because we've got our BERT model in here, and we also want to set a weight decay, so this is Adam with a decay. We also want a loss function: because we're using categorical outputs, we use tf.keras.losses.CategoricalCrossentropy. We set up our accuracy as well, tf.keras.metrics.CategoricalAccuracy this time, for the same reason, passing 'accuracy' in there as its name. Then we call model.compile with the optimizer equal to our optimizer, loss equal to the loss, and metrics equal to a list containing our accuracy. So those are our model training parameters all set up.

The final thing to do now is train our model. To do that we call model.fit, as we would in any TensorFlow training, with our training dataset, which we built already, the validation dataset as the validation data, and we'll train for three epochs, which will take some time. Immediately after training I'm also going to save the model to 'sentiment_model'; this will create a directory and store all the files we need for that model in there. So I'll go ahead and run that, and I'll see you when it's done.
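As a sketch of the compile-and-train step; the decay value here is an assumption, since the transcript only says that a decay is used:

```python
model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)

# optional: freeze BERT (layer index 2) if compute is limited
# model.layers[2].trainable = False

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)  # decay value assumed
loss = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[acc])

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

model.save('sentiment_model')   # writes a directory with everything needed to reload
```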
Okay, so we've finished training our model. We got up to an accuracy of 75 percent, still improving, and a validation accuracy of 71 percent. Inside the sentiment_model directory we just created when saving the model, we have everything we need to load the model as well, so I'm going to show you how we can do that. Let's start a new notebook. First we import TensorFlow, and after that we load our model with tf.keras.models.load_model, pointing it at the sentiment_model directory. Let's just check that what we have is what we built before. Great: we can see our input layers, BERT, and then the pre-classifier and classifier layers as well, exactly what we built. Now we can go forwards and start making predictions with this model.

Before we make predictions, we still need to convert our input strings into tokens, and I'm going to create a function to handle that for us. First we import the tokenizer from transformers, BertTokenizer just like before, and initialize it with from_pretrained('bert-base-cased'), exactly the same as what we used before. Then we define a function called prep_data, which expects a text string and returns our tokens. Inside, it's just the same as what we were doing before: encode_plus on the text, a max_length of 512 as always, truncating anything longer than that, padding anything shorter up to the max length, and adding the special tokens. There's one other thing here that we don't need: the token type IDs, another tensor that encode_plus returns, so we set return_token_type_ids to False to ignore them, and we return TensorFlow tensors. Then we return the tensors in the correct format, which is the dictionary format we used before with the input_ids and attention_mask keys. One thing to remember: within the dataset we were working with TensorFlow float64 tensors, so we also need to convert these, which I believe come back as integers, into that format. To do that we use tf.cast on tokens['input_ids'], casting it to tf.float64, and we repeat the same for the attention mask. That's all we need to prepare our data.
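Put together, the loading and preparation steps might look like this:

```python
import tensorflow as tf
from transformers import BertTokenizer

model = tf.keras.models.load_model('sentiment_model')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def prep_data(text):
    tokens = tokenizer.encode_plus(
        text, max_length=512, truncation=True, padding='max_length',
        add_special_tokens=True, return_token_type_ids=False,
        return_tensors='tf'
    )
    # cast to float64 to match the tensors the training pipeline produced
    return {
        'input_ids': tf.cast(tokens['input_ids'], tf.float64),
        'attention_mask': tf.cast(tokens['attention_mask'], tf.float64)
    }
```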
Now we can call prep_data on something like "hello world". The first time, I entered it wrong: we don't strictly need the key names, but I include them to be explicit, and I'd missed the s on 'input_ids'. Rerunning that removes the error, and we can see our [CLS] token, "hello world", the [SEP] token, and then lots of padding, so prep_data works. Let's assign that to a variable, and what I want to do now is get the probabilities by calling model.predict on the prepared data. The raw predictions aren't that easy to read, and we also need to access the zero index so that we get a simple array. To get the actual class from that, we import numpy as np, because we just want the position of the maximum value out of all of these, which we get with np.argmax(probs[0]). "hello world" comes out with a neutral sentiment; something like "this movie is awful" gives us 0, and a positive phrase gives us 4. So it's working.

I know that was pretty long, but that really is everything you need from start to finish: we've pre-processed the data, set up our input data pipeline, built and trained our model, saved it, loaded it, and made predictions. So I hope that has been a useful video, and thank you very much for watching. I will see you again in the next one.
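A short usage sketch of that prediction step:

```python
import numpy as np

probs = model.predict(prep_data('this movie is awful'))

# classes: 0 very negative, 1 somewhat negative, 2 neutral,
#          3 somewhat positive, 4 positive
pred = np.argmax(probs[0])
print(pred)   # expect 0 for a very negative phrase
```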
Info
Channel: James Briggs
Views: 3,651
Rating: 5 out of 5
Id: pjtnkCGElcE
Length: 43min 23sec (2603 seconds)
Published: Thu Mar 25 2021