HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning

Captions
Hi everyone, I'm Patrick, and in today's video we are going to learn how to get started with Hugging Face and the Transformers library. The Hugging Face Transformers library is probably the most popular NLP library in Python right now; it can be combined directly with PyTorch or TensorFlow, it provides state-of-the-art natural language processing models, and it has a very clean API that makes it extremely simple to build powerful NLP pipelines. Today we have a first look at the library and build a sentiment classification algorithm. I show you some basic functions, then we have a look at the model hub, and then I also show you how you can fine-tune your own model. So let's get started.

To get started you should first install either PyTorch or TensorFlow, and then to install the Transformers library you just say pip install transformers; there is also a conda installation command that you can find on the installation page. I already did this, so we can start using it. From transformers we first import a pipeline, and we also import some utilities that we need from PyTorch: we import torch and torch.nn.functional as F, which we will use later.

Now we can start using the pipeline. We say classifier = pipeline(...) and specify the task we want, in this case sentiment analysis. You will find the different available tasks on the website: there is sentiment analysis, which is just an alias of text classification, but also, for example, a question answering pipeline, a text generation pipeline, or a conversational pipeline. What a pipeline gives you is a great and easy way to use a model for inference, and it abstracts a lot of the steps away for you, as you will see in a moment.

Now we can use this classifier to classify some text by saying res (for result) = classifier(...) with an example text: "We are very happy to show you the 🤗 Transformers library." Then we print the result and run the code. As you can see, we get the label POSITIVE with a score of 0.99, so the model is very confident that this is a positive sentence. And it only took two lines of code with this pipeline to build a sentiment analysis tool: we get the label telling us whether the text is negative or positive, and we also get the score.

Now let's look at a few more things we can do with the pipeline. First of all, we can pass in more than one text at once by giving it a list, so let's add another example text: "We hope you don't hate it." Then we get multiple results back; we call the variable results and iterate over it, printing each result. If we run this, we see that the second text gets a different result: the label is NEGATIVE, and the score is not that confident in this case, probably because the text is a little confusing ("we hope you don't hate it"). But this is how you can pass in multiple texts at once.
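Here is a minimal sketch of the pipeline usage described above; exact scores will vary slightly by library and model version:

```python
from transformers import pipeline

# Create a sentiment-analysis pipeline with the default model
classifier = pipeline("sentiment-analysis")

res = classifier("We are very happy to show you the 🤗 Transformers library.")
print(res)  # [{'label': 'POSITIVE', 'score': 0.99...}]

# The pipeline also accepts a list of texts
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
for result in results:
    print(result)
```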
So far we only used the default pipeline with the default model; now let's look at how we can use a concrete model, and then also a concrete tokenizer. We can specify the model name: model_name = "distilbert-base-uncased-finetuned-sst-2-english". I will show you where I got this name in a moment; for now, this is a DistilBERT model, a smaller and faster version of BERT that was pre-trained on the same corpus, and it was then fine-tuned on SST-2, an English dataset from the Stanford Sentiment Treebank, version 2. With the model name in hand we can give it to our pipeline via the model argument: model=model_name. In this case I can tell you that the default model for the sentiment analysis task is already exactly this one, so this should do exactly the same thing, but later we will switch it out and look at how to use different models. If we run this again, we see the result is still the same, so it worked.

Right now we just used a string to define our model, but let's take a different approach and define a model and a tokenizer explicitly, which will give us a little more flexibility later. From transformers we import the AutoTokenizer class and AutoModelForSequenceClassification. AutoTokenizer is a generic tokenizer class; AutoModelForSequenceClassification is also generic but a little more specific: it is meant for sequence classification, so it gives us extra functionality for this task. Don't worry about this for now; you can find all the available model classes in the documentation, so have a look there if you are interested. If you use TensorFlow, you prefix the class name with TF, but the rest is the same.

After importing these we create two instances: model = AutoModelForSequenceClassification.from_pretrained(model_name) and tokenizer = AutoTokenizer.from_pretrained(model_name). This from_pretrained function is a very important function in Hugging Face, and you will see it a few more times. Now we can give the pipeline the actual model and tokenizer objects rather than just the string: model=model, tokenizer=tokenizer. If we run this we should still get the same results, because these are the default versions, and indeed we do. Later, if you want a different model or tokenizer, you know how to switch: just pass different objects to the pipeline. Now, instead of using the pipeline, let's see how we can use the model and tokenizer directly and do some of the steps manually, which gives you a little more flexibility.
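A short sketch of this step, loading the model and tokenizer explicitly and handing the objects to the pipeline:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# The generic Auto* classes pick the right architecture from the model name
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pass the actual objects instead of a name string
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("We are very happy to show you the 🤗 Transformers library."))
```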
Down here, let's first use the tokenizer and see what it does. First we call tokens = tokenizer.tokenize(text) with our example sentence. Once we have the tokens, we can turn them into token IDs with token_ids = tokenizer.convert_tokens_to_ids(tokens). That is one way to do it; alternatively we can call the tokenizer directly like a function with the same string; let's name that result input_ids. Now let's print all three variables and look at the differences.

When we call tokenizer.tokenize we get a list of strings back, so each word is a separate token; one of the tokens, for example, is our emoji. When we call convert_tokens_to_ids, each token is converted to an ID, so each word gets a unique number; this is the numerical representation that our model can understand. And when we call the tokenizer directly, we get a dictionary back with the key input_ids and also an attention_mask; you don't really have to worry about the attention mask for now. If we compare the token IDs with the input IDs, we see the exact same sequence, except that the input IDs additionally contain the tokens 101 and 102, the special tokens that mark the beginning and the end of the sequence. So that is the difference between these three calls, and the input IDs are what we pass to our model later to do the predictions manually.

Like before, we can of course also give multiple sentences to the tokenizer. Usually in your code you have your training data, so let's call it X_train and just use our two sentences. We pass this to the tokenizer and call the result batch, since this is the batch we will put into our model: batch = tokenizer(X_train, ...). Here I also want to show you some useful arguments: padding=True, truncation=True, max_length=512, and return_tensors="pt" (pt for PyTorch). Padding and truncation ensure that all samples in the batch have the same length, and return_tensors="pt" is also important: it makes the tokenizer return PyTorch tensors directly. I will show you later what happens if you leave it out, but for now let's use it. If we print the batch, we again get a dictionary with the keys input_ids and attention_mask, and each key now holds a tensor with two rows, one for the first sentence and one for the second.
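A sketch of these three tokenizer calls and the batched variant; the variable names follow the walkthrough above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

text = "We are very happy to show you the 🤗 Transformers library."

tokens = tokenizer.tokenize(text)                    # list of token strings
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # the same tokens as integer ids
input_ids = tokenizer(text)                          # dict with input_ids + attention_mask

print(tokens)
print(token_ids)
print(input_ids)  # input_ids additionally contain the special tokens 101 and 102

# Tokenizing a whole batch with padding/truncation, returned as PyTorch tensors
X_train = ["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."]
batch = tokenizer(X_train, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
print(batch)
```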
As I said, these input IDs are the unique IDs our model can understand. Now that we have the batch, let's pass it to the model and run the prediction manually. In PyTorch, when we do inference, we wrap the call in with torch.no_grad(), which disables gradient tracking; I explain this in a lot of my tutorials, so have a look at those if you want to learn more. Then we call the model: outputs = model(**batch). The two asterisks unpack the batch: remember, the batch is a dictionary, and the asterisks unpack its values as keyword arguments. In TensorFlow you don't do this, you just pass in the batch as it is, but in PyTorch you have to unpack it.

Now we get the outputs of our model, and as you might know, these are just raw values. To get the actual probabilities and predictions we apply the softmax: predictions = F.softmax(outputs.logits, dim=1). Let's print the predictions, and then do one more thing and get the labels by taking the index with the highest probability: labels = torch.argmax(predictions, dim=1). We could also feed the outputs in directly and would not actually need the softmax for this, but for demonstration let's use the predictions. Let's also convert the labels to their names with a list comprehension: we call model.config.id2label with each label ID, for label_id in labels.tolist(). You will see what this does when we print it.

If we run this, it works. The outputs are a SequenceClassifierOutput, and you can see it has a logits attribute, which is why we used outputs.logits. From the logits we get the actual probabilities, and argmax gives us a tensor with the labels 1 and 0; finally we converted each label to the actual class name and get POSITIVE and NEGATIVE. By the way, I think this id2label mapping is only available for the dedicated classes like AutoModelForSequenceClassification; if we just used AutoModel, I think it would not be there. That is what these more concrete classes do for you: they give you a little more functionality for the dedicated task.

We also see that the loss is None in this case. If you want a loss to inspect, you can pass the labels argument to the model so that it knows how to compute the loss: labels=torch.tensor([1, 0]). If we run this again, we now see a loss.
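Putting the manual inference steps together, a self-contained sketch of this part:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

X_train = ["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."]
batch = tokenizer(X_train, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

with torch.no_grad():                                      # no gradient tracking for inference
    outputs = model(**batch, labels=torch.tensor([1, 0]))  # labels make the model compute a loss
    predictions = F.softmax(outputs.logits, dim=1)         # raw logits -> probabilities
    labels = torch.argmax(predictions, dim=1)              # index of the highest probability
    label_names = [model.config.id2label[label_id] for label_id in labels.tolist()]

print(outputs)      # SequenceClassifierOutput with loss and logits
print(predictions)
print(labels)
print(label_names)  # e.g. ['POSITIVE', 'NEGATIVE']
```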
Again, this labels argument is, I think, specific to AutoModelForSequenceClassification. So this worked, and if we now look carefully at the probabilities, we see the labels POSITIVE and NEGATIVE: for the first sentence the highest probability is about 0.9997, and for the second sentence the largest number, about 0.53, belongs to NEGATIVE, so that one was picked. If we compare these with the results we got from our pipeline, they are exactly the same numbers. Now you can see the difference between a pipeline and using the tokenizer and model directly: with the pipeline we only need two lines of code and directly get what we want, the label and the score we are interested in. That might be perfectly fine; but if you want to do it manually, you can do it the way I showed you and get the same results to work with further. Using the model and tokenizer directly becomes important when you, for example, want to fine-tune your model; I will show you roughly how to do that later.

Let's just assume we did fine-tune our model. Then we can specify a save directory, say a folder called "saved", and call tokenizer.save_pretrained(save_directory), and the same for the model: model.save_pretrained(save_directory). We can then load them in another application: tokenizer = AutoTokenizer.from_pretrained(save_directory), and the same for the model with AutoModelForSequenceClassification.from_pretrained(save_directory). The from_pretrained function accepts either a model name or a directory, so this gives you back the exact same model and tokenizer. As you can see again, these from_pretrained functions are very important and you will use them a lot.

I think these are the basic functions you need in order to build a pipeline or to apply the model and tokenizer manually. Now let's look at how we can use a different model. You can load one from disk if you already have a pre-trained model on your computer, but you can also go to the Hugging Face model hub at huggingface.co/models. There you can search through the different models and, for example, filter by task; in this case we want text classification, which is the same as sentiment analysis. With the filter applied, you can see that the most popular model is already the one we used; if we click on it we get more information, and it shows the exact same model name we had in our code. Once you have decided on a model, you can click to copy the name and paste it into your code.

Let's say we now want a different model: in this case I want to do sentiment classification with German sentences, so of course I need a model trained on German. You can search here as well; I could search for distilbert again and see which different versions are available.
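A short sketch of saving and reloading; the folder name "saved" follows the walkthrough and is arbitrary:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

save_directory = "saved"  # any local folder
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Later, e.g. in another application: from_pretrained also accepts a directory
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
```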
Or let me search for "german" and take the most popular model, by Oliver Guhr: a German sentiment BERT. We get more information about it, and sometimes you also see some example code, which is helpful. Now we click to copy the name, and in our application I comment out the previous model name and paste the new string. Now we can hand this model name to our model and tokenizer.

Let's try this with some example texts in German. Let me quickly translate them for you: "not a good result", "this was unfair", "this was not good", "not as bad as expected", "this was good", and "she drives a green car". So the first three texts are negative, the next two are rather positive, and the last one is neutral. Let's see if our model detects this correctly.

Now we do the same steps as above, so we can copy and paste them: inside with torch.no_grad() we call the model and unpack our batch, outputs = model(**batch); then we take the label IDs, label_ids = torch.argmax(outputs.logits, dim=1), and print them; and finally, just as before, we convert them to the actual label names by calling model.config.id2label for each label_id in label_ids.tolist(), and print the labels. Let's also print the batch in this case and run it. I get an error: I forgot to write outputs.logits like we did before, so let's try again. Now I only get two results: of course, the tokenizer still needs the German texts, so let's call them X_train_german, use that variable in the tokenizer, and run it once more.

And as we can see, we get the labels 1, 1, 1, 0, 0, and 2, which corresponds to negative, negative, negative, then two times positive, and then neutral. This is exactly what I told you: the first three sentences are rather negative, then come two positive ones, and the last one is neutral. So our German model works as well, and this is how you use different models: you simply search the model hub, and hopefully there is an already pre-trained version for the task you want; then you just use it as your model name and you are good to go. If there is no pre-trained version, you have to fine-tune your own model, which I will show you in a moment.
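A sketch of the German example; note that the German sentences below are reconstructed from the English translations given in the video, so the exact wording is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "oliverguhr/german-sentiment-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

X_train_german = [
    "Mit keinem guten Ergebnis",       # "not a good result"      -> negative
    "Das war unfair",                  # "this was unfair"        -> negative
    "Das war nicht gut",               # "this was not good"      -> negative
    "Nicht so schlecht wie erwartet",  # "not as bad as expected" -> positive
    "Das war gut!",                    # "this was good"          -> positive
    "Sie fährt ein grünes Auto.",      # "she drives a green car" -> neutral
]

batch = tokenizer(X_train_german, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
    label_ids = torch.argmax(outputs.logits, dim=1)
    labels = [model.config.id2label[label_id] for label_id in label_ids.tolist()]

print(label_ids)  # e.g. tensor([1, 1, 1, 0, 0, 2])
print(labels)     # the corresponding negative / positive / neutral class names
```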
One more thing I want to mention: I want to talk about return_tensors="pt". If we print the batch and look at the input IDs, we see they form a tensor, so they are already in the PyTorch format (we could request the TensorFlow format here as well, or we could simply omit the argument). If we omit it, we don't get this tensor format; the values are just Python lists. But then what you can do is convert it yourself: since the batch is a dictionary, we can access the key input_ids, as we saw, and say batch = torch.tensor(batch["input_ids"]). Now we have created an actual tensor out of it, and since it is no longer a dictionary we don't have to unpack it with the two asterisks anymore: we remove those and pass the tensor directly. If we run it again, this works as well: we get the same result, and when we print the batch we see it is a tensor directly. So be careful to specify what you want here; if you use PyTorch it is simply easier to pass return_tensors="pt", but if you don't, now you know what to do instead.
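A small sketch of this manual conversion, reusing the variable names from above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

X_train = ["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."]

# Without return_tensors="pt", the tokenizer returns plain Python lists
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512)
batch = torch.tensor(batch["input_ids"])  # build the tensor manually from the input ids

with torch.no_grad():
    outputs = model(batch)  # a plain tensor is passed directly, no ** unpacking needed
    labels = torch.argmax(outputs.logits, dim=1)
    print(labels)
```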
All right, now we know how to use different models; try this out with other models for your language and see if it works. Next, let's look at how we can fine-tune our own models; this is very important. I already prepared some code and will go over it fairly quickly, but there is also very good documentation about this: you can go to the documentation page and even open it in Colab, with either PyTorch or TensorFlow code, so I encourage you to check it out.

Basically there are five steps (in this example for PyTorch): first, prepare the dataset, loaded from a CSV file or wherever your data lives; second, load a pre-trained tokenizer and call it with the dataset to get the encodings, i.e. the token IDs; third, build a PyTorch Dataset from those encodings (if you don't know what a PyTorch Dataset is, I have a link for you where I explain this); fourth, load a pre-trained model; and fifth, either use a Hugging Face Trainer, which abstracts away a lot of things, or write a native PyTorch training loop like in our other PyTorch code. A condensed sketch of all five steps follows after this walkthrough.

Let's go over this quickly. We define our base model name: this time we start with the plain distilbert-base-uncased version, not the fine-tuned one. Step one, we prepare the dataset: we write a helper function that creates texts and labels out of the raw data. I downloaded a dataset (available at the website shown) and put it in our folder; it contains movie reviews, so we fine-tune our model on movie reviews for sentiment classification. With the helper function we create the training texts and labels, and we also do a train/test split to get validation texts and labels.

Next we define a PyTorch Dataset class that inherits from torch.utils.data.Dataset (again, I have a tutorial explaining how this works); it takes the encodings and the labels and stores them. For the encodings we need a tokenizer, so we again use from_pretrained with the model name; and since we know we are using DistilBERT, we can take a concrete class this time: instead of the generic AutoTokenizer we use DistilBertTokenizerFast. We apply it to the training, validation, and test sets to get the encodings, and put those into our Dataset class to create the PyTorch datasets.

Then we import the Trainer and TrainingArguments, which are available in the Transformers library, and set everything up: in the arguments we specify, for example, the number of training epochs, the output directory, the learning rate, and whatever other parameters we want. We create our model, again from a concrete model class with from_pretrained, set up the Trainer with the model, the training arguments, the training set, and the validation set, and then we simply call trainer.train(), which does all the training for us. Afterwards you can test it on your test dataset, and then you have a fine-tuned model.

Instead of using the Trainer, if you want to do it manually and have even more flexibility, you can use a normal PyTorch training loop. For this we use a DataLoader and an optimizer, in this case an optimizer from the Transformers library, and we specify our device. We create the model, push it to the device, and set it to training mode; we create the data loader and the optimizer; and then we run the typical training loop: for each epoch, and for each batch in the training loader, we call optimizer.zero_grad(), push the batch to the device if necessary, call the model, take the loss (which is already contained in the output, so we can access it directly), then call loss.backward() and optimizer.step(), and iterate. Afterwards we can set the model back to evaluation mode. This is how you do it in native PyTorch code.

So this is basically how you do fine-tuning and train your own models, and afterwards you can also upload them to the Hugging Face model hub if you want, which I think is pretty cool. That's all I wanted to show you for now; I think it's enough to get started with Hugging Face. I hope you enjoyed this tutorial, and I hope to see you in the next video. Bye!
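To make the five steps concrete, here is a condensed sketch in the spirit of the tutorial the video follows; the aclImdb folder layout, the split size, and the hyperparameters are illustrative, not the video's exact values:

```python
import torch
from pathlib import Path
from sklearn.model_selection import train_test_split
from torch.optim import AdamW  # the video uses the (since deprecated) AdamW exported by transformers
from torch.utils.data import DataLoader
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast, Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"

# Step 1: prepare the dataset (the aclImdb movie-review folder layout is assumed)
def read_imdb_split(split_dir):
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        for text_file in (Path(split_dir) / label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split("aclImdb/train")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2)

# Step 2: load a pre-trained tokenizer and encode the texts
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

# Step 3: wrap the encodings and labels in a PyTorch Dataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)

# Step 4: load a pre-trained model
model = DistilBertForSequenceClassification.from_pretrained(model_name)

# Step 5a: train with the Hugging Face Trainer
training_args = TrainingArguments(output_dir="./results", num_train_epochs=2,
                                  per_device_train_batch_size=16, learning_rate=5e-5)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()

# Step 5b (alternative): a native PyTorch training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()  # set the model to training mode

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(2):
    for batch in train_loader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # the "labels" key makes the model return a loss
        loss = outputs.loss
        loss.backward()
        optimizer.step()

model.eval()  # back to evaluation mode
```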
Info
Channel: Python Engineer
Views: 5,542
Rating: 4.9209485 out of 5
Keywords: Python, HuggingFace, Transformers, Deep Learning, NLP, NLP Python, Huggingface Python, Huggingface Sentiment Analysis, Sentiment Analysis, Text Classification, Tokenizer, Huggingface PyTorch, Huggingface TensorFlow, Huggingface fine tune, huggingface pipeline
Id: GSt00_-0ncQ
Length: 38min 12sec (2292 seconds)
Published: Mon Jun 14 2021