Getting Started With Hugging Face in 15 Minutes | Transformers, Pipeline, Tokenizer, Models

Captions
Hi everyone, today I show you how to get started with Hugging Face and the Transformers library. The Hugging Face Transformers library is the most popular NLP library in Python, with over 60,000 stars on GitHub. It provides state-of-the-art natural language processing models and a very clean API that makes it super simple to build powerful NLP pipelines, even for beginners. So today I show you how to get started with it: how to use the pipeline, how to use a model and tokenizer, how to combine it with PyTorch or TensorFlow, how to save and load models, how to use models from the official model hub, and also how to fine-tune your own models. So let's get started.

First of all, how do we install the Transformers library? It should be combined with your favorite deep learning library, which could be PyTorch, TensorFlow, or even Flax. Go ahead and install one of those first, and then you can install the Transformers library with pip install transformers. That's all you need to do.

Now let's have a look at the pipeline. A pipeline makes it super simple to apply an NLP task because it abstracts a lot of things away for us. The way it works is that we say from transformers import pipeline, then we create a pipeline object, classifier = pipeline(...), and pass in a task; in this case we want to do sentiment analysis. There are a lot more tasks available, and we will have a look at them in a moment. We create our object and then apply this classifier to the data we want to test; in this case we put in just one string, "I've been waiting for a HuggingFace course my whole life.", and print the result. When we run this, we see the label, which is POSITIVE, and we also get a score of almost 96 percent.

The pipeline does three things for us. The first is the preprocessing: it preprocesses the text, which in this case means applying a tokenizer. Then it feeds the preprocessed text to the model and applies the model. Finally it does the postprocessing, which means it presents the result the way we would expect it; in the case of a sentiment analysis pipeline, for example, it shows us the label POSITIVE or NEGATIVE, but this can look different for different tasks.

Now let's look at a few other examples of pipelines. For example, we can use a text generation pipeline, and we can also give it a specific model. In the first example we just used the default model, which you can also see in the output, but you can give it a specific model, either one that you have saved locally or one from the model hub; we will have a look at this in a moment. Let's apply this example to generate some text. There are also different available arguments, and for those I recommend checking out the documentation. Here is the result: we asked for two possible return sequences, so the first generated text is "In this course, we will teach you how to play chess ...", and the second is "In this course, we will teach you how to use a combination of a traditional and simple ...". So this also works.

As a third example, we can do zero-shot classification. This means we can give the pipeline a text without knowing the corresponding label, and provide different candidate labels, for example that this text can be education, politics, or business. When we run this, we get back all the different labels with their scores, and the highest score, over 96 percent, is for education, which is correct.
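The code shown on screen for these three pipelines looks roughly like the following sketch. The generation model name, the prompt, and the zero-shot input text are assumptions reconstructed from the outputs described above, not taken verbatim from the video:

    from transformers import pipeline

    # Sentiment analysis with the default model
    classifier = pipeline("sentiment-analysis")
    result = classifier("I've been waiting for a HuggingFace course my whole life.")
    print(result)  # [{'label': 'POSITIVE', 'score': ~0.96}]

    # Text generation with an explicitly chosen model (model name assumed)
    generator = pipeline("text-generation", model="distilgpt2")
    results = generator(
        "In this course, we will teach you how to",  # prompt assumed from the output
        max_length=30,
        num_return_sequences=2,
    )
    print(results)

    # Zero-shot classification: classify a text against candidate labels
    classifier = pipeline("zero-shot-classification")
    result = classifier(
        "This is a course about the Transformers library",  # example text assumed
        candidate_labels=["education", "politics", "business"],
    )
    print(result)  # 'education' should receive the highest score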
For the other available pipelines, I recommend going to the official documentation, where you can see all the supported tasks: for example audio classification, automatic speech recognition, image classification, question answering, translation, and summarization. This is super cool, and I recommend playing around with different ones to see what the results look like.

Now let's look behind the pipeline and understand the different steps a little better. For this we use a tokenizer and a model class: we say from transformers import AutoTokenizer, AutoModelForSequenceClassification. AutoTokenizer is a very generic class, and AutoModelForSequenceClassification is also generic but a little more specific to the sequence classification task; for the details I recommend the official documentation. If you know you want a specific architecture, there is, for example, also a BertTokenizer class and a BertModel class. So we import those classes and create instances of them. For this we specify a model name; here it is just the default model that is used for the sentiment analysis pipeline. Then we call the model class with .from_pretrained(model_name), and the same for the tokenizer. This from_pretrained method is a very important method in Hugging Face that you will see many times, so keep it in mind. Now that we have these, we can copy and paste the pipeline code from before and pass model=model and tokenizer=tokenizer. Since this is the same default model, it should produce the very same result, and when we run it, it does. So this is what's going on under the hood: there is a tokenizer and a model.

Now let's have a closer look at the tokenizer and see what it is doing. A tokenizer basically puts a text into a mathematical representation that the model understands. To use it, we can call the tokenizer directly and give it a text as input, or we can put in multiple texts at once as a list. We can also do the steps separately: tokenizer.tokenize() gives us the tokens, tokenizer.convert_tokens_to_ids() gives us the IDs, and, the other way around, tokenizer.decode() gives us the original string back.

Let's run this and look at the different outputs. If we apply the tokenizer directly, we get a dictionary that contains the input_ids and also an attention_mask. For now we don't have to worry about the attention mask; it is basically a list of zeros and ones, where a zero means that the attention layer should ignore that token. If we do the steps separately, tokenizer.tokenize() shows us the different tokens, converting the tokens to IDs shows that each token has a unique corresponding ID, and decoding the IDs gives us the original string back, except that the capitalization has been removed, since this is an uncased model. And if we compare the IDs from the direct call with the IDs from convert_tokens_to_ids, they are the very same, except that the direct call adds one special ID at the beginning and one at the end, which mark the beginning and the end of the sentence. So that's how a tokenizer works.
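As a rough reconstruction of these steps, the following sketch uses the actual default checkpoint behind the sentiment-analysis pipeline; the example sentence passed through the separate tokenizer steps is an assumption:

    from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

    # The default model of the sentiment-analysis pipeline
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Same pipeline as before, but with an explicit model and tokenizer
    classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    print(classifier("I've been waiting for a HuggingFace course my whole life."))

    # The individual tokenizer steps (example sentence assumed)
    sequence = "Using a Transformer network is simple"
    print(tokenizer(sequence))  # dict with input_ids and attention_mask
    tokens = tokenizer.tokenize(sequence)
    print(tokens)  # lowercased word pieces
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)  # one unique id per token
    print(tokenizer.decode(ids))  # "using a transformer network is simple"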
Now let's see how we can combine this code with PyTorch or TensorFlow. In this example we use PyTorch, but the code is very similar with TensorFlow; with TensorFlow the classes usually have a TF prefix. In the first step I simply apply the pipeline like before, but now with multiple sentences: instead of a single sentence we put in a list of sentences, which we call our X_train data, feed it to the pipeline classifier, and print the result.

Then we do the same steps separately. First we call the tokenizer with the X_train data and call the result our batch, and we can give it different arguments, like padding=True, truncation=True, max_length=512, and return_tensors="pt", which means PyTorch format; you will see how this looks in a moment because we print the batch. Usually we apply the tokenizer directly like this instead of calling the different functions separately. Then we do the inference in PyTorch: we say with torch.no_grad(), call our model, and unpack the batch with **, because the batch is a dictionary. Then we can apply different functions, like F.softmax to get the predictions or torch.argmax to get the labels. These predictions should be the same scores that we get from our pipeline, because it is essentially the same sequence of steps, except that now we do them ourselves.

Let's run this and have a look at the result. When we print the batch, you see it is a dictionary with the input_ids, and the values are now tensors, because we specified the PyTorch tensor format; without that argument this would just be a normal list, and we would have to take care of converting it to the correct format ourselves. This makes it super handy to work with PyTorch and TensorFlow. Then we print the predictions and the labels, and if we compare the prediction scores with the pipeline scores, they are the very same. So this is how it works step by step, and this can be useful if, for example, we want to fine-tune our model with a PyTorch training loop.

To save a tokenizer and a model, we specify a save directory and then call tokenizer.save_pretrained() and model.save_pretrained(). When we want to load them again, we pick a class like AutoTokenizer and call AutoTokenizer.from_pretrained() with that directory, and for the model we also say .from_pretrained(). Then we get the loaded tokenizer and model, and this should give you the same results as before.
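A rough sketch of this PyTorch workflow, using the arguments described above; the example sentences and the save directory name are assumptions:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Multiple sentences at once (example data assumed)
    X_train = [
        "I've been waiting for a HuggingFace course my whole life.",
        "Python is great!",
    ]

    # Tokenize the whole batch and return PyTorch tensors
    batch = tokenizer(X_train, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    print(batch)  # dict of tensors: input_ids and attention_mask

    # Inference without gradient tracking
    with torch.no_grad():
        outputs = model(**batch)
        predictions = F.softmax(outputs.logits, dim=1)  # same scores as the pipeline
        labels = torch.argmax(predictions, dim=1)
        print(predictions)
        print(labels)

    # Saving and loading again
    save_directory = "saved"
    tokenizer.save_pretrained(save_directory)
    model.save_pretrained(save_directory)
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    model = AutoModelForSequenceClassification.from_pretrained(save_directory)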
Now let's have a look at how we can use different models from the model hub. On the official homepage we click on Models, and you see there are almost 35,000 models available, created by the community, which is just awesome. On the left side we can filter, for example by pipeline task, library, dataset, or language, and we can also use the search bar; for example, if I want a German model, I can simply search for that. Let's filter for text classification, which covers the sentiment analysis task. Here we find the default model again, and usually the name tells you what the model is: in this case it is a DistilBERT base uncased model, fine-tuned on the SST-2 dataset, in English. You can then read through the model card to find more information.

Let's clear this and search for summarization instead, and click on one of the models. Sometimes you even find code examples there, and the way you can use the model is either to grab that code example or, at the top next to the model name, click on the copy icon, which copies the whole model name. Then we jump back to the code and create a pipeline again, in this case a summarization pipeline, and paste in this model name as the model argument. Then the pipeline applies this model from the model hub. This is how you can use the model hub to work with different models.

Now let's briefly go over how we can fine-tune our own model. I'm not going into detail here, because there is excellent documentation on the official pages; I will put the link in the description. By the way, on those pages you can also switch between PyTorch and TensorFlow code, open a Colab, and have a look at the example code, which is super helpful. But usually the workflow looks like this: for fine-tuning we of course use our own dataset, so we prepare this first; then we load a pretrained tokenizer, call it with this dataset, and get the encodings; in the case of PyTorch we prepare a PyTorch Dataset with the encodings; then we also load a pretrained model, and now we can use the Trainer class from the Transformers library together with the TrainingArguments; we set this up with the model we want to use and the datasets we prepared, and then we simply call trainer.train(). We could also do this with a native PyTorch training loop, but the Trainer makes your life super simple. And this is how you fine-tune your own model; a rough sketch of this flow is included below.

All right, I hope you enjoyed this tutorial. If you have any questions, let us know in the comments. You might also enjoy this video about how to get started with OpenAI and GPT-3, so if you haven't already, check it out. I hope to see you in the next video. Bye!
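As referenced above, here is a minimal sketch of the fine-tuning flow the video outlines, assuming a binary text classification task; the base model, the toy dataset, and the training hyperparameters are illustrative assumptions, not taken from the video:

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # 1) Prepare your own dataset (toy data assumed here)
    train_texts = ["I loved this movie!", "This was terrible."]
    train_labels = [1, 0]

    # 2) Load a pretrained tokenizer and get the encodings
    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    train_encodings = tokenizer(train_texts, truncation=True, padding=True)

    # 3) Wrap the encodings and labels in a PyTorch Dataset
    class TextDataset(Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    train_dataset = TextDataset(train_encodings, train_labels)

    # 4) Load a pretrained model and set up the Trainer
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    training_args = TrainingArguments(output_dir="./results",
                                      num_train_epochs=2,
                                      per_device_train_batch_size=8)
    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

    # 5) Fine-tune
    trainer.train()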
Info
Channel: AssemblyAI
Views: 245,207
Id: QEaBAZQCtwE
Length: 14min 48sec (888 seconds)
Published: Sun Apr 03 2022