Hugging Face Transformers: the basics. Practical coding guides SE1E1. NLP Models (BERT/RoBERTa)

Captions
Hello everyone, welcome to my first episode in my practical guide series. Today we're going to be looking at the Hugging Face Transformers library and going through some of the basics. In future episodes we'll move on to more advanced topics: taking some of the language models from the Transformers library, retraining them, and applying them to our own downstream tasks, so stay tuned for that.

In today's episode, as I said, we're just going to go through some of the basics: what the Hugging Face Transformers library is, what it can do, and how to navigate their website and documentation. When I first found the Transformers library and was trying to use it for some tasks, I thought there was a bit of a lack of guides focused on how to actually implement these models and how to navigate the documentation, so that's what we're going to try to do today. We're going to get some of these Hugging Face language models, from their website or from the library, and apply them to some very simple examples to begin with. In future episodes we'll train our own models and apply them to our own tasks. So let's jump in.

Let's start on their website, so I'm just going to go to huggingface.co. What is Hugging Face? Hugging Face is a company, and their most successful product is this thing called the Transformers library. You can have a look at it on GitHub, under huggingface/transformers; it has a huge number of stars. So what can you actually do with this library? Its main purpose is to offer a platform for accessing large language models, which are all based on the Transformer architecture underneath, which is why it's called the Transformers library.

Let's have a little look at some of the models they offer. These are the most popular ones at the top: BERT base cased, distilled RoBERTa, RoBERTa base, distilled BERT. So what does all of this mean? Back in 2017, I think it was 2017, the "Attention Is All You Need" paper came out, which was the first paper to apply the Transformer architecture to a language model. A year later, in 2018 I think, but I'd have to double check, the BERT paper came out. So what does BERT stand for? BERT is Bidirectional Encoder Representations from Transformers. It's a large language model trained on a masked language modelling task, on gigabytes and gigabytes of text data.

The model comes in a couple of flavours: a base and a large version. The base has a smaller set of weights, fewer parameters; the large has a larger set, so the large is going to be more effective, but you have the training and inference overhead of processing all those extra weights and parameters. For BERT there's also an uncased and a cased version, depending on whether you're going to be using capital letters or not, so you'll want to pick what's right for your task and for the compute you have. There are also distilled versions of these models, which are basically the same model but with a much smaller set of weights and parameters, so faster inference and faster training. You sacrifice a little bit of performance, but not much, so you can weigh that up for your use case.
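To make those flavours concrete, here is a minimal sketch (my own illustration, not code from the video) that loads a few of the standard public checkpoints and compares their sizes; the checkpoint names are the usual ones on the Hugging Face Hub.

```python
from transformers import AutoModel, AutoTokenizer

# A few standard public checkpoints: base vs large trades parameters for
# accuracy, "cased" keeps capitalisation, and distilbert is the smaller,
# faster distilled variant.
checkpoints = [
    "bert-base-uncased",
    "bert-base-cased",
    "bert-large-uncased",
    "distilbert-base-uncased",
]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    # Rough parameter count, to see the size difference between variants.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```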
There are also the RoBERTa models. RoBERTa stands for the Robustly Optimized BERT Pretraining Approach, released in 2019, I think, by Facebook. It's basically a BERT model trained in a slightly different way: trained for much longer, on much more text, with a few other special techniques applied during training, which massively improves performance over the base BERT model. RoBERTa, I think, is always cased, so you just have the base model and the large model.

Let's have a little look at the BERT base model, the most popular model they offer, and jump in. Over here you can read the model description, and you can read the paper if you like; the model is based off this paper, and the link will take you to arXiv. So it was released in 2018, as I thought.

Let's have a look at their hosted inference API. As I said, the model is trained on a masked language modelling task, so you mask out a token (be careful here, it's not a word you're masking out, it's a token) and you try to predict what the missing token is. Let's see what this thing says. "Paris is the [MASK] of France", and it puts "capital" with a very high probability. By the way, the output of this model isn't just five tokens with a probability: it actually produces a probability distribution over the whole vocabulary, and it just happens that "capital" has the largest probability here, by quite a long way.

But let's put in another example where the token you want back is part of the previous word. Let's say "I strolled along the river [MASK]." Now, "riverbank" is actually one word, and here we see "##bank"; that "##" prefix means the token is a continuation of the previous one. So "I strolled along the riverbank", one word, and that's been assigned quite a high probability. But you can see the other options here: "I strolled along the river bank", "the river again", "the river path", "the river Thames". These are all totally decent; any of them might be the real masked token. I think that's the important thing to understand about the masked language modelling task: there isn't always one correct answer, the masked token could often be many things, and that's why, when training BERT, it has to see lots and lots of examples. But at the end you get a model that has a sort of inherent understanding of the language.

So that's an example of a masked task, but what about a sentiment classification task? Let's go to Text Classification here under Tasks, and we'll go to the DistilBERT base uncased model fine-tuned on, I think it's the Stanford Sentiment Treebank (SST-2), the English task. Instead of producing a probability distribution over the vocabulary, here, since it's binary classification of positive and negative, we want something like a single output neuron at the end of the network that's one for positive and zero for negative, or perhaps two outputs at the end, something like that. And here, obviously, for "I like you. I love you" it puts a very high probability on positive, and for "I hate you" you get a very high probability for negative. In the underlying code they probably apply a sigmoid function or something like that, so the output gets squashed towards either one or zero.
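If you want to try those masked language modelling examples outside the browser widget, a minimal sketch with the fill-mask pipeline might look like this. I'm assuming the standard bert-base-uncased checkpoint here, and the scores you get back won't necessarily match the ones quoted above exactly.

```python
from transformers import pipeline

# Fill-mask pipeline over a BERT-style checkpoint; "[MASK]" is its mask token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Top predictions for the missing token, each with a probability.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 4))

# A case where the best completion is a subword piece ("##bank" -> riverbank).
for prediction in fill_mask("I strolled along the river [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```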
So this is all well and good. You can use these models straight out of the box if you like, and it's fun to play around with the hosted inference API, but how do you actually use these models in practice? You have two options. You can use the Hugging Face library itself: they've got some useful pipelines, for sentiment analysis for example, which we'll go through, and they've got a Trainer class to train your own models. The alternative is to implement the models yourself with an existing ML framework such as PyTorch or TensorFlow, and I think they've started to support some other frameworks as well.

Let's go up to Resources here and down to the Transformers documentation. This is a really, really useful section for when you're actually using the library and trying to implement models. Down here on the left-hand side they've got some guides on using Transformers, some advanced guides, and if you keep scrolling down you'll find the models section. Now, on the previous page I showed you they have lots of models; for example, that sentiment classification model fine-tuned on the Stanford Sentiment Treebank dataset is a specific model. Here it's slightly different: it's actually the architectures that are supported. So there's the BERT architecture, or if I scroll down I'll find the RoBERTa architecture... where is it... RoBERTa, down here.

You can go into any of these; I'm just going to go back up to the BERT page to keep it simple for now. Once you click one of these, you'll see a list down here: you've got BertModel, you've got BertForSequenceClassification, and then you'll also see what look like duplicates, such as TFBertForSequenceClassification. That prefix tells you how the model is implemented: TF is for TensorFlow, while the standard class is implemented in PyTorch. We're going to be focusing on PyTorch in this series, and we'll also look at some of the built-in Transformers tooling; we'll have a look at their Trainer in some later episodes.

So let's go to BertForSequenceClassification. Here you'll find the parameters the model takes, what it gives back, a little example of how to use it, and some other things. One other really useful thing: from BertForSequenceClassification you can jump straight to the source code itself and have a look at it. That can be a good way to figure out what's actually happening when you're using the model, because it can sometimes be a bit black-boxy and you're not really sure what's going on. You can also use this to see what BERT model names are available for this class: you've got the large model, the cased and uncased versions, a multilingual version, a Chinese version. So it can be pretty useful to have a look at this stuff.

Now let's actually jump into some coding. Instead of just looking at the documentation or using the online inference API, let's look at how to get your own models into a runtime and use them.
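As a rough sketch of what that PyTorch versus TensorFlow split looks like in code (my own illustration using the standard bert-base-uncased checkpoint, not the documentation's own example):

```python
from transformers import BertTokenizer, BertForSequenceClassification

# PyTorch implementation: the class name with no prefix.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pt_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # e.g. a binary positive/negative classification head
)

# The TensorFlow twin of the same architecture sits behind the TF prefix:
#   from transformers import TFBertForSequenceClassification
#   tf_model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
```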
I'm just going to connect to a runtime on Google Colab. If you don't know what Google Colab is, google it and figure it out; it's really useful. It's just an online coding environment, but one thing that's especially handy for the Transformers library is that, if you're going to be retraining any models, you can get access to a GPU online for free.

Let's run through the little example I've made here. The first thing we want to do is install the library, and I've put this %%capture at the top, which just swallows the installation output you normally get; I don't like seeing that, so I'm getting rid of it.

Now let's look at their built-in sentiment analysis pipeline. From transformers we import pipeline, and we make a classifier for sentiment analysis. I've made two example sentences here, a positive one and a negative one, "I love dogs" and "I really hate dogs", so we know which is which. Also note that it says no model was supplied, so it defaulted to the DistilBERT base uncased model fine-tuned on the Stanford Sentiment Treebank, which is actually the model we looked at earlier. So you can already see that I could swap that out and use a different one, trained on a different sentiment analysis dataset, for example.

Now that we've downloaded this model, we have it in our runtime, so I'm just going to pass in our positive and negative sentences and have a look. The first one got labelled positive with a very high score, the second one got labelled negative with a very high score, exactly what we'd expect. So that's all well and good: we've used their pipeline straight out of the box and basically reproduced what they have on the inference API, but in our own runtime.

Now let's look at getting one of the raw implementations of the model, so we can do some more interesting things with it. From transformers we're going to import AutoTokenizer and AutoModelForSequenceClassification. That's another thing that's quite useful about the Hugging Face library: these auto classes, AutoTokenizer and AutoModel, basically infer the right model class and the right tokenizer to use based off the model name. You'll see here I've used the same model as before, passed in as a string, and I get the model back. I'll run that, and I'll also run AutoTokenizer to get a tokenizer, which will be based off the DistilBERT base uncased model. Instead of AutoTokenizer we could use the DistilBertTokenizer directly, and that's absolutely fine; it's the same thing, because the auto class is just choosing it for us automatically.

But what if we were to use the BertTokenizer instead? I'll run that again with the BertTokenizer, and what does it say? It warns: the tokenizer class you load from this checkpoint is not the same type as the class this function is called from, it may result in unexpected tokenization, and the tokenizer class you load from this checkpoint is DistilBertTokenizer. So this is just a warning saying you've got it wrong: the model up here is using DistilBERT.
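Pieced together, the notebook cells described here might look roughly like this; the variable names are my own, and the default checkpoint string is the SST-2 DistilBERT model the pipeline reports.

```python
# In the first Colab cell, install the library; the %%capture cell magic
# hides the pip installation output:
#   %%capture
#   !pip install transformers

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Built-in sentiment analysis pipeline; with no model specified it falls back
# to the SST-2 fine-tuned DistilBERT checkpoint discussed above.
classifier = pipeline("sentiment-analysis")

pos_text = "I love dogs"
neg_text = "I really hate dogs"
print(classifier([pos_text, neg_text]))
# Expect something like:
# [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]

# The same checkpoint as a raw model plus tokenizer, via the auto classes.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```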
You've tried to make a BERT tokenizer from a pretrained DistilBERT vocabulary, so that's not going to work. Well, it might run, but it's not going to work how you want it to. So let's change that back; actually, I'm just going to get rid of that and use the auto class.

Now let's look at the output of this tokenizer. I'm just going to put in my positive sentence, "I love dogs", and what do we get back? We get a list of five numbers, and this thing called an attention mask, also with five numbers. Let's break this down. We get special tokens at the start and the end of the sentence. Remember, "I love dogs" is three words, but we also get a start token and an end token (the [CLS] and [SEP] special tokens). In between, in this case, the mapping is one-to-one: the token id for "i" is 1045, the token id for "love" is 2293, and the token id for "dogs" is 6077.

Let's change this and look at the negative sentence, and you know what, let's do them side by side. OK, let's compare these. We already looked at the positive one up here, and now the negative one: "i" is the same, "really" obviously has no match, and "dogs" matches at the end. So for "I really hate dogs" we again have a one-to-one mapping between words and tokens.

But what about a word that's not so common? Maybe, I don't know, "I thoroughly hate dogs". I'll input that sentence and we'll see what the ids look like now. Did that go one-to-one? It did, OK, so that's not obscure enough. Let's choose something more obscure: "swimmingly". "I swimmingly hate dogs" doesn't really make sense, but there you go. Let's work out our input ids again. Here we go: now we've got something longer than before. "i" is the same, "hate" and "dogs" are at the end, but "swimmingly" has been split into two tokens.

What we can also do is convert the ids back to tokens. I can't remember the method off the top of my head... convert_tokens_to_string, convert_ids_to_tokens... that's the one. So tokenizer.convert_ids_to_tokens, we give it the list of our numbers, and we see what comes out. There we go: we get "swimming" and then "##ly", with that "##" prefix again, because the vocabulary of DistilBERT in this example doesn't contain a token for "swimmingly". The tokenizer breaks the word up into pieces it does know, in a greedy fashion: it had "swimming" in its vocabulary, and it had "##ly", which is a common ending. This can go all the way down to the character level, so if I put in an absolute mess of random characters here, you can see it reflected in a much longer output; it's split this random string up in a crazy fashion. And then, obviously, I've accidentally also included that last token, which was "hate".
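A rough reconstruction of that tokenizer exploration is below; the exact ids depend on the checkpoint's vocabulary, and the values shown in comments are the ones read out above, so treat them as indicative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# A short, common sentence maps one token per word, plus the special
# tokens at either end.
encoded = tokenizer("I love dogs")
print(encoded["input_ids"])       # e.g. [101, 1045, 2293, 6077, 102]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1]

# A rarer word gets split into subword pieces from the vocabulary.
rare = tokenizer("I swimmingly hate dogs")
print(tokenizer.convert_ids_to_tokens(rare["input_ids"]))
# e.g. ['[CLS]', 'i', 'swimming', '##ly', 'hate', 'dogs', '[SEP]']
```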
So anyway, let's just go back to what we had, "I really hate dogs". That was a little detour to explain how the tokenizers work.

Now, we've also got these things called attention masks. What are they? In this scenario I've tokenized the inputs separately, the positive and the negative sentence each on their own, so they have attention masks of slightly different lengths. Where they actually become useful is when you input multiple sentences into the model at once. Because of the way machine learning models work, you can't pass in variable lengths: you can't pass in a short sentence and a long sentence together and expect the model to just adapt, because all that's happening in the back end is large tensor operations, or matrix operations if you will. So what you actually do when using these models is pad your input to a certain maximum length. Standardly it's 512 tokens when the model is pretrained, but when you're retraining you'll probably want a shorter length, maybe 128 tokens, or perhaps you set the maximum length to the longest sentence in your dataset. And what do you do with the other, shorter sentences? You normally pad them with zeros, and the attention mask is a simple lookup that says, for each position: is this a real token or a padding token? That's used when the model computes attention and the loss, so the padding gets ignored.

Let's have a look at that now. I've created a little batch using our positive sentence and our negative sentence, and I'm saying: pad these, truncate them if they're too long, with a maximum length of 512, and return tensors for PyTorch, because that's the implementation we're using. If I wanted to, I could change that to "tf" and it would give me back TensorFlow tensors.

Now that we've done that, let's have a look. I printed the batch's attention mask, but in fact let's just look at the whole batch first. The batch returns one thing, a dictionary, and inside the dictionary we've got two things, input ids and attention masks, just like when we tokenized the sentences separately. Except you'll notice now that the first example has had a zero appended after the end-of-sequence special id, and its attention mask has a zero where that padding token sits. My longer sentence hasn't been changed, because it's the longest sentence in the batch, and its attention mask has no zeros in it, because all of its input is valid.

So now we're going to pass this batch into the model itself; we basically just unpack it into the model we created up here, and then have a look at the output. What have we got? We've got a thing called logits. Logits, if you don't know, are basically just the raw output of the model.
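Put together as code, that batching and forward pass might look like the sketch below (again with my own variable names, assuming the same SST-2 DistilBERT checkpoint):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

sentences = ["I love dogs", "I really hate dogs"]

# Pad the shorter sentence up to the longest one in the batch, truncate
# anything over 512 tokens, and return PyTorch tensors.
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])  # zeros mark the padding positions

# Unpack the dictionary into the model and read off the raw logits.
with torch.no_grad():
    outputs = model(**batch)
print(outputs.logits)

# Softmax turns the logits into per-class probabilities (negative, positive),
# which should line up with the scores the pipeline reported earlier.
predictions = F.softmax(outputs.logits, dim=-1)
print(predictions)
```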
Often, especially for sentence classification tasks, you're going to want to apply a softmax, or perhaps a sigmoid, but usually a softmax, to your logits, and that will amplify anything that's a strong signal and basically squash anything that's a weak signal. So now let's have a look at the predictions. We've got a large number here, 0.9997, for our first example, and then for our second one we have the opposite: a large value of about 0.996 on one side and a very small value on the other. And this matches what we had in the pipeline up here: our first example, which was marked positive, had a score of 0.9997 for the positive class, and now we've got back both the negative and the positive scores, with essentially the same value. If we go up and look at the negative score for the second example, it was 0.996, and again we have 0.996 down here. So we've got the raw outputs from the model, given these two sentences and their attention masks passed in together, and we've ended up with the same result as the sentiment analysis pipeline provided by Hugging Face itself.

So that wraps up a really basic run-through of how to use the Hugging Face models. In the next episode we're going to take a model and apply it to a more advanced task. That's basically going to involve taking a dataset and retraining a model from Hugging Face; we'll use the RoBERTa model or something like that and apply it to our own specific task. That will involve retraining the model and maybe adapting how it works at the end, because instead of classifying one thing like positive or negative, maybe we want to classify a bunch of things, maybe we have eight classes. So stay tuned for that in the next episode. In future episodes we're also going to look at some masked language modelling tasks, using a PyTorch implementation, retraining the models and applying them to our own tasks.

I hope this guide has been useful for you. Let me know if you have any questions in the comments down below. For the technical background details about BERT or RoBERTa, I'll try to include some links that I found useful when I was first trying to understand how the models actually work under the hood. As I said, this guide series is meant to be practically focused: it concentrates on implementing things with code rather than on the technical details of how the models work, because there are lots of good guides and lectures on how the models work in the back end, but I thought there was a bit of a lack of hands-on coding guides for getting up to speed with the library itself. So thanks for watching, and let me know if you enjoyed it. Cheers.
Info
Channel: Ru AI
Views: 565
Keywords: nlp, hugging face, bert, roberta, sentiment analysis, masked language modelling, python, pytorch, google colab, coding guide, educational series, transformers, transformer library, natural language processing, gpt, gpt-2, gpt-3
Id: DQc2Mi7BcuI
Length: 29min 52sec (1792 seconds)
Published: Wed Sep 15 2021