HuggingFace Fundamentals with LLMs such as TinyLlama and Mistral 7B

Video Statistics and Information

Captions
Hey, welcome back. As you know, I've been trying to train my own large language model from scratch, and today I'm finally going to start showing you how I've been doing that. Before we can get on to training a model, we really need to understand how a large language model works, so today I'm going to present a simplified reference model that I've been using to describe an LLM, and as the videos go on we'll delve into each layer. By the end of this you should understand how an LLM works, so when you start reading things like papers you'll understand the terms, and you'll actually be able to build your own model. You're probably thinking to yourself that you'll never build your own model, but maybe one day you'll decide to fine-tune one, and having this deep understanding of how the models work is really going to help you. Even just interacting with AI and large language models, this knowledge is super important.

To get started today, I'm going to present that reference model, we're going to have a little play with the Hugging Face Transformers library, and we're going to look at things like model configuration and model architecture at a high level, enough to get you familiar and interacting with it. Then at the end we're going to have a little bit of fun with tokenizers and see what happens if I take a tokenizer from something like Llama and apply it to Mistral.

Before we look at that reference model, we're going to use the Transformers library from Hugging Face just to load and run a model, nothing more than that, and then we'll start to break down a little of what's going on. We're going to use Google Colab for this.
Go to colab.research.google.com and click on New Notebook. Don't worry, they give you a free tier, so you can do some compute on it for free. To get started, I'm just going to paste in a piece of code that uses the Hugging Face Transformers library, and the model I'm going to use is TinyLlama. It's a 1-billion-parameter open-source model, small enough to run in the free tier of Google Colab, so we can do a lot of fun experimentation with it, and it's a very capable model. In this code we get the tokenizer for the model (I'll explain what that does a little later), we load the model itself (we'll deep-dive into how that works later too), we create a pipeline for text generation, passing in the tokenizer and the model, and then we ask it the question "Who is Ada Lovelace?" and print out the answer.

To make that run, I just select my runtime type. CPU is probably enough in this case; I'm going to set it to high RAM, though I don't think we actually need it here. When we're using other models like Mistral, you may well need a high-RAM runtime. Once that's set, I click Connect and Colab creates a runtime on the back end. If I hit play, it walks through the code: the first thing it does is download the tokenizer, then it downloads the model. The model is quite big, a couple of gigabytes, so it takes a little while, and once it's downloaded it can do the inference.

As you can see, it came back with "Who is Ada Lovelace?" and then "Ada Lovelace was a mathematician and writer who made a significant contribution to the field of computer science. She is the daughter of Charles Babbage." That's completely wrong: she's not the daughter of Charles Babbage, she worked with Charles Babbage, and she was actually the daughter of Lord Byron. But that's beside the point; you can see it comes back with a reasonably intelligent-looking answer, albeit wrong.
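For reference, the cell I pasted in looks roughly like this. It's a minimal sketch rather than the exact code from the video; the checkpoint name (TinyLlama/TinyLlama-1.1B-Chat-v1.0) and the generation settings are assumptions.

```python
# Minimal sketch of the Colab cell described above (checkpoint name and
# generation settings are assumptions, not the exact code from the video).
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)     # downloads the tokenizer files from the Hub
model = AutoModelForCausalLM.from_pretrained(model_name)  # downloads the model weights (a couple of GB)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = generator("Who is Ada Lovelace?", max_new_tokens=100)
print(result[0]["generated_text"])
```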
Now that we've been able to load and interact with the model, let's break down what this reference model looks like. At the very bottom you can see the question "Who is Ada Lovelace?"; that's our input text. At the top you see the output text, "Ada Lovelace is the daughter of..." and so on, which is the final result. The first thing to say is that those two layers, the input text and the output text, are never seen by the model. The model knows nothing of those words; it deals with embeddings, so it's actually looking at a tokenized representation of those words. Everything the model deals with is essentially numbers, and it has a vocabulary. To ask who Ada Lovelace is, we need to take that question and tokenize it, i.e. convert it to numbers first, and then pass those numbers into the embeddings layer of the LLM, which we'll come back to in a second. Eventually, once the LLM has processed this, it returns its output as tokens, which you can see in the tokenized output layer (23255 is the token for "Ada", for example), and we run that back through the tokenizer again, which translates "Ada Lovelace is..." from its tokenized form back into English.

So if we come back to our original code and look at the imports, you can see from transformers import AutoTokenizer, the model name is TinyLlama, and then this line: tokenizer = AutoTokenizer.from_pretrained(model_name). That is actually retrieving the tokenizer from Hugging Face, and it is what's responsible for doing the tokenization. Under the hood, the pipeline takes the phrase "Who is Ada Lovelace?", encodes it into its tokenized form, and when it gets a result it decodes it again, without us ever seeing the encoding and decoding. If we wanted to, we could do that ourselves. If I type from transformers import AutoTokenizer, set the model name to the TinyLlama 1B Chat model (we'll look at that model on Hugging Face in a few minutes), set tokenizer = AutoTokenizer.from_pretrained(model_name), and print the tokenizer, you can see it comes back saying it's using LlamaTokenizerFast, it's associated with TinyLlama (that's where it pulled it from), it has a vocab size of 32,000 (we'll talk about that in a second), and it's a fast tokenizer. What do we mean by a fast tokenizer? It's really simple: a fast tokenizer is one written in Rust; if is_fast is false, it's just a standard Python tokenizer. That's all it means; somebody has written a Rust tokenizer under the hood so it runs a little faster. There are a lot of other details about this tokenizer, covering padding and truncation and the special tokens: beginning-of-sequence tokens, end-of-sequence tokens, unknown tokens, padding tokens. We'll go through those in a later video, but this shows that if you want to get into the details of the tokenizer, you can.
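A minimal sketch of that tokenizer experiment, under the same checkpoint assumption as before:

```python
from transformers import AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer)             # LlamaTokenizerFast, with its special tokens and padding/truncation settings
print(tokenizer.is_fast)     # True, i.e. backed by the Rust tokenizers library
print(tokenizer.vocab_size)  # 32000
```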
Now, if you're thinking to yourself, hang on, how does AutoTokenizer know which tokenizer to use for this model, I'm going to show you. Go to https://huggingface.co and paste in the name of the model, TinyLlama 1B Chat, and you'll end up on the model's Hugging Face page. Hugging Face is really just like a kind of GitHub repository for large language models, so you can go and explore it and find lots and lots of other models, whether it's Meta's Llama or Falcon or whatever, but today we're looking at TinyLlama. If I click on Files for a second, you can see a whole load of different files, and one of them is called tokenizer_config.json. If we select it, you can see the tokens decoder, the beginning-of-sequence token, the end-of-sequence token, and so on, and if we come back to Google Colab you can see it's the same data that's being displayed there. It doesn't take a genius to realise what's happening under the hood: it's just reading this config file from Hugging Face, essentially downloading it and reading it. What's particularly cool is that there's more data available here too: tokenizer.model, the actual model they used to train the tokenizer, has been uploaded by TinyLlama, which is very cool, and tokenizer.json is the dictionary file for the tokens, so if you want to see how everything is mapped together you can download tokenizer.json and look at it.

If I wanted to, I could encode the prompt myself. I type prompt = "Who is Ada Lovelace?", which is exactly the question we asked a second ago, create a new variable called encoded_prompt, call tokenizer.encode, pass in the prompt, and then print out both the prompt and the encoded version. You see "Who is Ada Lovelace?" and it comes back with something like 1, 11644, 338, 23255 and so on, and if I come back to my reference model for a second you can see the same numbers there; it exactly matches the tokenized input in the reference model. If you're curious, 1 stands for beginning of sequence, 11644 is "who", 338 is "is", 23255 is "Ada", and so on. That is our tokenized form, and if I come back to my original inference code, this is what's happening under the hood inside that pipeline function. If we really wanted to see the exact encoding, we could just do a for loop over the encoded prompt, call tokenizer.decode on each token, and print the result. If we run that, you can see exactly what I said: 1 is the beginning of sequence, then "who", "is", "Ada", then the pieces that make up "Lovelace", and finally the question mark. The tokenizer obviously doesn't have the word "Lovelace" in its vocabulary, because it's not a frequently used word, so it's using subwords to make up the whole word; in this case three tokens make up "Lovelace". That's down to the restricted vocabulary size: the Llama tokenizer has 32,000 tokens in its dictionary and "Lovelace" isn't one of them. This is the idea of subword tokenization, which lets you keep your vocabulary size small, and in the next video I'll explain exactly why you want vocabulary sizes to stay small and how that affects things like the embedding layer in our reference model. And again, that decode process happens under the hood inside the pipeline; it isn't part of the model, but it is happening in that pipeline function. So if I come back to the reference model for a second, we've now covered the input text "Who is Ada Lovelace?", the tokenized input, the tokenized output, and the output text "Ada Lovelace is...". All of that is handled by the tokenizer; it does not happen in the large language model.
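Something like the following sketch reproduces that encode-and-decode-each-token experiment. The checkpoint name is an assumption, and the exact ids you see will depend on the tokenizer files you download.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompt = "Who is Ada Lovelace?"

# Encode the prompt into token ids, the same thing pipeline() does for us under the hood.
encoded_prompt = tokenizer.encode(prompt)
print(prompt)
print(encoded_prompt)  # e.g. [1, 11644, 338, ...]; 1 is the <s> beginning-of-sequence token

# Decode each id on its own to see the subword pieces, including the several
# tokens that together make up "Lovelace".
for token_id in encoded_prompt:
    print(token_id, tokenizer.decode([token_id]))
```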
Now, because tokenizer_config.json is the thing that AutoTokenizer is reading, if you don't know which tokenizer a model uses, you can just look that model up on Hugging Face and check. In fact, it's the tokenizer_class attribute that tells you which tokenizer is being used; in this case it's a Llama tokenizer. If I want, I can get rid of AutoTokenizer, use LlamaTokenizer in the import and in the from_pretrained call instead, and run it again, and you'll see it continues to work, now reporting LlamaTokenizer. In this particular case it says is_fast is equal to false, because I've used the standard LlamaTokenizer, so it's using the Python implementation. Similarly, because this tokenizer supports the fast version, I could use LlamaTokenizerFast instead, run it again, and now is_fast is true. So AutoTokenizer is just using the tokenizer config to figure out which tokenizer to load. Funnily enough, the model itself has no clue about tokenizers; they're completely abstracted from the model, which only knows about its vocabulary, held in the embeddings layer, and I'll cover that in the next video.

Now, to prove this works, I can put in a completely different model name. In this case I'm going to use the very popular Mistral 7B model, mistralai/Mistral-7B-v0.1. Again, you can take that name, go to Hugging Face, look it up, click Files and versions, and you'll find the same tokenizer_config.json. As you can see, it's actually using a Llama tokenizer as well, which is really cool: Mistral has reused the Llama tokenizer, but interestingly its dictionary is completely different. So although it's using the same tokenizer architecture, different things map to different numbers, and we can prove that too. If I run this, note that I'm not using AutoTokenizer; I'm using LlamaTokenizerFast, but instead of TinyLlama I'm using Mistral 7B as the model name, and it continues to work. It encodes "Who is Ada Lovelace?", but the numbers are completely different from the Llama tokenizer's: same tokenizer architecture, different dictionary, so the numbers map to different tokens.

Now, if I want to do something completely bonkers: because Mistral uses the Llama tokenizer, I could take the tokenizer from the TinyLlama model and pass its output into Mistral. It will generate complete gibberish, but because they're compatible architectures it will still run. To prove that, I pull in mistralai/Mistral-7B-v0.1 as the model and introduce a separate tokenizer model name, and rather than using Mistral for the tokenizer, which would be the sensible thing to do, I pass in TinyLlama as the tokenizer model. So I'm using the TinyLlama tokenizer but Mistral as the model, and I ask the same question, "Who is Ada Lovelace?". This will run, but it will come back with complete gibberish, because "Who is Ada Lovelace?" maps to completely different numbers from what the model is expecting: rather than 11644 for "who" and 338 for "is", it's actually going to get 6526 for "who" and 349 for "is". I have no idea what those map to in Mistral's version of the Llama tokenizer dictionary, but either way it will come back with gibberish, and yet it will work, because both sides use a 32,000-entry dictionary, so the token ids are still valid for the embeddings layer; it just happens that we're passing through nonsense.
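A sketch of that deliberately mismatched setup might look like this. The checkpoint names are assumptions, and Mistral 7B will want a high-RAM (or GPU) runtime rather than the free CPU tier.

```python
# Deliberate mismatch: TinyLlama's token dictionary driving Mistral's weights.
from transformers import LlamaTokenizerFast, AutoModelForCausalLM, pipeline

tokenizer_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # where the tokenizer comes from
model_name = "mistralai/Mistral-7B-v0.1"                     # the model that runs the inference

tokenizer = LlamaTokenizerFast.from_pretrained(tokenizer_model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Who is Ada Lovelace?", max_new_tokens=50)[0]["generated_text"])
# This runs, because both vocabularies have 32,000 entries so every id is valid,
# but the ids point at the wrong entries in Mistral's dictionary, so the output
# is gibberish.
```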
And there you go: as you can see, it came back with complete gibberish, but it certainly did come back and make a prediction. If I want a sensible output from Mistral, I just set the tokenizer model name back to Mistral 7B and run it again, and it comes back with a sensible answer.

I think this is quite important, because as we start to build our own large language model there's a decision point: do I want to use an existing large language model architecture? In the case of Mistral, they decided to reuse the Llama architecture but made a decision to have a different token dictionary from Llama. You can understand that, because you'll have different influences as you build these models; you may want to handle different languages, or handle something like Rust or C differently from how another model does, but you still want to reuse the architecture, i.e. you don't want to rebuild your own tokenizer, which is a lot of effort. So that's a decision point you make as you build these models. Mistral decided to reuse the Llama tokenizer but have its own dictionary; TinyLlama chose to reuse both the tokenizer and the tokenizer dictionary, so it has full compatibility with Llama, and that opens up interesting scenarios for the future if you want to do things like merging and transfer, which we'll cover in another video.

I think there are some other advantages to reusing an existing tokenizer. If you look at what went on with Mistral just this weekend, there was a leak of an older Mistral model, I think a 70-billion-parameter one, where they took the existing Llama 2 70B model and fine-tuned it with their data while they were doing the pre-training of their own base model; once the pre-training was working, they put the data they'd used for fine-tuning on top to create their own model. That's an interesting approach, because it means you can go pretty far with fine-tuning and then, as you build your own pre-trained model in the background, swap it in, and you'll have more consistency if you're using the same tokenizer and architecture throughout. That may be an approach we take as we build our own large language model: use a fine-tuned version of an existing model while we train our own base model in the background.

As you can see, it's now come back with the response: I'm using Mistral's own token dictionary and the Mistral model itself, and when I ask who Ada Lovelace is, it comes back with a sensible answer. So I think we've covered tokenization quite well at a high level, but now I want to get you a little bit familiar with model configuration and model architecture. If you want to discover the model configuration of your model, you can again use the Transformers library to help you.
If I do from transformers import AutoConfig, set my model name (we'll use TinyLlama again, just because it's a smaller model and easier to work with), set config = AutoConfig.from_pretrained(model_name), and print the config, it comes back fairly quickly with a LlamaConfig. You can see it's TinyLlama and that the architecture is LlamaForCausalLM; we'll talk about that in a second. You can also see the beginning-of-sequence token ID, which in this case is 1. Remember when we did the inference earlier, you saw that number 1 and the <s> tag? That's the beginning of sequence, and it's called out in the model config. There's also an end-of-sequence token, token 2, which didn't appear in that particular output. There's a whole set of other settings that we'll get into in other videos, but the thing I want to point out is vocab_size: 32,000. That is the embeddings layer we were talking about, the vocabulary length. The tokenizer reports 32,000 tokens in every one of the Llama-tokenizer cases, whether it's Mistral, TinyLlama or Llama 2, and that's where that value feeds in. Again, we'll go into more detail about embeddings in another video, but you can access the config like this.

If you're curious where the config comes from, go back to the Files section and look at config.json for TinyLlama. You can see the architecture, LlamaForCausalLM, and all the information I just showed you, so it's really just reading that Llama config file from the Hugging Face repository. Of course, I could go to Mistral instead, look at the 7B model, go into Files, and click on its config.json. There you can see the architecture, in this case MistralForCausalLM, and the same vocab size, which is why I was able to do that trick of feeding the TinyLlama vocabulary into the Mistral model and have it run, even though it generated gibberish. There are similar beginning-of-sequence and end-of-sequence tokens and a whole lot of other settings that we'll get to in later videos, but you can explore all of that. And of course, if I change the model name to Mistral in the code, it comes back with Mistral's details, but let's set it back to TinyLlama so we can see Llama's details.
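A minimal sketch of that configuration check, assuming the same TinyLlama checkpoint as before:

```python
from transformers import AutoConfig

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

config = AutoConfig.from_pretrained(model_name)  # really just reads config.json from the Hub
print(config)               # LlamaConfig, architectures=['LlamaForCausalLM'], ...
print(config.vocab_size)    # 32000, the size of the embeddings layer
print(config.bos_token_id)  # 1, the <s> beginning-of-sequence token
print(config.eos_token_id)  # 2, the end-of-sequence token
```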
One of the things I want you to notice in that config is LlamaForCausalLM. If you're wondering what a causal LM is, it just stands for causal language model, and a causal language model basically means a model that does next-token prediction. All large language models are next-token prediction models: when I type something like "Who is Ada Lovelace?", it predicts that the next token is going to be "Ada"; then it looks at "Who is Ada Lovelace? Ada" and predicts that the next token is "is"; and so on, until eventually you get to the full answer. It's just predicting the next token over and over again, and that's what causal LM means: a next-token prediction model.

If I want to delve into the architecture in particular, not just the config of the model, I can do from transformers import AutoModelForCausalLM. You can probably work out what's going on: AutoModelForCausalLM is really just going to go and look up Hugging Face and find the model architecture. We set the model name to TinyLlama, do model = AutoModelForCausalLM.from_pretrained(model_name), and print the model, and it comes back with the model architecture; you can see the type is LlamaForCausalLM. You can probably work out the next step too: if I change AutoModelForCausalLM to LlamaForCausalLM, the same class name that came back in the config, and run it again, it comes back with the exact same answer. The auto model is just looking up config.json, figuring out what the architecture is, in this case LlamaForCausalLM, and then using that to load the model. That's really all the auto model is doing; I'm just short-circuiting it because I already know the architecture.

In the printed response you can see it's a Llama model. There's the embeddings layer I keep talking about, which is where the vocabulary lives (I'll cover that in the next video); then all of the attention layers, the Llama decoder layers, which are the Transformer layers; and then an output layer at the bottom. If I come back to my reference model for a second, the output layer is the output I showed you before, the attention layers are those 22 or so layers of Transformer blocks in TinyLlama, and the embeddings layer is that vocabulary of 32,000. That is what the large language model is. If you want to go and look at any of the open-source models on Hugging Face, you can literally inspect the architecture and the config using those commands, and it doesn't matter which one: you can do this for Mistral, Llama, Falcon, TinyLlama or Pythia, whichever you want.
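As a sketch, inspecting the architecture looks something like this, again assuming the TinyLlama checkpoint; AutoModelForCausalLM and LlamaForCausalLM end up loading the same thing.

```python
from transformers import AutoModelForCausalLM, LlamaForCausalLM

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# AutoModelForCausalLM reads config.json, sees "LlamaForCausalLM", and loads that class...
model = AutoModelForCausalLM.from_pretrained(model_name)

# ...so loading the concrete class directly gives the same result.
model = LlamaForCausalLM.from_pretrained(model_name)

# The printout shows the embed_tokens embeddings layer (32,000 entries), the stack
# of LlamaDecoderLayer attention blocks, and the lm_head output layer.
print(model)
```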
Now, finally, if I come back to my inference code for a second, I hope you're starting to see that we've pretty much explained everything in it apart from pipeline. We set our model name; the tokenizer line loads the tokenizer from the Hugging Face repository; AutoModelForCausalLM looks the model up on Hugging Face, reads its config, and loads the model. In fact, if we look in the Files section again, the actual model itself, all of the weights and so on (we'll talk about that in another video), is stored in there too: you can see model.safetensors files, 9.94 GB and 4.54 GB in the case of Mistral. That is what's actually being downloaded and loaded; it's all stored on Hugging Face. Then we pass in the model we're using and the tokenizer, and under the hood the pipeline does all the encoding and decoding, feeds the tokens into the model, runs the inference, and comes back with the response when you call the generator. That is essentially what's going on.

As you can see from the reference model, we've now got a good explanation of what tokenization is and what the large language model is doing architecturally: there's the output layer, the embeddings layer, which is really the bit you're interacting with, and the attention layers in between, and in future videos I'll explain exactly what's going on inside those layers as well. Hopefully this has got you really familiar with tokenization, with what large language models are doing, and with the Hugging Face Transformers classes. That matters for us because, as we start to build our own models, if we want to be compatible with Hugging Face we're going to have to build or reuse tokenizers that work with that library, and the same with the models; we'll need to understand what those model configs are and how other people have been building their models. And when you start to read people's papers, you'll understand what all of these layers are doing. Anyway, I hope this has been a useful video and that it's given you a bit of an understanding of what's going on under the hood. As we work through future videos we'll get more into the details of what's happening at each layer, and on that note, I'll catch you in the next video.
Info
Channel: Chris Hay
Views: 2,766
Keywords: chris hay, chrishayuk, hugging face tutorial, generative ai, large language model, hugging face tutorial python, large language models project, generative ai explained, large language model architecture, hugging face, artificial intelligence, mistral 7b, mistral ai, llama 2, how to train llm, artificial intelligence course, mistral 7b huggingface
Id: bypzqJgK6BU
Length: 30min 29sec (1829 seconds)
Published: Mon Feb 05 2024