Meta's LLAMA 3 with Hugging Face - Hands-on Guide | Generative AI | LLAMA 3 | LLM

Video Statistics and Information

Captions
Hello everyone, I'm Siddhardhan. Meta has launched its next generation of LLMs, called Llama 3, and a lot of people are discussing its potential to compete with industry-standard models like GPT-4, Gemini Pro, and other LLMs. The best thing about the Llama family is that Meta open-sourced these models, so we can use them for commercial purposes as well as research. In this video, let's understand how we can access the Llama 3 model and build a text generation system with it in Google Colab. That is what we will cover today, so let's get started.

As I said, we will be using Hugging Face, so the first thing you need is a Hugging Face account if you don't already have one. Hugging Face hosts a lot of models — you can access Llama and Llama 2 there — and they have now added Llama 3. Llama 3 comes in two sizes, 8 billion and 70 billion parameters; the previous family, Llama 2, came in 7 billion, 13 billion, and 70 billion. Within the 8B and 70B sizes there are also two variations: the base model and the instruct model. I'll explain all of these as we go.

To get started, sign up for Hugging Face if you don't have an account. Then, in the search bar (you can even do this before signing up), search for "Llama 3". You will see Meta-Llama-3-8B near the top, along with the 70B Instruct, 8B Instruct, and so on. Let's go with the base 8B model first. On its page you'll see that you need to agree to share your contact information to access the model; there are some terms and conditions to accept, and it is completely free — you don't need to pay anything. Scroll down and you will see the option to log in or sign up; once you do, you can accept the terms and conditions, and it will say your request is being reviewed. After an hour or a couple of hours you will be granted access; if you refresh the page you will see that you now have access, and you will also get an email at the address you used for your Hugging Face account. Once access is granted, the page shows "Gated model: you have been granted access to this model", and you can start working with it.

Because this is a gated model, we need to provide the access token from our Hugging Face account. We will be loading the model from Hugging Face inside a Colab environment, so we need to put the access token there to authenticate that this is the account through which I'm accessing the model and that I have accepted the terms and conditions. So now let's see how to work with it. The first things to do are to install a handful of libraries and to save the HF token, the Hugging Face access token; let's discuss how to do that first. We also need a GPU runtime.
To get the GPU, go to Runtime, then Change runtime type, and select the T4 GPU. If you have Colab Pro you get access to more powerful GPUs, but that is not required here; the T4 is enough. The 8-billion-parameter model is about 15 to 16 GB, and even that won't load as-is in Colab, because Colab only offers about 15 GB of GPU RAM. So we will use a process called quantization, which I'll explain shortly; that's also why we need the GPU setup. Select the T4 runtime and click Connect.

Then go to the Files section. Once you are connected, we are going to create two files: requirements.txt, to list the libraries we need, and config.json. Right-click in the Files panel, choose the "New file" option, rename the untitled file to requirements.txt, press Enter to save it, and double-click it to open it. Here we list the required libraries; we need three.

The first library is accelerate — you can look it up on PyPI, the Python package index — which is used as part of the quantization setup. Quantization simply means reducing the precision of the model so that, say, the 15 GB model is no longer 15 GB but around 6 GB. It is a trade-off between accuracy and the memory we save, but we can't load a 15 GB model at all if we don't have that much compute, so that is the deal. The second is bitsandbytes; search for it on Google, and any recent version should be fine. If you have a laptop or PC with a larger GPU you can do all of this in a local Jupyter notebook too, but most of us may not have that, which is why I'm doing it on Colab; and if you work in an organization and can use its AWS SageMaker notebook instances, that gives you access to larger machines, so that works as well. The third is transformers, the Hugging Face library; I'll copy the current version, 4.40, and pin it here. Those are the three libraries we need to install — if you are working on something else and need other libraries, add them too. Save the file, then in a code cell use an exclamation mark to run a shell command: !pip install -r requirements.txt. This installs all the libraries listed in the requirements file.

The next step is creating the config.json file and putting your access token in it. There are a few ways to do this: you can use a .env file so the access token lives in an environment variable, or you can create a JSON file, load it, and pass the HF token from there. I'll create another file and call it config.json. The data should be in the form of a dictionary, so double-click the empty file and add a key-value pair whose key is HF_TOKEN; the value will be the access token from my Hugging Face account.
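As a concrete sketch, the two files might look like the following. Only the transformers 4.40 pin comes from the walkthrough; the other packages are left unpinned, the HF_TOKEN key name is simply the one used throughout this write-up, and the token value is a placeholder.

```text
# requirements.txt  (installed later with: !pip install -r requirements.txt)
accelerate
bitsandbytes
transformers==4.40.0
```

And config.json holds a single key-value pair:

```json
{
  "HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxxxxxx"
}
```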
To get the token, go to your Hugging Face account. On the right side you'll see your profile icon; click it, go to Settings, and under Settings you will find Access Tokens. There you can copy your access token and paste it into the config file. I had already created an access token, but I'll delete it and show you how to create one from scratch; I'll also delete the new one after posting this video, so make sure you use your own token — the one shown here won't work because I'll have deleted it. These tokens are confidential, so you shouldn't share them publicly, as Hugging Face itself points out; I'm only showing mine for the purposes of this tutorial. In the Access Tokens section of the Settings tab, click "New token" and name it something like "llama 3 getting started" — the name isn't important, you can give any name — and choose the permissions; for this task, read permission is enough. Click "Generate a token", then copy it (it will say "copied"). Come back to the HF_TOKEN entry in config.json, put double quotes around the value, paste your access token there, save, and close config.json. Later we will load this config JSON into a variable.

Now we can import the libraries. Installing them may take some time, so wait for that to finish; once everything is installed, let's write the code for importing, configuring the Transformers / Hugging Face library, and then loading the model with quantization. First I'll label the previous part "installing the dependencies", and now we move on to the imports. I'll write import json — we need json to read the config file and later pass the HF token to the Transformers library — and import torch, which we need to set the data type for quantization. Then from transformers import several things; I'll put them inside parentheses, like a tuple, which is a cleaner way to import when you have five or six classes or methods to bring in. The first thing we need is AutoTokenizer, from which we will load the tokenizer of the Llama 3 model; then AutoModelForCausalLM, which loads the language model itself; then BitsAndBytesConfig — pay close attention to the upper-case and lower-case letters — to which we will pass the quantization configuration; and finally pipeline, with which we will create a text-generation pipeline. Let's run this. Okay, I think I made a mistake here: I typed "casual" instead of "causal" in the model class, so I get an import-name error; let me check it quickly.
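Collected into one cell (with the causal/casual spelling already corrected, as discussed next), the import cell might look roughly like this:

```python
import json   # to read the access token from config.json
import torch  # needed for the compute dtype used during quantization

from transformers import (
    AutoTokenizer,          # loads the Llama 3 tokenizer
    AutoModelForCausalLM,   # loads the causal language model itself
    BitsAndBytesConfig,     # carries the 4-bit quantization settings
    pipeline,               # builds the text-generation pipeline
)
```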
Sorry about that — it should be "causal", not "casual"; rename it to AutoModelForCausalLM. Let's run it again; it was just a typo, and now all the required classes and methods are imported.

The next thing is configuring the Hugging Face account, so I'll create another text cell and call it "HF (Hugging Face) account configuration". Here let's load the config data. I'm creating a variable called config_data and assigning json.load(open("config.json")); what this does is open the config.json file, and json.load converts the JSON into a dictionary and stores it in the variable. Then you can pull the token out with HF_TOKEN = config_data["HF_TOKEN"]. Run this; it reads the file and saves the token. I could copy the token and paste it directly into the variable and show it to you that way, but that is not recommended at all: you shouldn't put API keys or access tokens directly in your code, because if you share the code others can see them. I'm only showing this for learning purposes.

Next we give the name of the model. Call the variable model_name, or model_id, which is also widely used. We need to copy the exact name, so go back to Hugging Face, search for the Meta Llama 3 8B model, copy the name, and paste it here: "meta-llama/Meta-Llama-3-8B". Run this.

The next step is creating the configuration for quantization, so I'll add a text cell called "quantization configuration". As I said, the purpose of quantization is that the LLMs we work with — whether GPT-4, Llama 2, Llama 3, or any other — are pretty big models, so we shrink them. The way this works is by changing the precision of the weights: all models, deep learning or otherwise, have parameters (weights), and these are normally stored at higher precision, such as 32-bit floats; we have the ability to use 16-bit or even 4-bit weights instead, and that is what we call quantization. Of course there is a trade-off, because we are reducing the size, but the loss in performance or accuracy won't be that big, and it lets us experiment with LLMs even without a very large GPU. Look at the resources panel: the GPU RAM is 15 GB, and if you load the LLM at full precision and run even one prediction, the GPU will fill up and your Colab session will terminate. We don't want that to happen, so we quantize the model before loading it into the GPU, and then it only takes about 5.5 or 6 gigabytes of GPU RAM.
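The account-configuration and model-name cells just described might look like this; the key name matches the config.json sketched earlier:

```python
# Read the Hugging Face access token from the local config.json
config_data = json.load(open("config.json"))
HF_TOKEN = config_data["HF_TOKEN"]

# Gated base model used in this walkthrough
model_name = "meta-llama/Meta-Llama-3-8B"
```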
For the quantization settings we create a variable called bnb_config and assign BitsAndBytesConfig, the class we imported, with a few parameters. The first is load_in_4bit=True. The next is bnb_4bit_use_double_quant=True; double quantization tells bitsandbytes that I want to quantize the model but I don't want to lose a lot of information, so the model keeps performing well — it limits how much information and ability the model loses as we reduce the precision. Then there is bnb_4bit_quant_type, which selects one of the available quantization schemes; we set it to "nf4", which is the type we'll use. Finally there is bnb_4bit_compute_dtype, the compute data type, and this is where torch comes in: I'll set it to torch.bfloat16, a 16-bit floating-point format. Run this. We will pass this bnb_config, the bitsandbytes quantization configuration, when we load the model; that is its whole purpose.

Once that is done, the next step is loading the tokenizer and the model, so let's add a heading like "loading the tokenizer and the LLM" — call it the LLM or the model, whichever you prefer. I'll create a variable called tokenizer. I hope you know what a tokenizer is: tokenizers are used in NLP to convert the raw input string into the format the model needs. Every model family has its own tokenizer, and for Llama 3 we have to use the tokenizer trained for Llama 3. We use the AutoTokenizer class we imported: tokenizer = AutoTokenizer.from_pretrained(...), since this is a pretrained model, and we pass the model name. If you wanted to use a Llama 2 model instead, you would just look up that model's name; Llama 2 also has a chat variation that performs well in chat scenarios, so you can try that model too — you only have to change the model name, everything else stays the same. So I'll pass the model_name I copied, and the next parameter is the access token: token=HF_TOKEN, which we loaded earlier. Then I set tokenizer.pad_token = tokenizer.eos_token, and let's run this. You can see the tokenizer, about 9.8 MB, has been downloaded, and it reports that special tokens have been added to the vocabulary, and so on. That is how you load this or any other pretrained tokenizer: just give its name. Note that not all models require the HF token — the Llama 3 models do because they are gated, but for models that aren't gated you can leave it out.
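As a sketch of those two cells, building on the imports and variables above:

```python
# 4-bit quantization settings, passed to the model loader later
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # double quantization to limit information loss
    bnb_4bit_quant_type="nf4",              # NF4 quantization scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # 16-bit bfloat compute dtype
)

# Tokenizer matching the Llama 3 checkpoints (gated repo, hence the token)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
```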
Now, about that tokenizer.pad_token = tokenizer.eos_token line — it's an interesting detail. The pad token is the padding token and EOS means the end-of-sequence token. When you pass a batch of inputs, all the sentences or data points processed by the tokenizer should have the same length, so shorter sequences get extra padding added before they are passed through. The EOS (end-of-sequence) token adds a marker saying "this is the end of the sentence, process only up to this point", so setting the pad token to the EOS token is a clean way to handle that.

The next step is to load the LLM itself, the same way we loaded the tokenizer from AutoTokenizer, but this time from AutoModelForCausalLM: model = AutoModelForCausalLM.from_pretrained(...). We have to pass a few parameters. The first is the model name, the Meta Llama 3 model. Then device_map, which is very important: it tells Transformers whether to load the model on the CPU or the GPU. I'll set device_map="auto", which means that if a GPU is available it will automatically allocate space on the GPU and load the model there; alternatively you can specify "cuda", or the CPU, which would load the model into system RAM. This is crucial when you work with larger machines. Say you are on a SageMaker notebook instance with eight GPUs — something like a g5.48xlarge, a pretty large machine with about 192 GB of GPU RAM. The 70-billion-parameter model would be somewhere around 150 to 200 GB, and it would be distributed across all those GPUs; with device_map="auto", the model is automatically spread over whatever GPUs and memory are available, and you can also configure a custom placement if you want. Here, device_map="auto" simply loads the model into my single GPU's memory. The next parameter is quantization_config: if you don't pass this, the model loads at its default precision, and a 15 GB model will fill your 15 GB of GPU RAM — which we don't want, given our limited GPU — so pass the bnb_config we created above. Finally, pass your token as well: token=HF_TOKEN. Now let's run this.

This will download the model, and you can see its size: there are four shards in total, and the first shard is about 4.98 GB. All four shards are downloaded first, and if you watch the resources panel you'll see them being written to disk — we have about 78.2 GB of disk available. And remember, we are only working with the 8-billion model; the 70-billion model is more than 150 GB, so even the disk would fill up, which makes it impractical to work with in Google Colab — for that you of course need a larger machine.
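The model-loading cell, roughly as described — device_map="auto" places the weights on the T4 and quantization_config keeps the footprint to around 6 GB:

```python
# Download the four shards and load the 4-bit quantized model onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",               # place the weights on the available GPU automatically
    quantization_config=bnb_config,  # 4-bit settings defined above
    token=HF_TOKEN,                  # required because the repository is gated
)
```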
As the download proceeds you'll see the disk usage increase, and once the model has downloaded, the next step is loading it into the GPU; you will see the GPU RAM fill up once all the safetensors shards have finished downloading. The first shard is 4.98 GB and the next is about 5 GB, with two more shards after that, and as you can see it is downloading at a very good pace. In a local Jupyter notebook you probably wouldn't get speeds this high; it's fast because it's running in Colab, and locally the download might take a lot of time. The other option is that Meta also provides access to the weights through its own repository, so you can download the model from there and load it yourself — you can try that out too.

While we wait for the download to complete, let's set up the remaining pieces. I'm going to create a pipeline — that's why I imported pipeline earlier — specifically a text-generation pipeline. I'll write text_generator = pipeline(...), using the pipeline function imported from the Transformers library. The first parameter is the kind of pipeline you want; there are many different pipelines available in Transformers, such as a question-answering pipeline or a text-classification pipeline, but we are interested in text generation, so that is what we use. You can Google "Transformers pipeline" to find the full list — text classification, audio classification, automatic speech recognition, and so on — and pick whichever matches your task, but we are focusing on text generation. In the meantime the shards have downloaded and it is now loading the checkpoint shards; if you open the resources panel, the GPU starts filling as the downloaded model is loaded into GPU RAM. You can see the total downloaded size is roughly 4.98 + 5 GB plus the remaining shards, a bit more than 16 GB in total, but once the quantized model is loaded it occupies only about 5.6 GB of GPU RAM (it may increase slightly). That is exactly what quantization bought us.

Now we can finish the pipeline: in text_generator, pass model=model and tokenizer=tokenizer — we have already loaded both, we just hand them to the respective parameters of pipeline. Then max_new_tokens, which is roughly the length of the output you expect; by default I think it returns only about 20 tokens, so if you want a longer response use a larger number such as 256 or 512 — that's completely fine as well. Let's run this. Oh, I get an error — sorry, the task name should be "text-generation", not "text-generator". Run it again, and the text-generation pipeline is created.
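The finished pipeline cell might look like this; 256 is one of the max_new_tokens values suggested above:

```python
# Text-generation pipeline wrapping the quantized model and its tokenizer
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # length of the generated continuation; raise it for longer answers
)
```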
Next I'm going to create a function called get_response(prompt), where prompt is the text the user sends. Inside it, sequences = text_generator(prompt) — text_generator is the pipeline we just created, and we pass the prompt into it — then gen_text = sequences[0]["generated_text"], because that is the shape the model's response comes back in, and finally return gen_text. When you pass the prompt to the text-generation pipeline, which wraps the LLM, the Llama 3 model, it returns not a dictionary but a list; we take the first element of that list and read its "generated_text" field, and that is the response. I'll show you how the raw output looks in a moment and then explain the function again.

Let's say the prompt the user sends is "What is machine learning?" — that is the question the model should answer. I'll call text_generator with this prompt — oh sorry, I have to run the cell first — store the result in a variable called response, and print it to see the output. This may take a few seconds. The other thing to note is that this is the base model, not yet optimized for chat, so if you ask a question it may well not give a very polished answer. Here we see: "What is machine learning? Machine learning is a branch of artificial intelligence that deals with the development of algorithms..." and so on. Sometimes the answer won't come out this cleanly, because, as I said, this model is not fine-tuned for chat. There is also the other variation I mentioned, instruct, which is fine-tuned to follow instructions better; that is a better fit if you are working on a RAG-style task — say you have a document or a context and you want the model to answer from that context — where the instruct model would probably work better than a chat model. The point I'm making is: don't expect very polished conversational answers, because this is a plain text-generation model, not a chat-based one; for Llama 2 there is a chat model, as we've seen, and you can use that if your application is entirely chat-based. Still, this output looks fine.

So keep all of that in mind. Here is how the output is structured: I get a list, and in this response I look at the first element; that first element is a dictionary, and from that dictionary I want only the "generated_text" key. So: take the first element, get the value of the "generated_text" key, and that returns the "What is machine learning..." text — which is exactly why we wrote the function this way. I'll move the prompt cell down and delete the raw call. Now we can simply use get_response and pass the prompt: I'll create a variable — call it llama3_response — set it to get_response(prompt), and then print llama3_response, which gives you the output the model has generated.
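A sketch of the helper function and the example prompt from the walkthrough:

```python
def get_response(prompt):
    # The pipeline returns a list of dicts; take the first candidate's text
    sequences = text_generator(prompt)
    gen_text = sequences[0]["generated_text"]
    return gen_text


prompt = "What is machine learning?"
llama3_response = get_response(prompt)
print(llama3_response)
```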
You can just run that, and if you don't want the question the user asked to be echoed at the start of the output, you can slice it off: take the response from index len(prompt) onwards, which removes the prompt we sent and displays only the generated continuation. So that is how you can use the Llama 3 model loaded from Hugging Face and get a response out of it.

You can also deploy this: there is a package called Gradio that lets you expose this function as an API — you just pass the function to Gradio and you get an API endpoint; you can look up some references, or maybe I'll make a video on that later. You could then build a Streamlit app, and once the user types a prompt, pass it to the endpoint, get the response back, and display it there. But this is how you access your Llama 3 model and work with it.

I hope everything is clear up to this point; here is a quick recap so it's really clear. First, sign up to Hugging Face and accept the terms and conditions; you will receive a mail, and the model page will show that you have been granted access to the gated model. Once you have access, go to your account as I showed you, get the access token, and put it in a config.json file; also create requirements.txt with the proper libraries and version numbers. Then install the libraries, import the required classes, configure the Hugging Face account with the token, and set the model name. Next comes quantization, which is the critical part here: reducing the precision of the model so we can work with a smaller model size without losing much performance or efficiency. Then load the tokenizer with AutoTokenizer, load the LLM with the model name, device_map to place it on the GPU, and so on, and finally create the text-generation pipeline, pass a prompt, and get a response from it.

I hope this is useful; please try it out — it really isn't complex at all. If you are working on chat or text-generation applications, you can definitely try this model, the important reason being that it is an open-source model that aims to compete with GPT-4 and Gemini Pro. If you go with GPT-4, there will always be API costs attached; but if you have a large enough machine you can host the Llama 3 model yourself and work with it. That's it from my side, and I hope it was useful. Subscribe to my channel for more videos on AI, ML, and generative AI, and I'll see you in the next upload. Thanks for watching.
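As a final sketch, the prompt-stripping step mentioned above is just a slice, using the llama3_response and prompt variables from the previous cell:

```python
# Drop the echoed prompt so only the model's continuation is printed
print(llama3_response[len(prompt):])
```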
Info
Channel: Siddhardhan
Views: 9,846
Keywords: Meta LLAMA 3, Hugging Face tutorial, Generative AI, AI language model, LLAMA 3 setup, Hugging Face API, AI programming, Tech tutorials, AI development, Language model integration, Machine learning, Deep learning, AI technology, Programming tutorial, AI model access, AI research, Open source AI, Advanced AI, AI tools, Tech education
Id: J7afRW5XEb4
Length: 36min 54sec (2214 seconds)
Published: Sun Apr 21 2024