Deploy Llama 2 on AWS SageMaker using DLC (Deep Learning Containers)

Video Statistics and Information

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we are going to deploy a large language model, Llama 2, on AWS SageMaker. We'll take the model that's available on Hugging Face and use AWS SageMaker to deploy it. Once it's deployed, we'll also create a Lambda function, and probably an API Gateway, or a direct function call from that Lambda function in one of our applications. So it's going to be a couple of videos, a playlist, a series: we'll start with this deployment and then build some cool applications that leverage the deployed model.

I'm going to use something called DLC, you can see it on screen: Deep Learning Containers, built in partnership between AWS and Hugging Face. They have the DLC set up for you, so you don't have to put in a lot of effort when deploying Transformers-based models, or several of the other models available on Hugging Face. With Deep Learning Containers you utilize pre-installed deep learning frameworks and libraries such as Transformers and Datasets: a few lines of code, select an EC2 machine, deploy it, and use it wherever you want. It's available through the SageMaker Hugging Face classes; I'll show you when I'm writing the code.

Remember, DLC is not the only way of deploying a Hugging Face model on AWS. You can do it manually: download all the model weight files, put them in an S3 bucket, configure everything yourself, write the model functions, and deploy using the Hugging Face SageMaker class. There are many ways of deploying these models, to be honest, but I find this the easiest one.

Let's see what we're going to do here, guys. You can see I'm currently on my AWS console; this is not an AWS video, by the way, but I'll try to cover a few things. From "Recently visited" I'll go inside SageMaker, one of the flagship AWS offerings when it comes to AI and machine learning. On the SageMaker page you can set up domains, for example, but I'm more interested in notebooks, so I'll go to Notebook instances. There are a lot of other things here: Studio and Studio Lab, and JumpStart, where they have foundation models — all these large foundation models such as Falcon 40B, MPT, etc. There's also one of their recent offerings, Amazon Bedrock, with the Titan models; they want to compete with GCP and Microsoft Azure because they feel they've been left behind on generative AI, and they're trying to catch up with all the recent developments in the AWS ecosystem, which I appreciate. So we have Notebook instances, and there's Inference, which we'll visit later once we deploy the model using DLC. Let me go inside Notebook instances.
I'm going to create a notebook instance here. If you're deploying a large language model through DLC, you can take the smallest machine just to run the notebook, because for the deployment itself we're going to pass an instance type that actually supports the model. I'm going to call it "llama-2-deployment" or something like that. You can see I have my IAM role assigned; if you're using SageMaker for the first time, you have to assign a role and attach inline policies, for example S3 full access and SageMaker full access, or choose exactly what you want to attach depending on how you want to configure it. So make sure you have an IAM role with the right policies attached. Now let's click Create notebook instance, and you can see it says the notebook instance is being created. It will take a little time, so let it load.

As I said, we're going to deploy Llama 2, the new version of Llama by Meta AI: a commercially usable, open-source large language model trained on around two trillion tokens and supporting context windows of up to 4K tokens. Llama was a breakthrough, and Llama 2 has really been helping researchers, academia, and enterprises build solutions on open-source large language models. Meta fine-tuned Llama 2 with reinforcement learning from human feedback on over one million human annotations. I'm not going to cover Llama 2 itself in depth; there are multiple videos on it already.

While the instance starts, I'm going to go inside my gists on GitHub and take a few things from there, since we're going to write a fair amount of code. Let me also go to Hugging Face: I'll go into Settings and click Access Tokens, because these are gated models, and for gated models you need an API key to authenticate with the Hugging Face Hub. I'm not going to type all this code from scratch because it's fairly standard boilerplate; you can also find it in the SageMaker documentation.

The first thing we need is a sagemaker SDK version of at least 2.175.0. And the DLC we're going to use is powered by TGI, text generation inference; I hope you're aware of it: a scalable, optimized solution for serving large language models. If you want to deploy one, go ahead and consider it for your workloads.
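Before moving on, here's a minimal sketch of that setup boilerplate, assuming it runs inside the SageMaker notebook instance (the fallback role name is a placeholder, not something from the video):

```python
# Upgrade the SDK first (run in a notebook cell):
# !pip install "sagemaker>=2.175.0" --upgrade --quiet

import boto3
import sagemaker

sess = sagemaker.Session()
try:
    # Inside a SageMaker notebook, the attached execution role is picked up.
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look the role up by name via IAM (placeholder name).
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```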
For this we need a virtual machine, an EC2 instance. Let me go to the Amazon SageMaker pricing page and show you the minimum specs to deploy this 7B model; we're deploying Llama 2 7B here. We need at least ml.g5.2xlarge, which has one GPU, to deploy a 7B model on SageMaker. Roughly, the hardware requirements are:

- 7B: ml.g5.2xlarge (1 GPU)
- 13B: ml.g5.12xlarge (4 GPUs)
- 70B: two options — ml.g5.48xlarge if you use bitsandbytes quantization to deploy on a somewhat more affordable machine, or ml.p4d.24xlarge if you want to run it raw, which will be really costly.

Anyway, let me refresh; I don't know why the notebook instance is taking so long to start. Once it's up, I'll click Open JupyterLab, which opens JupyterLab for the instance; you could also use the classic Jupyter Notebook if you prefer. I'll go back to the gist (sagemaker_role.py); the code will also be available in the GitHub repository, so you can take it from there for your own deployment. I do have a couple of earlier videos on deploying large language models through DLC, where I deployed LaMini and a couple of other models; you can find them in the "Deploy AI Models" series on this channel. You'll need an AWS account to follow along. Let me pause the video here and resume once the machine is running.

OK, now I'm inside JupyterLab, in a notebook called llama-2-deployment.ipynb. First I'll copy the boilerplate, upgrade the sagemaker SDK with pip, and run the role/session snippet sketched above: it uses boto3, the AWS SDK for Python, looks up your role via iam.get_role, and prints your role ARN and session region.

After that, we call get_huggingface_llm_image_uri with the right backend and version, which retrieves the LLM image URI from the Deep Learning Containers. You can see it has printed out, and the URI contains "pytorch-tgi-inference": as I said, text generation inference, so we're using the PyTorch TGI inference image here.
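A sketch of that retrieval; the backend string is the documented one for TGI images, but the pinned version is my assumption of what was current at recording time, so check for a newer release:

```python
# Retrieve the Hugging Face TGI (text-generation-inference) DLC image URI.
from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(
    "huggingface",    # the TGI-backed LLM image family
    version="0.9.3",  # assumed version; omit to get the latest available
)
print(f"llm image uri: {llm_image}")
```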
Now here is where we have to make a few changes, so let me copy the configuration cell, paste it, and show you what I'm changing. You can see we import json and, from sagemaker.huggingface, the HuggingFaceModel class. We set the instance type: ml.g5.2xlarge, the EC2 instance we chose. It has one GPU, so the number of GPUs has to be 1. Then there's a health check timeout, which is how long SageMaker waits for the container to come up healthy. Then we have the configuration for the model and endpoint parameters. The template uses Llama-2-70b-chat-hf, and of course we're not using 70B, so I'll replace it with Llama-2-7b-chat-hf. The number of GPUs, wrapped in json.dumps, is 1. Then we have MAX_INPUT_LENGTH, the maximum length of the input text, which is 2048; MAX_TOTAL_TOKENS, the maximum generation length including the input text, so both input and response, which goes up to 4096; and MAX_BATCH_TOTAL_TOKENS, a limit on parallel processing while generating responses, which goes up to 8192. Then it says HUGGING_FACE_HUB_TOKEN, and that's where the Hugging Face access token goes: let me copy one of mine and replace the placeholder with it. I'll also remove the bitsandbytes quantization setting; that's only worth keeping for 70B if you go with the 48xlarge machine. Then there's a check that the token is set, an assertion: assert that config's HUGGING_FACE_HUB_TOKEN is not equal to the placeholder, with the message "please set your Hugging Face Hub token", so if we forget, we'll get an error telling us to set it. Let's run this; it looks OK.

Now back to the GitHub gist: we still need the deployment module. From it we take the HuggingFaceModel usage: we create a variable called llm_model that instantiates that class with a role (mine is my genai role) and an image URI, where the image URI is nothing but the llm_image we retrieved above, and we pass env=config: the model name, the right number of GPUs, and all the other parameters.

Deployment can take time; let's run deploy and see if we're able to deploy this. You can see it has started.
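Putting the configuration and deployment together, a sketch along the lines of what's described above (the token is a placeholder you must replace with your own gated-access token; the health check timeout value is an assumption):

```python
import json

from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.2xlarge"  # 1 GPU, enough for the 7B chat model
number_of_gpu = 1
health_check_timeout = 300       # seconds to wait for the container to pull weights

# TGI environment configuration for the endpoint.
config = {
    "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",
    "SM_NUM_GPUS": json.dumps(number_of_gpu),
    "MAX_INPUT_LENGTH": json.dumps(2048),        # max length of the input text
    "MAX_TOTAL_TOKENS": json.dumps(4096),        # input + generated tokens
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),  # parallel-processing budget
    "HUGGING_FACE_HUB_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

# Check that the gated-model token was actually set.
assert config["HUGGING_FACE_HUB_TOKEN"] != "<REPLACE WITH YOUR TOKEN>", \
    "Please set your Hugging Face Hub token"

# Create the model object from the DLC image and deploy it to an endpoint.
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
```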
Of course, it will take time, and it may also throw an error asking you to increase your service quota. In that case you have to reach out to AWS Support and create a case, basically a ticket, on the Support Center in the console, and tell them what use case you're currently working on. You can see my deployment has started; there's a progress bar, a loader of sorts. If you do face quota problems, reach out to Support, describe your use case, and follow up a couple of times to get the quota increased; they grant it to people they believe are working on legitimate use cases, whether that's a Llama 2 deployment or any other compute-heavy instance. If you're part of an enterprise you probably have the quota already, but for individuals, or people in academia or at a university doing hobby projects, you'll need that quota increase, and it takes time. So I'll pause the video here and come back once the deployment is complete.

Fantastic, you can see the model has been deployed. Once you see "successful", it means the model is up. Let's see what we can do with it. This is a conversational chat model, so it chats through a particular prompt structure. The prompt opens with the [INST] tag, then a <<SYS>> block — everything in caps — where you write your system prompt, which you close with <</SYS>>; then the user messages go in, and it ends with the closing [/INST]. So we have to build that prompt with a function. I'm going to copy the build function; we've taken it from the SageMaker blog, so let's paste it here.

What it does: it's called build_llama2_prompt, and you can see the start prompt and the end prompt, so it opens the [INST] block and closes it. There's a lot you can add here, like negative prompting and different kinds of flags. For instance, most people say that when they use Llama 2 they run into generating duplicated tokens, repeating words; you can take care of that with a couple of inference flags (we'll use repetition_penalty below). Now just run it. This build_llama2_prompt function shapes the way you interact with the large language model, Llama 2 in this case: we build the prompt, and what we return is the start prompt, plus the joined conversation, plus the end prompt.
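For reference, a sketch of that prompt builder, adapted from the SageMaker blog post the video copies from:

```python
def build_llama2_prompt(messages):
    """Fold a list of {"role", "content"} messages into Llama 2's chat format."""
    start_prompt = "<s>[INST] "
    end_prompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            # Assistant turns close the current [INST] block and open a new one.
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")
    return start_prompt + "".join(conversation) + end_prompt
```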
Now let me actually use it. For example, an instruction: "What to do when I am free?", question mark; that's our instruction. Then we do messages.append, since messages is defined in the cell above, and it takes key-value pairs: one key is role, which here is user, and then content, where the content is your instruction. Then we have the prompt: we use the function we defined above, build_llama2_prompt, and pass it the messages. Then let's have a chat variable using llm.predict (the endpoint has been created; I'll show you the endpoint in the SageMaker dashboard later), and we pass it a key-value pair: "inputs", which is nothing but the prompt we built. Then I print the chat: the response format has a generated_text field, and we slice off the length of the prompt, so we print only the newly generated text. Let's run it and see if we get a response. You can see it generated something; it says "Oh wow, I am so glad you asked..."
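A sketch of that first call; the system prompt text is a stand-in of mine, since the video only notes that messages is defined in an earlier cell:

```python
# Seed the conversation with a system message (content assumed for illustration).
messages = [{"role": "system", "content": "You are a helpful assistant."}]

instruction = "What to do when I am free?"
messages.append({"role": "user", "content": instruction})

prompt = build_llama2_prompt(messages)

# Call the deployed TGI endpoint through the predictor returned by deploy().
chat = llm.predict({"inputs": prompt})

# Strip the echoed prompt and print only the newly generated text.
print(chat[0]["generated_text"][len(prompt):])
```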
Now you'll want to run inference again with different parameters to influence the generation, because the default might not be what you're looking for; the parameters can play a huge role. We have temperature, penalties, sampling, and so on, as we discussed. So let's create a payload, as you'll have seen in most AWS SageMaker GitHub repositories or blog posts: a dictionary. The first key is inputs, which is nothing but the prompt. Then come all our inference parameters (not hyperparameters) in a nested parameters dictionary; you can find all of them in the TGI Swagger specification — I can put that link in the description, and it's also on the GitHub repository. First, do_sample, true or false; let's set it to true. Then top_p at 0.6. Then temperature, for randomness and creativity; let's go a bit higher so the model can be a little creative, whereas if you want your LLM to be more deterministic you'd keep this at 0.1 or 0.2 or so. Then top_k, let's keep a value of 50. Then max_new_tokens, which plays a huge role: let's give it 512 for now, the maximum number of new tokens. Then a penalty on the tokens being generated, repetition_penalty; let's set it to 1.03, which is pretty standard. And stop: we've already seen above what we stop with, the end-of-sequence token from the prompt format. I think we're OK now: the payload has inputs, the prompt, and parameters with do_sample true, max_new_tokens, top_k, and the rest; everything looks fine.

Let's run this, then send the payload to the endpoint we've created: response = llm.predict(payload), using the predict function from the SageMaker predictor. Then I print the response the same way as before: the generated_text, sliced by the length of the prompt. Fantastic. You can see it says: "Oh wow, I am so glad you asked! There are so many fun things to do on vacation, depending on your interests and the location you are in. If you are looking for something adventurous, have you considered trying water sports like snorkeling and kayaking? It's a great way to explore the underwater world." Fantastic, right? We got a good response from this deployment.
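A consolidated sketch of that payload and call; every value follows the video except temperature, which is only described as "higher", and the stop sequence, which I infer from the prompt format:

```python
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,          # assumed "higher" value; 0.1-0.2 = more deterministic
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,  # discourages repeated tokens/words
        "stop": ["</s>"],            # assumed end-of-sequence stop token
    },
}

response = llm.predict(payload)
print(response[0]["generated_text"][len(prompt):])
```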
Now let me go back to SageMaker and open Notebook instances again; it will probably take a few moments. We have successfully deployed a large language model, Llama 2, on AWS, and we're going to use that deployed model in a few of our applications; I'm going to work with embeddings, and we'll see how we can create some cool apps with this. If you come here you can see the notebook instance, llama-2-deployment, which says InService; I took an ml.t3.medium for the notebook, but we deployed the model on a g5.2xlarge. Now go to Inference and click on Endpoints: you'll see a couple of endpoints that failed (in my case, because of my Hugging Face token) and the endpoint that succeeded. Let me click on the one that succeeded: you can see pytorch-tgi-inference, PyTorch text generation inference, InService, real-time. You can find all of your model container logs, plus CPU utilization, memory utilization, disk utilization, and so on; you can track everything here, along with model latency.

So this is fantastic: we have deployed a Llama 2 model here, guys. The next step is to create a Lambda function for it, increase the Lambda function's timeout, and then use it in an application. But this part is done: you've deployed a model without manually writing a lot of code — without downloading the model weights, creating an inference.py, putting it in S3, and shipping it all together, which is a somewhat tedious process. With DLC, that's what I wanted to show in this one. Next I'll create a Lambda function and we'll use it in a few of our applications as well.

That's all for this video, guys. I hope you liked the content and got a basic understanding of how you can use DLC, Deep Learning Containers, to deploy a large language model easily. In the coming videos we'll utilize this model in a few of our applications. If you liked the video or learned something, please like it, share it with your friends and peers, share the channel as well, and if you haven't subscribed yet, please do. That's all; thank you so much for watching, see you in the next one, guys.
Info
Channel: AI Anytime
Views: 8,437
Keywords: llama 2, doctorgpt, llama2, siraj raval, langchain, aws, dlc, deployment, deploy llama 2, huggingface, meta ai, llm, generative ai
Id: rQq1m2aJ_fk
Length: 31min 11sec (1871 seconds)
Published: Thu Aug 17 2023