Deploy LLMs (Large Language Models) on AWS SageMaker using DLC

Video Statistics and Information

Captions
Hello everyone, welcome to the AI Anytime channel. In today's video we are going to see how we can deploy a large language model on AWS SageMaker. If you are not familiar with it, SageMaker is a machine learning platform that helps you create, train, and deploy machine learning models; AWS launched it five or six years ago. We are going to leverage SageMaker's capabilities to deploy a large language model, because most of the time you'll be working on some cloud platform, and AWS is one of the most prominent hyperscalers that a lot of clients use. Eventually we'll also see how to leverage GCP Vertex AI and Azure Machine Learning to deploy large language models in coming videos. I'm trying to create a series on the hyperscalers and how these models can be deployed for your clients and for enterprises: if you are building an enterprise-grade solution, how do you spin up an instance, deploy the model, create a Lambda function, and invoke it through an API endpoint, end to end, so you can integrate it into an application? That's what we are going to do here.

Pretty much any model that's available on Hugging Face you will be able to deploy on SageMaker. There are around 14 to 15 supported base models, plus their variants, like quantized models, on top of those base models. AWS has a fairly new offering called DLC (Deep Learning Containers) that works with Hugging Face models: AWS has built a set of container images in ECR, and you can use them to deploy a Hugging Face model in a few minutes. That's what we're going to explore in this video: we'll take a model from Hugging Face, spin up an instance on SageMaker, deploy the model, and then invoke it through Lambda.

Let's get started. You can see I'm currently on the AWS console (console.aws.amazon.com). I'm in the Ohio region; you can pick another one, so let's use North Virginia, for example. That's how you switch regions; you can see the list: Oregon, California, and several in Asia. If you use a particular region, your resources will be located in that region's data centers. You can see this is my personal account, and I have some services pinned here: EC2, S3, RDS, Lambda. This is my first video on any of the hyperscalers, but I will create some videos on Azure and GCP as well. The first thing we have to do is search for SageMaker, and you can see 'Amazon SageMaker: build, train, and deploy machine learning models.'
That's what we are going to do in this video. Let me click on SageMaker and open it. By the way, when you use a hyperscaler you'll probably use a lot of services regularly, S3 buckets for example; you can star those services on the console so you don't have to search every time, which basically works like a bookmark. Now the landing page says 'Amazon SageMaker: build, train, and deploy machine learning models at scale', with documentation on how it works and a lot of other things.

On the left-hand side we are interested in Notebook, because we'll spin up a notebook instance, start writing code, deploy the model, and see how the DLC helps us on the inference side by creating an endpoint right from SageMaker. So let me click on Notebook instances. There are currently no resources, so let's create a notebook instance. When you create one, you'll see different instance types; let's use a compute-optimized one, ml.c5.2xlarge. You can look up 'ml.c5.2xlarge' on the Amazon SageMaker pricing page to see what it costs. This instance has 8 vCPUs and 16 GB of memory, which I feel is enough because I'm taking a small model just for this demo; I'm not going to use Falcon or MPT or LLaMA. I'll take a model that's smaller in parameters and show you the process end to end; the process remains the same, you just have to change the model on the backend when you try different models. If you really want to run Falcon or MPT, look at other machines, mainly the G4dn and G5 families; ml.g4dn.12xlarge is very popular, and check the GPU count when you want to use 4 GPUs.

So let's use ml.c5.2xlarge and give the instance a name: 'llm-experimentation'. This video will help you if you're at a workplace where they ask you to deploy models on AWS or any other cloud hyperscaler. For the IAM role, I'll create a new role; for the S3 bucket, any S3 bucket is fine, or you can specify one. You can see the role has been created (you can also give it a custom name). Now the notebook instance is being created; it will take a little time, the status shows 'Pending', and our instance is named llm-experimentation.

Meanwhile, this is the model we are going to take: LaMini-T5-738M, which is a fine-tuned version of T5-Large.
The model card says it's a fine-tuned version of T5. T5 is a very underrated language model; I believe T5 was created by Google, please let me know if I'm wrong there. The reason I'm using this model (my last two videos have also been on LaMini-T5-738M) is that it works well for a lot of use cases: text-to-code generation, instruction following, and it also works well for question answering and summarization; honestly, among the smaller language models, it's one of the best I've seen for question answering and summarization. If you're interested, you can go through those videos: in one we build a question answering tool over embeddings stored in Chroma (fine-tuning rather than training from scratch), and in the other we build a chatbot with streamlit-chat. Consider watching them if you want to create something similar.

Back to the notebook instance: you can click 'View details' to see that it's still pending, and there are other options there too, like connecting it to repositories. Let me go back to Notebook instances; it's pending, and LaMini is the model we're going to serve.

On most Hugging Face model pages, when you click 'Deploy' you'll find several options. If you pick Amazon SageMaker, you'll see the deploy code, which uses the SageMaker SDK; that's essentially what we're going to use in this video, but with a few changes, because we'll use the Hugging Face LLM image URI. That's what we're going to explore. You can also deploy Falcon-7B this way: click Deploy > Amazon SageMaker and it works the same way; your SageMaker instance just has to be on the higher side because it's a compute-heavy model, so you need a bigger machine, but it works perfectly fine. I'm going to focus on the LaMini model and see how I can create an API endpoint I can use in my application.

Back on Notebook instances, you can see the status is now 'InService', so I'll open JupyterLab and start writing code. We'll first check that we can run the LaMini model locally and that it works fine, and then we'll deploy it. It takes a little time to open; you can see we're in us-east-1, on the SageMaker /lab URL. I'll put the SageMaker pricing link in the description, but you can also find it on the internet. So now I'm in JupyterLab.
For the kernel I prefer the default conda PyTorch notebook, which has Python 3.10. In my experience, 3.10 is the best version to work with language models: over my last few months of experimentation, both individually and at the workplace for clients, I've found that Python 3.10 works best, because we have to work with libraries like accelerate, bitsandbytes, and safetensors, which support 3.10, and most of the time everything works fine. So I'm using the conda Python 3.10 kernel, which has torch installed by default. Let me rename the notebook to 'llm-trial' or something like that. Then a couple of tweaks: in Settings I'll switch to a dark theme and increase the code font size a couple of times (you could also set this up through nbextensions, but I won't do that here).

Let's first install the libraries we need: transformers, einops, accelerate, and bitsandbytes. So, pip install transformers einops accelerate bitsandbytes; accelerate in particular will help us run inference on bigger models. You can see it's installing, and the kernel status shows busy; if you've worked with Jupyter notebooks before, this is exactly the same thing, just a Jupyter notebook running inside the SageMaker instance.

Now let's import everything we need. From transformers, import AutoTokenizer and AutoModelForSeq2SeqLM; that's the first thing. Next is the pipeline, because we're going to use the text2text-generation pipeline, so from transformers import pipeline. Then import torch, and we need base64 as well, so let's add that.
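As a reference, a minimal sketch of the install-and-import cell described above (notebook syntax for the install line):

```python
# Notebook cell: install the libraries used in this walkthrough
!pip install transformers einops accelerate bitsandbytes

import base64

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
```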
When I first ran the imports I got 'invalid syntax' at AutoTokenizer, because I'd forgotten a comma; fixed, and now the imports work. Next line: the checkpoint. The checkpoint is nothing but the Hugging Face repository from which I'll pull the model weights, i.e., the LaMini-T5-738M repo. With the checkpoint defined, let's write the tokenizer and base model. The tokenizer is AutoTokenizer.from_pretrained(checkpoint); you can see from_pretrained come up in the autocomplete, and the tokenizer is done. Our base model is the LaMini model, loaded with AutoModelForSeq2SeqLM.from_pretrained, passing the checkpoint, a device map, and a torch dtype. I'll set device_map='auto', which will automatically offload some of the weights; you can also use device_map='cuda' if you have the compute power and a CUDA device, or device_map='cpu' (I have eight vCPUs here, so that would work too; feel free to explore). Then torch_dtype=torch.float32; you could use float16, but here I'm trying to save some compute. So: checkpoint, device_map='auto', torch_dtype=torch.float32. This will take a few minutes because it has to pull the weights; you can see it's downloading the PyTorch model, which is around 2.95 GB. Let's let it download.
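Roughly, the loading cell looks like this (a sketch; the repo ID shown is the public LaMini-T5-738M repository):

```python
checkpoint = "MBZUAI/LaMini-T5-738M"  # public repo ID for LaMini-T5-738M

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint,
    device_map="auto",          # offload weights automatically; "cpu" or "cuda" also work
    torch_dtype=torch.float32,  # float16 saves memory if your hardware supports it
)
```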
While that downloads, let's write the function. I'm going to use an llm_pipeline function from one of my earlier videos where I used LaMini; I don't want to write a lot of code here, since we're focusing on deployment. Let me grab it from that project's app.py and paste it in. You can see what it does: it builds a text2text-generation pipeline over the local model and returns it. If you want summarization instead, you can use the summarization pipeline, because LaMini works with that as well; so if you want an application that generates summaries from your documents or text data, you can do that too. So the llm_pipeline function is defined.

Next, an input prompt: 'Write an article on artificial intelligence.' It's not much of a prompt; I'm not using prompting techniques like zero-shot or chain-of-thought, but you can if you want. Now let's use it: model = llm_pipeline() assigns the pipeline to the model variable, and generated_text = model(input_prompt) will hold the text generated by the model. Let's run it and see whether we get an error or a response... and it says 'HuggingFacePipeline is not defined'. Right: the function I pasted wraps the pipeline with LangChain's HuggingFacePipeline. For deployment we don't need that, but I'm loading it through the LangChain wrapper because I think it will help us create chains later, so just for this video I'll quickly install it: pip install langchain, wait for the install, and then add the import after it: from langchain.llms import HuggingFacePipeline (mind the capitalization; I had a typo there at first).
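Put together, a sketch of the fixed cell. The generation settings inside llm_pipeline are illustrative, since the exact app.py from the earlier video isn't shown, and the import path matches the LangChain releases of mid-2023:

```python
!pip install langchain

from langchain.llms import HuggingFacePipeline

def llm_pipeline():
    # text2text-generation pipeline over the locally loaded LaMini model
    pipe = pipeline(
        "text2text-generation",
        model=base_model,
        tokenizer=tokenizer,
        max_length=512,   # illustrative settings
        do_sample=True,
        temperature=0.3,
        top_p=0.95,
    )
    # Wrap with LangChain so the model can later be used in chains
    return HuggingFacePipeline(pipeline=pipe)

input_prompt = "Write an article on artificial intelligence."
model = llm_pipeline()
generated_text = model(input_prompt)
print(generated_text)
```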
Now this should work; let's see. The reason I'm also bringing in the LangChain element is that in coming videos (I'm working with Azure Machine Learning too, and right now you're seeing AWS) we'll look at how to use a framework like this with LLMs on the hyperscalers. Deploying a model is easy because they give you the code; how to use these frameworks on top is the agenda for future videos. And you can see we got a fantastic response, an article: 'Artificial intelligence is a rapidly growing field that aims to create machines that can perform tasks...' and so on. So we have a response; that's what I wanted to show. Now let's see how to deploy it.

For deployment, the first thing you have to do is update the SageMaker SDK version. I'll uninstall it and reinstall it: pip uninstall -y sagemaker, then pip install sagemaker==2.168.0. We need this specific version for deployment with the DLC and the image URI: most of the time when I tried it with other versions, I got errors during deployment, and the issue was with the Hugging Face LLM image URI. You can see it's uninstalled; now we install sagemaker again, pinned to 2.168.0.

Now let's take the code snippet and see what we need to change. On the model's SageMaker deploy page, click Copy and paste the snippet into the notebook. You can see what it does: it imports json, sagemaker, and boto3, and then the interesting part, get_huggingface_llm_image_uri; after that come the execution role, the model, the client, and so on.

Let's define a few things. The model ID is already set, because we took the snippet from that Hugging Face repository. We're not going to use any GPUs for sharding, so I'll remove the SM_NUM_GPUS entry; we could keep it, but it's not required. Then, in the Hub model configuration, i.e., how the container will load the model: HF_TASK is the Hugging Face task, which is 'text2text-generation'; device_map is 'auto', as before; and torch_dtype is torch.float32. So the Hub configuration covers the model ID, the task, the device map, and the torch dtype.
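After those edits, the first part of the deployment cell looks roughly like this (a sketch: the IAM fallback role name is the generated snippet's default, and the repo ID is the public LaMini repository):

```python
!pip uninstall -y sagemaker
!pip install sagemaker==2.168.0

import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub configuration: how the DLC container loads the model
hub = {
    "HF_MODEL_ID": "MBZUAI/LaMini-T5-738M",
    "HF_TASK": "text2text-generation",
    "device_map": "auto",
    "torch_dtype": "torch.float32",
}
```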
Further down in the snippet you'll see the Hugging Face model definition with get_huggingface_llm_image_uri and its version pins; that's completely fine, we won't change it. For the predictor, the snippet uses a G5 instance; let's change it to G4dn, which has a CUDA GPU. Just for your information: if you want to deploy through this DLC, the new SageMaker feature for Hugging Face models, you need a CUDA device; it can be any CUDA-capable instance, but the backend needs one to handle the computation. I'm going with G4dn, and I won't use 2xlarge, just xlarge, so the instance type is ml.g4dn.xlarge. There's also a startup health check timeout of 300 seconds; that's fine, I'll leave it. For the input, let's replace the sample with, say, 'Write a short article on blockchain.' predictor.predict is what sends the request, and huggingface_model.deploy is what deploys the model: initial instance count 1, instance type ml.g4dn.xlarge, the container startup health check timeout, plus the image URI and the hub config. This one call configures the endpoint and automatically creates it for you; in the console under Inference you'll find two things, Endpoint configurations and Endpoints, and both get created. Afterwards you can create a Lambda function to use and test the endpoint.

Let's run it and see if we get any errors. It's running; it will take a little time, of course, because it's creating the infrastructure for you. You don't have to do anything: it's created automatically, and your model endpoint is created too. You can do the same for Falcon, MPT, or any other supported model: for example MPT-7B by MosaicML (MosaicML has been acquired by Databricks, I hope you're aware of that news), or StarCoder; click Deploy > SageMaker and deploy it the same way. You can see a loader now, which means the model is being deployed; the endpoint will be available soon, but it takes a little while, so I'll pause the video and come back once it's done.

...And we're back: we got the response, and the generated text is 'Blockchain is a decentralized digital ledger that records transactions in a secure and transparent manner.' The model has been deployed. Go back to the console and open Endpoints: you'll see the model endpoint right there, named something like 'huggingface-pytorch-tgi-inference' plus a creation timestamp.
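And here's a sketch of the rest of the deployment cell after those changes (the container version pin comes from the generated snippet and may differ for you):

```python
# Retrieve the Hugging Face LLM DLC image
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

huggingface_model = HuggingFaceModel(
    image_uri=llm_image,
    env=hub,
    role=role,
)

# deploy() creates both the endpoint configuration and the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # the LLM DLC needs a CUDA-capable instance
    container_startup_health_check_timeout=300,
)

# First request; without generation parameters the reply will be short
predictor.predict({"inputs": "Write a short article on blockchain."})
```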
If you click on the endpoint, it shows more details: the ARN, the creation time, monitoring, the endpoint URL, and so on. So we have an endpoint now. Looking at the response, you might say: hey, I only got one line. That's because we have to pass the generation parameters to get a longer response, so let's do that. First I'll define a prompt, the same one: 'Write a short article on blockchain.' Then I'll create a payload with the generation hyperparameters. In the payload, the first key is 'inputs', which is simply the prompt; then 'parameters', which is itself a dictionary. In parameters we set: do_sample, a boolean, to True (these are key-value pairs, so it's a colon, not an equals sign); top_p to 0.7; temperature, which controls creativity and randomness, i.e., how creative a response you want from the language model, is between 0 and 1, and I'll keep it low at 0.3; top_k, an integer, to 50; max_new_tokens, which is crucial here, because LaMini supports a maximum of 512 new tokens, so let's keep it at 512; and repetition_penalty to 1.03. Close the parameters dictionary, and that's the payload: inputs plus parameters.

Now let's send the request to the endpoint: response = predictor.predict(payload), using the predictor we created above, and print(response). It complains, 'expected after parameters'... ah, a missing colon; fixed. Now look how fast the response is. In the previous request we had the same query, 'Write a short article on blockchain', but now, with the payload and parameters defined, we're getting a much better response.
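The payload cell, as described (these values follow the walkthrough):

```python
prompt = "Write a short article on blockchain."

payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.7,
        "temperature": 0.3,         # low temperature for less randomness
        "top_k": 50,
        "max_new_tokens": 512,      # LaMini's maximum
        "repetition_penalty": 1.03,
    },
}

# Send the request to the endpoint
response = predictor.predict(payload)
print(response)
```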
You can see it: 'Blockchain is a decentralized digital ledger that records transactions in a secure and transparent manner. It is a decentralized system that uses cryptography to secure and verify transactions and blocks. Blockchain technology has the potential to revolutionize various industries...' and so on. So we've got it; we've basically completed the first stage. What's next?

Let's define the endpoint name: go back to the console, click on the endpoint, copy its name, and paste it into a variable here; that's our endpoint. Then import boto3, because now we're going to run inference against the endpoint directly using invoke_endpoint, which is a method on the SageMaker runtime client: runtime.invoke_endpoint is what we'll use. So: runtime = boto3.client('runtime.sagemaker'); our client is the SageMaker runtime. Next, response = runtime.invoke_endpoint(...) (you can see IntelliSense picks it up), and I pass three things: EndpointName, which is the endpoint name we just copied; ContentType, which is 'application/json', since we're sending JSON (if you're sending bigger request bodies you could also explore multipart, but check whether it's supported); and Body, which is json.dumps(payload), the same payload as before.

Now it gets the response. If I just print the raw response, it does contain the generated output, but wrapped with response metadata and other things I don't need. In the response you can see there's a 'Body'; let's read that: prediction = json.loads(response['Body'].read().decode('utf-8')), where read() reads the body and decode('utf-8') decodes it (lowercase 'utf-8', not caps). Printing prediction shows it's still a JSON list, so the final text is prediction[0]['generated_text']. My first attempt at the indexing was wrong (that's not how it works), but once fixed you can see the response we got, and how fast it is, because we have a decent enough machine.
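The invoke cell, assembled (the endpoint name is a placeholder; copy the real one from the console):

```python
import json

import boto3

endpoint = "<your-endpoint-name>"  # copied from the SageMaker console

runtime = boto3.client("runtime.sagemaker")

response = runtime.invoke_endpoint(
    EndpointName=endpoint,
    ContentType="application/json",
    Body=json.dumps(payload),
)

# The raw response wraps the output in metadata; parse the streamed Body
prediction = json.loads(response["Body"].read().decode("utf-8"))
print(prediction[0]["generated_text"])
```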
All of this will really help you: how to take a language model, or a large language model, to SageMaker and deploy it with the DLC, the Amazon SageMaker Deep Learning Container built for Hugging Face models; it supports around 10 to 15 base models, plus their variants. You've seen how we deploy it, get an endpoint automatically, and run inference against that endpoint with invoke_endpoint. But all of this is inside SageMaker; how do you get an API endpoint you can call from outside, and how do we test it with Lambda? Let's do that.

Our SageMaker endpoint is fine as it is; leave it. Now find Lambda in the console and open it. Lambda is a serverless compute service: it's completely serverless and it lets you trigger events. There are many kinds of events; for example, on an e-commerce website, when you add some goods to a shopping cart, that can be called triggering an event. Lambda is very intuitive to understand, but it has some timeout limitations, and we'll see how to handle them. As the console says, it 'lets you run code without thinking about servers'; that's why it's serverless.

Let's create a function. I'll name it 'generate-response' and use the Python 3.10 runtime, since that's what I'm most comfortable with. Let me see if I can use an existing role... no, there isn't one. So let's create a new role with basic Lambda permissions and the other defaults, and see if we get any errors: author from scratch, function name, runtime, create. Now you can see the function overview for generate-response; you can add layers, and you can add triggers, as I said. There are a few things here: a Test tab, a code snippet with a lambda_handler that handles the trigger, and environment variables, which we don't need.

Now let's write some code. The first thing is to bring in boto3, because we need a Python client. Then the endpoint name: let me go back to SageMaker, copy the endpoint name, and define it here; it's global, defined outside the handler function. We also need the runtime client: runtime = boto3.client('runtime.sagemaker'). So we've imported the library and defined the endpoint and the runtime.
Next is the lambda_handler, which takes event and context. Inside it, let's read the query parameters: query_params = event['queryStringParameters'], so the query_params variable now contains a dictionary of the query string parameters. Then pull out the query itself: query = query_params.get('query'). That looks right.

Next, the payload: I'll just copy the whole payload from the notebook and paste it into the Lambda. Inputs and everything look fine, except 'inputs' should now be query instead of prompt, since we're taking the query as the input; the parameters stay the same. After that, let's copy the response code from the notebook too, the runtime.invoke_endpoint call, and paste it here; then the prediction parsing, same as before; and finally a final_result, which is nothing but prediction[0]['generated_text']. In the return statement, the status code 200 is fine, but the body will no longer be 'Hello from Lambda': remove that and make it json.dumps(final_result).

Let me review it once: we've imported boto3, defined the endpoint and runtime, and the lambda_handler reads the query parameter, builds the payload, invokes the endpoint, and returns the result. That's the code for the Lambda function, assembled below; we'll look at the function URL and the rest of the configuration later. First we have to deploy the changes, so click Deploy.
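A sketch of the complete Lambda function as assembled above (the endpoint name is a placeholder):

```python
import json

import boto3

# Defined globally, outside the handler
endpoint = "<your-endpoint-name>"  # copied from the SageMaker console
runtime = boto3.client("runtime.sagemaker")


def lambda_handler(event, context):
    # Query string parameters arrive on the event dictionary
    query_params = event["queryStringParameters"]
    query = query_params.get("query")

    payload = {
        "inputs": query,
        "parameters": {
            "do_sample": True,
            "top_p": 0.7,
            "temperature": 0.3,
            "top_k": 50,
            "max_new_tokens": 512,
            "repetition_penalty": 1.03,
        },
    }

    response = runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    prediction = json.loads(response["Body"].read().decode("utf-8"))
    final_result = prediction[0]["generated_text"]

    return {
        "statusCode": 200,
        "body": json.dumps(final_result),
    }
```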
Now let's test it. Click Test and create a new test event; let's call it 'generate-test-response', keep it private, and define the event template. The httpMethod will be GET in our case; for the path, keep an example value, since we're only testing whether this Lambda function can generate a response for a given prompt; and in queryStringParameters we define the query, the same as before: 'write a short article on blockchain'. You can also click 'Format JSON' and it will format the JSON for you. The event ends up looking like the sketch below.
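For reference, the test event used here looks roughly like this (the path value is just a placeholder):

```json
{
  "httpMethod": "GET",
  "path": "/example",
  "queryStringParameters": {
    "query": "write a short article on blockchain"
  }
}
```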
Save the event and run the test... cool, the status code is 200, which looks fine. Back to the Code tab; everything is in place, so deploy: 'Successfully updated the function generate-response.' Now test again... and it says 'An error occurred when calling the InvokeEndpoint operation: not authorized to perform...'. Let me figure out why I'm getting this error. OK, I figured it out: the error was due to IAM policies, so we have to attach a policy; you can see I've now created a policy granting access to that SageMaker endpoint. Run the test again: execution starts, but after three seconds it fails with 'Task timed out after 3.01 seconds'. That means this Lambda function has a timeout problem: under Configuration > General configuration, the timeout is 0 min 3 sec. Let's edit it to 5 min 0 sec; now the timeout has been changed from three seconds to five minutes, so the function has time to reach the endpoint and retrieve the response. Back to Code, deploy once more, and test: now we're getting a response, and in the body you can see 'Blockchain is a decentralized digital ledger that records transactions in a secure and transparent manner. It is a decentralized system...'; you can read the complete response we got through this Lambda function. So we've created the Lambda function, generate-response; you can see it in the function overview along with its configuration.

Now, about exposing it: it says 'No function URL is configured'. You could create an API Gateway endpoint instead, but that has an issue; let me show you what I mean. Search for API Gateway; there you have, for example, an HTTP API; click Build, name it 'generate-response', Next, Next, keep the defaults, Create. The problem with API Gateway is that it times out after about 30 seconds; I believe it has a hard limit of 29 seconds to get a response, and when you're working with a large language model you often won't get the response within 29 seconds, especially with 30B or 40B models, even more so on a consumer GPU or a single GPU; it can take more time. So I'm not going to set that up in this video. My focus instead is the function URL: on the function page there's 'Create function URL'. Set the auth type to NONE for this example; in the additional settings, 'buffered' is fine; and for CORS, configure an asterisk (*) for the allowed origins. That looks good, so let's create it. Lambda warns that the function URL is public, because it won't perform IAM authentication, which is OK here. The good thing about the function URL is that its timeout allowance goes well beyond API Gateway's roughly 30 seconds.

Let's open the link. At first it says 'internal server error'; we have to pass the query parameter, so append ?query=write an article on blockchain. I'm not sure why I got the server error the first time (the changes had just been saved; you can dig into that later), but the whole idea was to deploy the model, and you can see we got a response: 'Blockchain is a decentralized digital ledger...', and it's so fast. The function URL works much the same way as Azure Functions on Azure; on AWS we have the function URL. You can use it from your Streamlit application or any other app: build a UI and call it from the backend. Let's try another query, say 'write an article on AI': when I hit it, it fetches the response, and again look how fast it is: 'AI is a rapidly growing field that involves...' and so on. So we got our response through the function URL. Of course, you can also create and configure an API Gateway and invoke the Lambda through it; I'm not doing that here, it's a couple of extra steps, so I prefer the function URL in this case.

So now we know how to take a language model from Hugging Face and deploy it through the DLC using the Hugging Face LLM image URI: the endpoint is created for you automatically in SageMaker; then you create a Lambda function, generate-response, making sure the IAM policies are in place; and then you create a function URL from it. You can add authentication to the function URL if you want; for now I haven't added any.
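Since the function URL is plain HTTP, you can call it from any client; for example, a hypothetical Python snippet (the URL is a placeholder, and `requests` is a third-party package):

```python
import requests

FUNCTION_URL = "https://<url-id>.lambda-url.us-east-1.on.aws/"  # placeholder

resp = requests.get(FUNCTION_URL, params={"query": "write an article on AI"})
print(resp.status_code)  # 200 on success
print(resp.json())       # the generated article text
```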
So that's what I wanted to show in this video: how you can deploy a large language model on AWS. This was entirely the model part; maybe in the next video we can build a simple application using the function URL we created. You can see how easy it is to use SageMaker and the DLC to deploy Hugging Face models. Now you can deploy models on your own: please go ahead, deploy a few, and let me know your findings and insights about these services. You can create an account on AWS, and the entire exercise will not cost you more than two or three dollars, to be honest. Just go ahead and try it out, because at your workplace you'll mainly be working with some hyperscaler, whether that's AWS, GCP, or Azure.

I hope you liked the video. If you have any thoughts or feedback, please drop them in the comment box, and you can reach out to me through my social channels. I'm also building a community on WhatsApp and Discord; please join, you'll find the WhatsApp link in the YouTube banner. Most of the code will be available in the GitHub repositories. That's all for this video; thank you so much for watching, and see you in the next one.
Info
Channel: AI Anytime
Views: 7,791
Keywords: Deploy LLMs (Large Language Models) on AWS SageMaker using DLC, generativeai, chatgpt, deploy on aws, aws, sagemaker, lambda, aws lambda, function url, ai, deploy ai models
Id: A9Pu4xg-Nas
Length: 57min 6sec (3426 seconds)
Published: Sat Jul 01 2023