Serve a Custom LLM for Over 100 Customers

Captions
How can I serve a custom language model for many customers? That's a very common question I get from people who have fine-tuned their models, so today I'm going to show you how to do it as cost-effectively as possible. I'll walk you through all the steps: from choosing a GPU, to choosing software to set up the API, to choosing software that lets you make inference calls against that API and manage the prompts you send and the responses you get. So let's talk about serving a custom LLM for 100-plus customers.

Here's the agenda. There are three key things you need. First, you need to choose a GPU server, whether that's your own or one you rent. Then you need to pick the software to run an API; the two I'll talk about in particular are Text Generation Inference (TGI) from Hugging Face and vLLM, which is an independent project. Last of all, you need some software that organizes your messages, sends them to the API, and deals with the responses that come back; optionally, you might also want to handle function calls. I'm going to take you through all of that in an end-to-end example, and then we'll have the usual final notes and resources.

Before the end-to-end example, let me walk you through the high-level choices you can make for a GPU and for software. To do this I'm going to walk through the Advanced Inference repo. You can purchase access to it on Trelis.com, but hopefully I'll go through things in enough detail that you'll grasp the main ideas even if you don't buy it. The Advanced Inference repo has just been reorganized into two main folders: a folder called "server and API setup", which helps you choose a server and choose software to run an API, and a folder called "API calls", which lets you make calls to the API, including concurrent calls, speed tests, and calls with functions.

Let's look inside the server and API setup folder, particularly at the README file. I want to briefly talk you through choosing a server, and then through choosing API software, that is, software that serves an API that many users (or perhaps just you) can make requests to at once.

Starting with choosing a server: at a high level you'll consider a few different options. The first is OpenAI. If you're looking for something cost-effective, you're probably considering the GPT-3.5 Turbo model, and for the vast majority of cases that's going to be the most cost-effective option, because you only pay for the tokens you use at inference. If you move to running your own server, you pay on an hourly basis, or you need to own the hardware, so it's much harder to pay in small increments. Furthermore, because OpenAI has so many users, they can batch together many requests and amortize the cost across all of them, and it's really hard to beat the cost per token they achieve, particularly with GPT-3.5 Turbo. However, if you need privacy and don't want to send data to OpenAI, or if you want to serve a custom model, which is what this video is all about, then you'll need to either have your own hardware or rent some. There are two rental options I'll go through here, RunPod and Vast.ai, and I'll also talk briefly about AWS and Azure. First, let me mention that you could also run on your own computer hardware.
Perhaps you have bought a GPU, or perhaps you have a Mac with an M1 or M2 chip. These can deliver good performance, but you'd need to leave them online and make them reachable, because your computer would be serving as a server. That's maybe not something you want: you might be using the computer for something else, you might not want the hassle of leaving it online, or it might be hard to keep your own machine reliably online. So while your own computer is good for testing, and maybe even for training, you probably don't want to use it to serve customers.

That brings me to renting GPUs, and I'll go through a few options in reverse order. I first looked at AWS, and I've also looked at Azure. For smaller companies and startups it can be very hard to get access to GPUs on those services, and I've found them quite expensive per hour compared to more market-rate services. That's why in this video I'm going to focus on two: RunPod and Vast.ai. In a lot of my videos so far I've used either Google Colab, which is good but trickier to set up for APIs, or RunPod, which I've covered a lot. Today I'm mostly going to focus on Vast.ai, but let me give you the key differences I've found between the two.

RunPod is, I think, the easiest setup: they provide you with a proxy URL that you can immediately make calls to. Its lowest-priced GPU is roughly 30 to 40 cents per hour, although that varies week to week, and as a benchmark an A100 with 80 GB of VRAM is about $2 per hour. It supports one-click templates, and you can check out the RunPod setup file in the repo if you want more detail.

The other option is Vast.ai, which I've recently been looking at. It takes a bit more setup, because you need to SSH into the pod, and that requires setting up a key pair; there are instructions in this repo for getting that done, and it's not too bad if you have clear instructions. It has largely the same prices for GPUs like the A100, A6000 or H100, so very similar to RunPod at about $2 per hour, but it also offers smaller GPUs priced lower, and if you want a minimum viable API this is probably the lowest-cost way to serve. I'll show you GPUs you can rent for as low as about 10 cents per hour, which works out to roughly $75 a month if you leave it running continuously, and that's the cheapest way I've found to rent a server that can handle maybe 50, 100, or even more customers, depending on how simultaneous their requests are. Vast.ai also supports one-click templates, which I'll be sharing, and there's more detail in the Vast.ai setup .md file in the repo.
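As a quick sanity check on that hourly rate, here's the arithmetic; the 10-cents figure is just the example from above, not a quote for any particular listing:

```python
# Rough monthly cost of leaving a rented GPU running continuously.
hourly_rate = 0.10                        # dollars per hour (illustrative)
monthly_cost = hourly_rate * 24 * 30      # hours in a month, roughly
print(f"~${monthly_cost:.0f} per month")  # ~$72/month at 10 cents per hour
```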
In short, I'd say that if I'm using a larger GPU to serve larger models, I'd probably go for RunPod: it's a similar price to Vast.ai and the interface is a bit more polished. Vast.ai is maybe a little less polished to set up, but it offers those smaller GPUs, which are good if you want something at minimal price to get started before you decide to spend more on larger servers. That covers the choice of server; I'll talk a bit more about the specific GPU (A100 versus A6000 versus A4000) later.

Once you have a server up and running, you'll want software to run an API on it. The two I've looked at are Text Generation Inference and vLLM. In principle, if you're using your own Mac, you could run llama.cpp or MLX, which is some software I've just been looking at recently. These work reasonably well for inference, but they're less well set up for serving large volumes of requests: a single request, particularly a shorter one, will work well, but once you start parallelizing a lot of longer requests you need more advanced techniques like flash attention and flash decoding to get the best performance. So let me focus on TGI and vLLM.

Text Generation Inference is, I think (and I'm not fully confident here), probably the fastest for very long contexts, say over 8,000 input tokens, because it supports flash decoding and flash attention v2, which make more efficient use of the GPU when the context is long. TGI is built by Hugging Face and leverages their libraries for quantization: notably it lets you use bitsandbytes for 4-bit quantization and EETQ for 8-bit quantization. This becomes relevant a little later: if you have a model that's, say, 16 GB but you need it to fit in the GPU's VRAM, you can cut its size roughly in half with 8-bit EETQ, or roughly in four with bitsandbytes NF4, and the quality of those quantizations is quite good. They're also done on the fly: you still specify the original model and the library takes care of loading it in quantized form. That's very nice if you've just fine-tuned a model, because you don't need to run a separate quantization script.

Moving to vLLM, one benefit is that it offers an OpenAI-style API, so if you've already built a program to work with the OpenAI API, you can literally import the OpenAI client and use it for inference against your custom endpoint. That's a nice compatibility benefit. In my experience it also has generally better support for AWQ (activation-aware quantization) than TGI, although TGI now has some AWQ support. AWQ is better quality than GPTQ, although it can be less well supported by these libraries, and it's probably also faster than the on-the-fly quantization types that TGI supports, i.e. faster than EETQ or bitsandbytes. However, you need a pre-quantized model, so for example you might grab a model from TheBloke on Hugging Face, who has quantized a ton of models, and you have to load that pre-quantized model.
If you load a quantized model, there's the benefit that it's smaller, so it's faster to download and faster to load onto the server. But if you're coming from a fine-tuning perspective, it means that once you've made a custom model you'll additionally have to run a quantization script; that's covered in the Advanced Fine-tuning repo, which you can also check out on Trelis.com.

Whether you're using TGI or vLLM, one of the quickest ways to get started is a one-click template. One-click templates typically use a Docker image to do all of the installation automatically. You could alternatively SSH into the instance and manually install CUDA (the software for running the GPU), but it's generally much quicker to pull a pre-built Docker image that pre-installs everything and then just set the parameters for the model you want to use. I'll show you some ready-to-go Docker templates, one for TGI and one for vLLM.

That brings me to some tips on GPU selection, for which I need to describe a few key parameters of GPUs (we'll come back to these later). A GPU has several important figures of merit. Perhaps the most important are the memory size, i.e. the VRAM (video RAM), and the computational speed, typically quoted in FLOPS or teraflops, i.e. floating-point operations per second. Another parameter is the memory bandwidth; maybe that's not exactly the right name, but what matters is the speed at which the GPU can read from its VRAM into its deep computational units. The way a model works is that it first sits on your hard drive, you load it from disk into the GPU's VRAM (the GPU's main memory), and then any time a calculation needs to be done, a small slice of that memory, maybe 30 to 50 megabytes out of, say, 16 GB of VRAM, is read into the compute units where the computation actually happens. It turns out that read is often the bottleneck for inference speed, so memory bandwidth is an important figure of merit too, even though it's less prominently displayed on dashboards like RunPod's or Vast.ai's.

Of course, higher is better on all counts: more memory lets you fit a bigger model, higher computational speed gives you more tokens per second, and higher memory bandwidth also gives you more tokens per second. In fact, you're either going to be bottlenecked by computation — if you ask the model to process more and more tokens it will eventually be limited by compute — or bottlenecked by memory bandwidth, i.e. the speed of reading weights into the compute units, which is particularly the case when there are a lot of model weights to read, as with larger models.
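To make that bandwidth point concrete, here's a rough back-of-envelope sketch. The numbers are illustrative assumptions rather than figures from the video, so plug in your own GPU's spec-sheet values:

```python
# Back-of-envelope decode speed estimate for a single request (batch size 1).
model_size_gb = 15.0         # e.g. a 7B model stored in 16-bit
mem_bandwidth_gb_s = 300.0   # GPU memory bandwidth -- check the spec sheet

# Generating each token requires reading roughly all of the weights from VRAM,
# so memory bandwidth caps single-stream decoding speed:
max_tokens_per_s = mem_bandwidth_gb_s / model_size_gb
print(f"~{max_tokens_per_s:.0f} tokens/s ceiling when memory-bandwidth bound")

# With many concurrent requests the weights are read once per batch and reused,
# so the limit shifts towards compute (teraflops) rather than bandwidth.
```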
OK, so let's move on to the first step in picking a GPU, which is based on VRAM: you need the GPU's VRAM to be bigger than your model, otherwise you won't be able to load it onto the GPU at all. For example, Llama 7B is roughly a 13–14 GB model. Let's actually take a look at an example: a 7B model, OpenChat, which is based on Mistral and therefore quite similar to Llama. Looking at the files, there are two weight files totalling about 15 GB. That means if you're going to load it onto a GPU in 16-bit, a format often called BF16 or bfloat16 ("brain float 16"), you need at least 15 GB of VRAM, and you'd be recommended to have maybe 20 GB for some headroom if you want to run a model like this. Llama 70B is ten times bigger, so you'd need something like 150 to 200 GB of VRAM; the largest GPUs are typically 80 GB, so to load a Llama 70B model you'll need at least two of these large, expensive GPUs.

One further thing to keep in mind: VRAM isn't only for the weights. Say Llama 7B were exactly 13 GB and your VRAM were 15 GB, leaving 2 GB spare. When the model processes a sequence, it stores a history of previously computed values: you might be predicting the next token of a long sentence, and it's common to cache earlier calculations because they're useful again when you compute the token after that. That history is called the KV cache (K for key, V for value), and it also takes up space in VRAM. That's why you need headroom between the model size and the GPU size, and the longer your context, the more history is stored in the KV cache, so the more headroom (and the larger the VRAM, and the larger the GPU) you need. We'll come back to this, but if your use case is limited to shorter contexts, you can cap the context length when you set up the API, which lets you get away with a smaller GPU than if you assume you'll use the model's full context length.
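To put rough numbers on that weights-plus-KV-cache budget, here's a small sketch. The per-token KV size uses Mistral-7B-style architecture figures (32 layers, 8 KV heads, head dimension 128); the context length and concurrency are just assumptions for illustration:

```python
# Rough VRAM budget: weights + KV cache, at different weight precisions.
params_billion = 7.0
bytes_per_param = {"16-bit": 2.0, "8-bit (EETQ)": 1.0, "4-bit (NF4/AWQ)": 0.5}

# KV cache per token ~ 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16).
kv_bytes_per_token = 2 * 32 * 8 * 128 * 2        # ~0.13 MB per token
max_len, concurrent_seqs = 2048, 10

for name, b in bytes_per_param.items():
    weights_gb = params_billion * b              # billions of params * bytes ~ GB
    kv_gb = kv_bytes_per_token * max_len * concurrent_seqs / 1e9
    print(f"{name:>15}: weights ~{weights_gb:.1f} GB + "
          f"KV cache ~{kv_gb:.1f} GB ({concurrent_seqs} seqs x {max_len} tokens)")
```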
So what do you do if you want to use a smaller GPU? Say you want to run Llama 70B and expect to need about 150 GB, but you want to run it on a single A100, which has only 80 GB. With Text Generation Inference you can specify quantization options. There are two: EETQ cuts the VRAM required roughly in half, so instead of 150 GB you'd need around 75, which just about fits on an A100; or you can divide it roughly by four by quantizing to 4-bit with bitsandbytes NF4, which brings the requirement down to around 40 GB. The other option, if you're using vLLM to set up your API, is to load an AWQ model. For example, there's an AWQ version of Mistral 7B Instruct v0.2, a very nice little upgrade over the first Mistral 7B, and it's already quantized: its files are about 4 GB, whereas the safetensors files of the original v0.2 (nice, they've actually pushed safetensors) are roughly 15 GB. So by quantizing you save a little less than a factor of four in size, which improves loading time and also means less VRAM is needed to load the model.

Just a comment on quantization: it's not a complete free lunch. It's true that you need less VRAM if you quantize (or quantize on the fly, in TGI's case), but quality goes down a bit, and it doesn't necessarily save you computation. Although the weights sit in VRAM in quantized form, they are often dequantized before the actual computation, so you can think of them as stored compactly in VRAM but expanded back out to 16 bits deep inside the GPU, in certain cases. It varies by quantization type, but with bitsandbytes NF4, for example, this helps you fit into a smaller amount of VRAM; however, if you ping the server with a lot of parallel requests, the extra work of expanding back to full precision eventually starts to dominate the computational workload and slows down your tokens per second. So quantizing, particularly on the fly, is a handy trick to fit into less VRAM, but if you're going to hammer the server with requests, you may be better off loading the model in full precision on a GPU big enough to hold it.

Let me finish off this page on server and API setup, and then we'll finally get to the full end-to-end example. Once you've got a GPU that fits your model, the next things to look for are, first, decent upload and download speed, so you're not waiting forever for your model weights to download to the GPU, and second, FLOPS, the computational power, which you'll see specified in teraflops. You want that as high as possible, subject to your price constraint, so you can process more tokens per second.

Before the end-to-end example, let me go back and give you a look at the Vast.ai setup document (there's one for RunPod as well), which you'll find in the server and API setup folder. As with RunPod, you need to set up an account; there's an affiliate link if you'd like to support the Trelis channel, which I'll also put below in the description. You'll need to add a credit card; you can add as little as $5 for Vast.ai (I think it's $10 for RunPod). You do need to set up an SSH key pair for Vast.ai; there are instructions in the doc, and once it's done you'll be able to SSH in and connect to the instance, after which you can send requests to an endpoint on your localhost. There are some tips in the doc on GPU selection, but I'll show you that live in the end-to-end example.

OK, so for the end-to-end example I'm going to show you how to serve, for as low a cost as possible, a Mistral 7B Instruct model, the v0.2 which just came out, and I'll serve it with vLLM and AWQ, so it's a compact model of about 4 GB. I'm also going to show you how to serve OpenChat 3.5, which is a very strong function-calling model; in my experience it's better than the Mistral models, including v0.2, and actually better than some of the larger models I've tested as well.
So that model will also let me demonstrate the software for calling the API with functions and handling the function responses, for some specific requests about the weather.

OK, let's get started with Mistral 7B Instruct. Assuming I've already set up my Vast.ai account (which I have), I'm going to find a one-click template; I'll drop this in the description as well. I go to the Vast.ai setup doc, scroll all the way down to deploying an API with vLLM, which is what I said I'd focus on, and open up the one-click template. So here we are on the Vast console, and we've got the vLLM image. Let me walk you through what's inside. If we click edit, you can see it's using the latest vLLM Docker image, and here are the Docker options we've set up. The vLLM image runs the API on port 8000, so we're going to map port 8000 of the Docker container onto port 8000 of the server. So there's the Docker container, then the server, and then there will be our local machine; I know that's a lot, but there are three hops: port 8000 on the Docker container is mapped to port 8000 on the server, and in a minute we'll map port 8000 on the server to port 8000 on my localhost, on my own computer. We're going to run it with SSH.

Here you can see the entry point: once the Docker container has installed everything (like CUDA), this is the command that runs, and it will start downloading the weights for this AWQ model. We've set quantization to AWQ, and you also need to set the dtype to half. dtype is the data type: a lot of models are trained in 32-bit precision but are often inferenced in 16-bit, and AWQ actually uses 16-bit precision and then packs it down further to an even lower number of bits, so you do need to specify dtype half, meaning half precision, 16 bits instead of 32. And I'm going to set the max model length to 2048 so that I can use a smaller GPU and save money.

Scrolling down, there's a little README showing the SSH command to use when SSHing in; that's fine, we'll do that, and I'll show you live how it works. So we click select and save. The recommended disk space here could actually be set way down, to five or six GB, because in quantized form this model is quite a bit smaller; but disk space isn't that expensive anyway, it's the GPU that's expensive, not the disk.

Now we sort by price, increasing, and let me see if I can increase my screen size. You'll see on the right-hand side the price per hour; ooh, there's a cheap one at 6 cents per hour, even cheaper than what I'd said. Typically prices start around 10 cents. What I'm looking at now is the VRAM, which is highlighted here: I need at least around 8 GB or so, and if I can get a bit more I like that, I like being a little over. Then I want to maximize the number of teraflops.
I'm also looking, in the small print, at the upload and download speeds: I want those to be at least a few hundred megabits per second. So let's see: I could take this A2000, which has 12 GB of VRAM, so that's plenty given the model is quantized. It won't be all that fast, because it's only got 11.2 teraflops; it could be better to go for something bigger like an A4000, with 16 GB and 20 teraflops, which is a little more expensive. Why don't we use the more expensive one later for the function calling (not that it needs it), and for now I'll just try to minimize cost. Can we actually run on this really cheap one? I don't see why not, so let's try it. I've clicked rent, and if I go to the instances page it shows the instance being created, so we can follow what's happening with it from there. I'll give it a few seconds to start up.

Let's look at the logs for the AWQ instance. It looks like the downloading has started... and now all of the files have downloaded; you can see there's a conversation template for chat, and this all looks good. You can see uvicorn has started, so we're now deployed on port 8000. Now we can click the Open SSH button up here. This suggests mapping port 8080 to localhost 8080, but the vLLM image is running on port 8000, so I'll copy most of this command, take it to a terminal, paste it, and instead of mapping to 8080 I'll map to port 8000. (I could map it locally to 8080, it wouldn't make a difference, but I've set it to 8000.) I say yes, I want to continue connecting, and now I'm connected via SSH.

If I open a new browser window, I should be able to make a quick request to test the endpoint. To check how, let's go back to the doc: here are the instructions reminding you to swap the port mapping, and here are the instructions for making a quick request to the API, which I'll do right here. You can see it's working: when I request the list of models, it responds with the name of the model I've loaded, which is perfect. I'm going to copy that model name, because what we'll do next is make calls to the API.

The way we do that is with the Advanced Inference repo, which I've cloned onto my local computer; we'll use what's inside the API calls folder, starting with the scripts for speed tests, and later the function call scripts. Let's move over to VS Code. I'm here in my .env file; if you're doing this for the first time, copy the sample .env and rename it to .env. I've commented out the endpoint that would be used for RunPod and commented in localhost, because I've mapped vLLM to port 8000 on my localhost, and I've pasted in the model name we're using. Now I can cd into the API calls folder and then into the speed tests folder.
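Under the hood, the repo's scripts make OpenAI-style calls against that local endpoint. Here's a minimal sketch of the same idea, assuming the SSH tunnel maps the server to localhost:8000; the model id shown is an assumption, so use whatever the models request above returned:

```python
# Minimal OpenAI-style call to the local vLLM endpoint.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Same check as the quick browser/curl request: list the loaded models.
print([m.id for m in client.models.list().data])

# Simple chat completion against the loaded model.
response = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed id; use the one listed above
    messages=[{"role": "user",
               "content": "Write a long essay on the topic of spring."}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```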
We'll start off with a quick speed test. This just sends one prompt to the API asking it to write a long essay on the topic of spring. I'll run python vllm_speed.py; it asks for a response of up to 500 tokens, so it takes a few seconds to come back and then tells us the time per token for that simple request. After that we'll test concurrent requests, where we ping the API with many parallel requests to see how it responds.

So, first result: we had 19 prompt tokens (that was just me asking it to talk about spring), 500 tokens generated, and about 40 tokens per second. That's pretty good, well above reading speed. If you're interested, we can also print out the conversation; I'll run it once more so you can see the text it generates. You can see "write a long essay on the topic of spring" and the reply, "Title: Spring, a season of renewal and rebirth", then some paragraphs on spring, truncated because I set a maximum of 500 tokens.

Next up, the concurrent test. Here we send a request every half second, 20 requests in total, and see what happens: python vllm_speed_concurrent.py. You can see the first response takes about 12 seconds to come back, and after that multiple responses start coming back. Because we're hitting the API with multiple requests it's going to be slower, but the bottleneck is typically reading the model weights from VRAM into the GPU's compute units, so you actually don't see too much degradation. We're on request 17... request 20. Looking through it once more: the first request runs at about 20 tokens per second, so it's definitely slower than the single-request case, because we're starting to load up the GPU, and as you move down towards the end, around request 20, the total time is roughly 22 seconds. So you're pretty well able to serve at least that many people with requests arriving every 0.5 seconds, and you could probably increase that and serve even more at a higher request rate. Keep in mind this particular GPU has only about 10 teraflops (it's an A2000), so it's not the most powerful in terms of computation, but it shows you what you can run on a server that costs about 10 cents per hour. We'll push it a little harder when we get to the function calling example.
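For reference, here's a sketch of what a concurrent test like the one above might look like; this isn't the repo's actual script, just the general pattern, with the endpoint and model id assumed as before:

```python
# Fire a request every 0.5 s, 20 in total, and report per-request tokens/s.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # assumed; use the id from /v1/models

def one_request(i: int) -> None:
    start = time.time()
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": "Write a long essay on the topic of spring."}],
        max_tokens=500,
    )
    elapsed = time.time() - start
    tokens = r.usage.completion_tokens
    print(f"request {i:2d}: {tokens} tokens in {elapsed:.1f}s "
          f"-> {tokens / elapsed:.1f} tokens/s")

with ThreadPoolExecutor(max_workers=20) as pool:
    futures = []
    for i in range(20):
        futures.append(pool.submit(one_request, i))
        time.sleep(0.5)  # stagger submissions by half a second
    for f in futures:
        f.result()
```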
Just a brief recap of where we are. First, I've shown you how to run on a very small GPU by using quantization, which shrinks the model and fits it onto a lower-cost GPU. The second thing I'm going to do is run a function-calling model, and I'll run that in full 16-bit precision, and then show you the function calling functionality.

So let's get the function-calling server set up. Again I'm using the Vast.ai template, and I'm going to edit it, or rather just check everything is in order. Here's the template: it'll be an SSH connection, we'll be loading the OpenChat 3.5 function-calling model, and the max length is 2048. There's one difference from what I showed earlier: I've removed the extra dtype half parameter altogether; it's only required when using AWQ quantization. The last thing I need to do is add my Hugging Face token, because this is a function-calling model that needs authentication to download; I've added my Hugging Face Hub token here, a read token I created on Hugging Face. It's important that it goes into the Docker options (sorry, I didn't mean to navigate away there, let me go back into editing the image) — yes, it's important it goes into the Docker options, so that it's passed as an environment variable rather than into the on-start script.

Next is the entry point: after the Docker container loads, this script runs and starts the vLLM server with these parameters. Notice we have the max model length of 2048. Earlier I showed a dtype half parameter; that's only needed for the AWQ model. If you specify it here, it will give you float16 instead of bfloat16 (brain float 16), which is the default and a more efficient data type from an information-theory perspective. In short, don't put in the dtype half parameter if you're running without AWQ quantization.

Now we're ready to go, so I'll select the template and choose an instance. Sorting by price again, I'm going to want an instance with 16 GB, because just the weights will be at least 14 GB and I need room for the cache on top. Here's an instance for 13.5 cents with 20 teraflops, twice the computational power we had for the AWQ model, albeit at a cost of an extra 3 cents or so. So we create the instance: the Docker container installs first, and then it runs the vLLM server. Notice that I'm specifying not just the model but also the tokenizer. The reason that matters is that we want to use the chat template, which is specific to function calling. If you look at the tokenizer in the OpenChat 3.5 function-calling repo, its chat template includes handling for the function metadata, for the function call, and for the function response: when the model returns a function call, the template marks it as a function call, and when our function gives a result back, the template pastes that in as the response to the function call so the model can use it to answer the question. We want vLLM to use this chat template, and that's why we specify the tokenizer.
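If you want to see what that chat template actually does to your messages, you can render it locally with the tokenizer; the repo id below is a placeholder for the function-calling tokenizer, and the token is your own read token, so treat both as assumptions:

```python
# Inspect how a tokenizer's chat template renders a conversation into a prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Trelis/openchat-3.5-function-calling",  # placeholder repo id
    token="hf_...",                          # your Hugging Face read token
)

messages = [{"role": "user", "content": "What is the current weather in London?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the exact prompt string the server will build from these messages
```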
All right, the vLLM server looks like it has started. I can tell from the disk usage that it's probably still downloading the weights, because only a couple of the 16 GB of disk are used so far, and I can also tell the GPU hasn't loaded the weights yet, because only a small fraction of its 16 GB of VRAM is in use. We can check the logs, which will tell us the download status or whether we've hit an error; the good news is we haven't. You can see the weights being downloaded: there are two files, corresponding to the files on the model page, one of about 10 GB and a second of about 5 GB. The second file is smaller, so it will finish downloading first, and then the first will complete. We'll take a quick break to let that download.

All right, the instance is up and running, so let's check the logs again; they should show the weights have been downloaded. Give it a second... yes, the weights have downloaded, and you can see that the default chat template, the one from the tokenizer, is being used, so it will inject the function metadata and handle everything correctly. We know everything is running because it says uvicorn is running on port 8000. One little tweak I did have to make: what we're doing is kind of tight, running a 7B model in 16-bit precision on just 16 GB of VRAM, so to make it fit I increased the GPU memory utilization to 95% (the default is, I think, 90%); you can squeeze out a bit more by raising that a little. You can see our GPU is using 14.8 of 16 GB, so we're pretty close to the top, and in terms of disk we've got 14 GB of weights on a 16 GB disk, which is fine.

So let's SSH in: I click to get the SSH connection, and remember vLLM runs on port 8000, so I swap that in instead of 8080. Head over to a terminal, paste the command, make sure we forward to port 8000 on localhost, say yes, and it looks like we're connected. I can test the connection by making a curl request to port 8000, the OpenAI-style request asking for the models, and indeed the model is the OpenChat 3.5 function-calling model.

Now let's head over to our scripts for calling the API. If you've git cloned the repo, go into the API calls folder; this time we're interested in the function API calls, so I'll cd into that folder and we'll run the vLLM function call script. Notice that in the functions folder I've set up two different functions: one gets the current weather (it just takes a city and gives back the weather), and the other takes a temperature and a condition and returns a recommendation of what clothes to wear in those conditions. So I'll be able to make some chained requests: if I ask what clothes to wear in a given city, the model has to make two function calls, first to get the weather and then to get the clothes. I've also specified the tools in OpenAI style; that's the metadata for those two functions. We're ready now to run the function call script: it takes a set of messages, includes the function metadata, and asks what the current weather is in London as a starting point. I do need to specify a few things first: the model we're using, and, because on Vast.ai we're running on port 8000 locally, localhost port 8000 as the endpoint. All right, we're ready, and we can run python vllm_function_call.py.
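Before looking at the output, here's a hedged sketch of the kind of round trip the function-call script performs. The exact wire format — how function metadata and function results are passed in the messages — is determined by the repo's script and the model's chat template, so the message layout, function names, and JSON parsing below are illustrative assumptions rather than the repo's actual protocol:

```python
# Illustrative function-calling round trip against the local endpoint.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "openchat-3.5-function-calling"  # placeholder; use the id from /v1/models

def get_current_weather(city: str) -> dict:
    # Dummy stand-in for a real weather lookup.
    return {"city": city, "temperature": "15 C", "condition": "cloudy"}

function_metadata = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

messages = [
    # Assumed layout: metadata passed up front; the server-side chat template
    # decides how this is actually rendered into the prompt.
    {"role": "system",
     "content": "You can call these functions:\n" + json.dumps(function_metadata)},
    {"role": "user", "content": "What is the current weather in London?"},
]

reply = client.chat.completions.create(model=MODEL, messages=messages)
text = reply.choices[0].message.content

# If the model answered with a JSON function call, run it and send the result
# back so the model can produce a natural-language answer.
try:
    call = json.loads(text)
    if call.get("name") == "get_current_weather":
        result = get_current_weather(**call.get("arguments", {}))
        messages += [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Function response: " + json.dumps(result)},
        ]
        text = client.chat.completions.create(
            model=MODEL, messages=messages
        ).choices[0].message.content
except (json.JSONDecodeError, TypeError):
    pass  # the model answered directly without a function call

print(text)
```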
Let's see how it handles that. Great: it's made a function call to get the weather, and it responds that the weather is cloudy with a temperature of 15. We can try an alternative request like "what clothes should I wear? I'm in Dublin", save that, run it again, and see if it handles the chained response. What happens here is it first gets the weather in Dublin, which comes back as 80 degrees and partly cloudy (sounds good for this time of year), then it calls the clothes function, which responds with what clothes to wear, and then it summarizes. So yes, OpenChat 3.5 is really, really good for function calling; it can even do these chained function calls, which is pretty beautiful.

That takes us through function calling, but while I have the model loaded, let's also run some speed tests. I'll go into the speed tests folder and run the basic speed test, python vllm_speed.py, which asks for a passage about spring, generates about 500 tokens, and measures the time taken. You can compare with earlier: one advantage here is more computational power, 20 teraflops instead of 10, but one disadvantage is that we're using full precision, so for a single API call it's a bit slower. You can see it's about 25 tokens per second, and if you recall, earlier in the video we were getting about 40 tokens per second. So for a single request you'll get better performance in quantized form, even, as in this case, with less computational power.

Now a quick concurrent test. Remember, in the concurrent test we're pinging every 0.5 seconds with 20 requests in a row, trying to put the server under pressure to handle a lot of requests, and we'll see how far the speed drops below that baseline of 25 tokens per second. Actually, you can see the performance doesn't drop much, and that's because we have more computational power and we don't have to do any dequantization; we're directly doing 16-bit (BF16) computation. GPUs are generally designed for 32-bit or 16-bit multiplications, so if you quantize you have to dequantize, which costs extra compute. That's not a problem when you're sending one request, because you have tons of spare compute and you end up memory bottlenecked, but if you're trying to send lots of requests you don't want to be dequantizing all of them. So when sending lots of requests you can be better off running in full 16-bit precision, whereas with a single request you're often better off running quantized. Here you can see we're getting from about 20 down to about 14 tokens per second.

Let's see if we can push that to 100. I'm going to put a lot of pressure on here: a request every eighth of a second, 100 requests in total, which is a fairly intensive use case. Each request will take at least 25 to 30 seconds to complete, and at eight requests per second, all 100 go out in roughly 12 seconds, so basically I'm completely overlapping all of these requests, which is exactly the pressure I want to put on the server.
OK, I've let that run to completion, and you can see that when I send in 100 overlapping requests, the first one takes 26 seconds to come back, at an average of about 18 or 19 tokens per second, and if I scroll all the way down, the last request is really quite delayed: from the time I made the first request until the last one completed is about 3 minutes, with an average of about three tokens per second. So for this size of GPU, with 20 teraflops, you're probably not in a good place if you want to serve 100 concurrent customers. If you want to serve maybe 10 concurrent requests, though — let's just run that quickly — I think you'll be in good shape with pretty good tokens per second. Yes indeed: no problem with 10 concurrent requests, you get 22 tokens per second on the first, and even down at the tenth request you're still getting about 18 tokens per second, so pretty much keeping ahead of reading speed.

And that's it, folks, for this video on inference. A few tips before I go. I've shown you deliberately small servers that minimize cost, and I think that's good when you're starting off with a relatively small number of customers and not too many concurrent requests. I've shown a case going up as far as 100 concurrent requests, and if you feel the tokens per second is too low, you can simply move to a larger GPU. One idea is to sort the GPUs on Vast.ai by the cost per hour per teraflop, so you can see which units give you the cheapest cost per unit of compute.

To summarize the overall video: if you're going to serve a model, the cheapest way is probably a service like OpenAI or maybe Google's Gemini, but if you need privacy or need to serve a custom model, that's what this video is for. First, choose where you'll rent your GPU: for a larger model you'll probably want an A100 or an A6000, which are similarly priced on RunPod and Vast.ai; for a smaller model you can use a cheaper GPU, down to perhaps 10 cents per hour. If you want to fit inside one of those GPUs, quantization can help; that might mean using AWQ if you're using vLLM for your API, or the EETQ or bitsandbytes NF4 options with TGI. Once that's set up, you can run inference with or without functions using some custom code, such as what's provided in the Advanced Inference repo.

All right folks, next up I'm planning a video on inferencing larger models, including mixture of experts, so keep an eye out for that one. In the meantime, let me know any questions on this one down in the comments. Cheers!
Info
Channel: Trelis Research
Views: 17,946
Keywords: llm inference, custom llm, custom language model inference, custom llm inference, language model inference, custom language model, inference a fine-tuned model, api custom model, custom llm api, llm api, api setup llm, setup api language model, batch inference llm, concurrent inference language model, llm batch inference, large language model inference, tgi setup, text generation inference, vLLM, vLLM setup, vLLM inference, runpod vLLM, runpod TGI, vastai vLLM
Id: 1TU9ZrZhqw0
Length: 51min 55sec (3115 seconds)
Published: Fri Dec 15 2023