Build an API for LLM Inference using Rust: Super Fast on CPU

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we'll explore the process of building an API for LLM inference using Rust. We will achieve fast results right on CPU, thanks to the llm library by rustformers. We'll dive into the llm crate and its interaction with LLMs, and see how it uses the GGML project's model quantization for efficiency. If you are not familiar with GGML, it is the tensor library behind llama.cpp (the "GG" comes from its author, Georgi Gerganov), and its quantized model format is what makes CPU inference practical. We'll also create a Rust-based web server for CPU-based AI inference, and then we'll demonstrate how to integrate that running server into a Streamlit application, making the LLM easily accessible.

You can see that I have here something called "LLM Inference API in Rust", a simple Streamlit UI that talks to the running web server we have written in Rust. You can see over here that the server is listening on port 8083, and my Streamlit app is running on port 8501. This is what we are going to build.

Now, I am on a GitHub repository called llm by rustformers. The rustformers library allows integration of different LLMs, including BLOOM, GPT-2, GPT-J, GPT-NeoX, LLaMA and MPT. To use these models within the rustformers library, they undergo a transformation to align with GGML's technical underpinnings. You can have a look at the documentation; it might look a little overwhelming if you are not versed with Rust, but we'll see how to set it up and run it. You can see that they use cargo, which helps you build and run your Rust code on Windows or whichever machine you are using. From this repository we have to install something called llm-cli; I will show you how to install all of this on your machine and set it up to run LLM inference.

Why are we doing this? Earlier I released a video on llama2.mojo, where we saw how to run these LLMs through different programming languages on commodity hardware like a CPU machine. You do not always need GPUs to run these LLMs. Python is slower when you compare it with languages like Rust or Mojo, so we are exploring other techniques for running LLMs with those languages. We have seen Mojo; now we'll see how to do it in Rust.

Let me go to Hugging Face and search for rustformers; you will find the open_llama GGML repository there. We are going to use this model, which is based on OpenLLaMA, an open reproduction of LLaMA. If you go to Files and versions, we are going to use the float16 version of OpenLLaMA 3B, which is about 6.85 GB. This is what we are going to download; I have already downloaded it, and now I'm running it through Rust behind a Streamlit application.

So what I'm going to do here is type something like "AI is a technology that can be", for example. Once you click on Generate Response, it connects with the running server, and in the server logs you can see that loading of the model is complete.
The model is fully loaded — the model size is roughly six and a half gigabytes, with 237 tensors — and now it takes your sentence and starts generating tokens from it. You can watch it produce: "AI is a technology that can be used to create art. There are many different ways in which AI has been used for creative purposes." Once this completes, the response shows up here in our Streamlit application. The reason I am creating this video is that I'm already working on a semantic search application powered by an LLM in Rust, to run on a CPU machine. You can see that this is our inference result on a CPU — really fast, isn't it? You can read: "AI is a technology that can be used to create art. There are many different ways in which AI has been used for creative purposes, from creating music and writing stories to producing visual images." So that is what we have done so far.

You can also open Postman and try the API endpoint directly. We have an endpoint called chat: the Rust web server is running on port 8083, and the path is /api/chat. In the body, under raw, we have a JSON object with a "prompt" field that takes your query, so your sentence goes in there — for example "AI is a technology". Once you click Send, the request goes to the server. The point is that you can take this running API endpoint and integrate it into an existing LLM application, or a new one you are building, to run on a CPU device or even on an embedded device like a Raspberry Pi with a smaller model (a minimal Rust client making the same request is sketched below).

So this is what we are going to build. I will show you how to take llm from rustformers and set it up on your machine, how to use the llm crate and hyper to build this API in Rust, and then how to use it from a Python application. Let's build this app, with the API running in Rust.

To do that, the first thing we need is this GitHub repository. As it says, the prime entry point for developers is the llm crate, and we have to install the CLI application, which is llm-cli. I already have it installed, so I'm not going to install it again, but I will show you how to do it. You come inside this rust llm folder — you can see I have created a folder here called rust llm; you can give it any other name. Inside it I have llm, the entire GitHub repository that I have cloned, and the model file, open_llama_3b-f16.bin, which is more than 6.5 GB. This is the LLM we are going to run. You can take any other GGML model, but make sure the llm library in Rust supports it — it supports LLaMA, MPT, the GPT family and so on. To set this up you run a cargo install command, which I will also put in the YouTube video description. Let me clear the terminal first.
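For illustration, here is a minimal Rust sketch of a client calling that endpoint — the same request the Postman example sends. It assumes the server from this video is running locally on port 8083, and it uses the reqwest crate (with the "blocking" and "json" features) plus serde_json, which are not part of the project shown in the video; treat it as a sketch, not code from the repository.

```rust
// Hypothetical client for the /api/chat endpoint described above.
// Assumes the Rust web server from this video is running on localhost:8083.
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // JSON body: {"prompt": "..."}, matching the raw body used in Postman.
    let mut body = HashMap::new();
    body.insert("prompt", "AI is a technology that can be");

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8083/api/chat")
        .json(&body)
        .send()?
        .json()?;

    // The server replies with {"response": "<generated text>"}.
    println!("{}", resp["response"]);
    Ok(())
}
```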
If you don't have Rust on your system, search for "install Rust" in Google and go to rust-lang.org. They provide an installer: download the 64-bit rustup-init.exe (or the 32-bit one, depending on your machine), and if you want to do it on WSL you can do that as well. I have a Windows machine, as you can see, so I installed it through that init exe. Installing Rust also installs cargo, which is a package manager that lets you build and run your Rust code from the CLI — think of it like pip in Python.

Now, what I'm doing here is cargo install llm-cli, pointing it at the GitHub repository: go to the repo on GitHub, click on Code, and copy the clone URL, so the command ends up as cargo install llm-cli --git https://github.com/rustformers/llm. Alternatively, you can first clone the repository and then run cargo install llm-cli inside it — there are a few different ways to install it.

For the model there are two options as well: you can come to the Hugging Face page, click the download link and save the file into the folder you are working in, or you can download it with a curl command — curl followed by the model URL will fetch it for you (you can see the terminal suggesting the previously run command). I'm not going to do that because I already have the model here in my folder.

Once you have installed llm-cli and you have the model file, you can run llm infer and pass a few arguments: -a for the model architecture, which is the LLaMA family; -m for the model file, which is the OpenLLaMA bin; and -p for your prompt — for example, "blockchain is a technology which is very helpful in the banking sector". Once I hit Enter, it reports that it loaded 237 tensors after 152 milliseconds — extremely fast, isn't it? — and you can watch the tokens being generated on a CPU machine at a very respectable tokens-per-second rate. You can read: "blockchain is a technology which is very helpful in the banking sector. It has been used for secure online transactions", and so on. Let me press Ctrl+C and end the program here; you can see I have just terminated it.
So that is the way you install llm-cli. First, install Rust from rust-lang.org and restart your system. Then create a directory, and inside it run cargo install llm-cli pointing at the GitHub repository. Then download the model — either directly, keeping it in the folder, or with curl -LO and the model URL — and run llm infer to try the model. I'm going to exit this terminal and close it now, because our focus is how to build the API in Rust and then use that running web server in a Streamlit application.

I'm not going to type the code from scratch because I already have it; I'll just explain it to you. If I go to my AI Anytime GitHub profile and into my gists, you can see a file called main.rs. Let me also show you the directory layout, in case you want to set it up from scratch. I already have a folder called rust API, with two subfolders — one called language model server, the other the UI piece — and I have copied the model file, the OpenLLaMA 3B bin, in there.

Here is how you create a project from scratch. First I create a directory, which I'll call rust demo, and open a terminal inside it. This is how you create a new Rust project: if you have worked with React or Angular, you run npx create-react-app and it scaffolds the folder structure for you — Rust provides the same thing, and cargo is extremely powerful for that. I run cargo new and give the project a name, so I'll call it llm_handler. Once you run cargo new llm_handler, it says it created a binary package — it creates the binary crate for you — and we will build that binary in a bit.

If you refresh, you will see a new folder called llm_handler. Go inside it and look at the structure: there is a folder called src (source) containing main.rs, which is where we will write all of our logic, and there is Cargo.toml, which is very important, because that is where we add our dependencies. Think of it like package.json, or like the file Poetry gives you in Python: it is where you define all of your dependencies.

So that is how you create the project. Now let me open my already-created folder — it doesn't make sense to duplicate the code; I just want to explain how it works. Here you have the language model server (which is what your llm_handler would become) and the UI folder, which is self-explanatory: it is the Streamlit application, and I will show that too.
Let's open this directory in VS Code: I'll close this window, open a terminal here and just run code . — and then I'll explain the entire code and the changes you have to make. Inside the project you will see the two folders and the model file, open_llama_3b-f16.bin.

In the language model server I have a Cargo.toml; this is where you add dependencies, and I'll explain which ones we use when I walk through the code. The structure is simple: a name, a version and an edition, and then the dependencies. I'm pulling in the llm crate from that same llm git repository, and then we have hyper, tokio and serde — serde is for serialization and deserialization, and I'll explain the rest shortly. This code will be available on my GitHub repository; you can take the same project, run it on your machine after installing Rust, and build it with the command I'll show later, cargo build --release.

Now let's go into the language model server and inside src. Have a look at main.rs, where I have made some changes, and let me explain what these imports and data structures mean. The first import is use hyper::service::{make_service_fn, service_fn}: hyper is a fast and efficient HTTP library in Rust, and it is what serves our API. Then we have SocketAddr, which specifies the socket address — the combination of IP and port — for the server we are going to run on 8083. Then we have serde, a very powerful serialization and deserialization framework in Rust — in Python you have serialization libraries like pickle; here we use serde, and we derive its traits on our structs for automatic serialization and deserialization.

The data structures we define are used for deserializing the JSON request data and serializing the response data: once the JSON request reaches the server, serde deserializes it, and for the reply it serializes our struct back to JSON. The first struct is called ChatRequest, and it represents the incoming JSON request containing a prompt field — exactly the request we made to the web server from Postman with "prompt": "AI is a technology".
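As a rough sketch, the two structs described here and in the next paragraph could look like the following — the struct and field names are taken from the description in this video (and assume serde with its derive feature), so treat them as an approximation rather than the exact code from the gist:

```rust
use serde::{Deserialize, Serialize};

// Deserialized from the incoming JSON body: {"prompt": "..."}
#[derive(Deserialize)]
struct ChatRequest {
    prompt: String,
}

// Serialized into the outgoing JSON body: {"response": "..."}
#[derive(Serialize)]
struct ChatResponse {
    response: String,
}
```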
The second struct, ChatResponse, represents the JSON response containing a response field — the response field you saw in Postman is exactly what ChatResponse produces.

Going back to the code, we have the inference function. This is the main workhorse of our Rust API server, because it is responsible for actually performing the LLM inference. A few quick points: it takes a prompt, which is a string, as input — this is simply how we define a function in Rust, much like we defined functions in my Mojo videos — and it returns a string containing the generated tokens (a rough sketch of what such a function can look like is included below). Further down you will find the error messages, the session preparation and so on. The function loads the language model, and here you have to give your path — you can see open_llama_3b-f16.bin. If your model lives somewhere else, please make sure you give the right path.

Further down there is a result variable (let result) that holds the result of the inference, and a closure that handles each inference response — every token the model produces passes through that closure. If anything goes wrong, for example a deserialization failure, the code prints the error and the server returns a 400 Bad Request.

We also have a request handler: an async function called chat_handler. It asynchronously handles the incoming request — the prompt we send to the model — by deserializing the JSON payload. It looks at the payload, deserializes it, and then calls the inference function with the received prompt. After that it constructs the response: it formats the inference result into the ChatResponse structure we defined at the top, and the response is serialized to JSON and sent back in the HTTP response through hyper.
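The video doesn't paste the full inference function on screen, but a minimal sketch of what such a function could look like, written in the style of the rustformers llm crate's documented examples, is below. The exact load/infer signatures changed between llm versions, so treat this as an approximation rather than the code from the gist; the model path and parameter choices are assumptions.

```rust
// Hedged sketch of an inference helper: take a prompt string, return the
// generated text. Based on the rustformers `llm` crate's example-style API;
// signatures differ between crate versions, so this is an approximation.
use llm::Model;
use std::convert::Infallible;
use std::path::Path;

fn infer(prompt: String) -> String {
    // Load the GGML model from disk (adjust the path to wherever the
    // open_llama_3b-f16.bin file actually lives on your machine).
    let model = llm::load::<llm::models::Llama>(
        Path::new("open_llama_3b-f16.bin"),
        llm::TokenizerSource::Embedded,
        Default::default(), // llm::ModelParameters
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("Failed to load model: {err}"));

    let mut session = model.start_session(Default::default());
    let mut generated = String::new();

    // The closure collects every generated token into `generated`.
    let result = session.infer::<Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: prompt.as_str().into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: None,
        },
        &mut Default::default(), // llm::OutputRequest
        |response| {
            if let llm::InferenceResponse::InferredToken(token) = response {
                generated.push_str(&token);
            }
            Ok(llm::InferenceFeedback::Continue)
        },
    );

    if let Err(err) = result {
        eprintln!("Inference failed: {err}");
    }
    generated
}
```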
Then we have a couple of utilities: a router and a not-found handler. The router matches the incoming request against the path and the HTTP method via hyper — you can see it matching on request.uri() and the method. If the path is our API endpoint, /api/chat, and the method is POST, it dispatches to the chat handler; if no match is found, the not_found function returns a 404 Not Found response. It is pretty intuitive code, and worth reading through yourself. Finally there is the main function: it initializes the server and starts listening for incoming connections. A rough sketch of how the handler, router and main fit together follows below.

That is what the code does; you can take the same code, build it and run it. Before you do, go back to your Cargo.toml and make sure you have these dependencies: hyper, tokio, serde, serde_json, llm and rand.

Now, how do you run it? You can see I am already running it, but let me write the commands into an instruction.txt in the folder so you have them too. There are two commands. First you build the binary that cargo created: cargo build --release. This compiles your code — with Rust you have to compile; it is not like Python — and produces an executable binary in the target/release directory of the project. Once the build succeeds, you run the server: cargo run --release. This starts the server listening on port 8083, as you can see over here: it says "server listening on port 8083", then loads the hyperparameters, the GGML context size and so on.

Once that is running, you can use Postman exactly as before: select a POST request, pass the URL ending in /api/chat, go to Body, click raw, and send a JSON object with the prompt field — and that's it.
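As another hedged sketch, here is roughly how the handler, router and main described above could be wired together using hyper 0.14-style APIs (hyper with the "full" feature and tokio with the "full" feature), reusing the ChatRequest/ChatResponse structs and the infer function from the sketches above. The real main.rs in the gist may differ in its details.

```rust
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use std::net::SocketAddr;

// Asynchronously handle POST /api/chat: deserialize the JSON payload,
// run inference, and serialize the result back as JSON.
async fn chat_handler(req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
    let body_bytes = hyper::body::to_bytes(req.into_body()).await?;
    match serde_json::from_slice::<ChatRequest>(&body_bytes) {
        Ok(chat_request) => {
            // Note: infer() is CPU-heavy and blocking; a production server
            // would move it onto a blocking thread (e.g. spawn_blocking).
            let inference_result = infer(chat_request.prompt);
            let response_message = ChatResponse { response: inference_result };
            let json = serde_json::to_string(&response_message).unwrap();
            Ok(Response::new(Body::from(json)))
        }
        // Deserialization failure -> 400 Bad Request, as described above.
        Err(_) => Ok(Response::builder()
            .status(StatusCode::BAD_REQUEST)
            .body(Body::from("Invalid JSON"))
            .unwrap()),
    }
}

// Route on method + path; anything other than POST /api/chat gets a 404.
async fn router(req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
    match (req.method(), req.uri().path()) {
        (&Method::POST, "/api/chat") => chat_handler(req).await,
        _ => Ok(Response::builder()
            .status(StatusCode::NOT_FOUND)
            .body(Body::from("Not Found"))
            .unwrap()),
    }
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 8083));
    let make_svc = make_service_fn(|_| async { Ok::<_, hyper::Error>(service_fn(router)) });
    println!("Server listening on port 8083");
    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {e}");
    }
}
```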
Now, how can you use this running server in a Streamlit app? That is what the UI folder is for. There is a requirements.txt with the two libraries, streamlit and requests, and in app.py we define the API URL; after that it is pretty much self-explanatory. You have a title, a prompt text area, and a generate_response function that calls the API URL with the prompt and has some basic error handling: if the response status code is 200, show the result; otherwise return an error. Then you run it with streamlit run app.py: go inside the UI folder, activate the virtual environment, and run it, and you will see the app pointing at the API.

Let's try one more prompt just to show you — something about mental health: "mental health is very important". Click Generate Response and go back to the server logs: it is extremely fast — in 242 milliseconds the model has been fully loaded — and it takes your prompt and starts generating: "mental health is very important to me. I want to be able to take care of myself and not just my physical needs", and so on. Once it is done, the response appears in the Streamlit UI; let's wait for it. Maybe I should have also shown streaming responses — I missed that, but you can give it a try and I will look into it, because this is the first video in this series and I'm excited about it; as I said, I'm working on a semantic search application next. You can see the inference result now — we could also post-process the output to make it look a bit nicer — "mental health is very important to me. I want to be able to take care of myself and not just my physical needs but also my mental ones as well." I could also have used stop tokens to end the generation on a full stop, since this is a LLaMA model.

This is what I wanted to show you: a fairly foundational concept plus a hands-on application, in a short video. The main goal was to bridge theory with practice — how you can use the llm crate along with the hyper library to create a server capable of understanding and executing LLM inference. In the next video I will create a semantic search application using the same techniques and the same model, and we can bring our own data, perhaps in JSON format, as the input. Meanwhile, you can try this code — I will put the entire project on my GitHub repository — so go ahead, set it up on your machine, and let's see in the next video how we can build cool applications that run on CPU machines using large language models.

If you have any thoughts or feedback, please let me know in the comment box. You can also reach out to me through my social media channels; you can find them on the YouTube banner and in the About section of my channel. That's all for this video. Please like, comment, subscribe if you are new to the channel, and share the video and the channel with your friends and peers. Thank you so much for watching — see you in the next one.
Info
Channel: AI Anytime
Views: 6,419
Keywords: llama, llama 2, rust programming, rust language, rust, gen ai, generative ai, langchain, fine tuning llm, rust api, llama api, llama cpp, build a chatbot using llm, llama index, weaviate, pinecone, python, coding, tech, meta ai, gemini, gemini video, gemini llm, chatgpt, siraj raval, onelittlecoder, prompt engineering, andrew ng, chatbot, python for beginners, generative ai for beginners, llm for beginners, mukesh dey, india, youtube, youtube video, mojo, rust llm inference
Id: X4yOi6y8uHI
Length: 28min 40sec (1720 seconds)
Published: Mon Sep 25 2023