Step-by-Step Guide to Building a RAG LLM App with LLaMA 2 and LlamaIndex

Video Statistics and Information

Captions
Hello guys, my name is Krish Naik and welcome to my YouTube channel. We are continuing the LlamaIndex series, and in this video we are going to create an amazing RAG system using open-source models like Llama 2. This will be a complete step-by-step tutorial: I will try to code the whole thing from scratch, and we'll see how a RAG system can be implemented. It is super beneficial to know the different ways of creating a RAG system, not only with the OpenAI APIs; you should also know other models like Llama 2 and Mistral. In the upcoming videos of this series I will cover those models, but in this one I'll focus on Llama 2.

Let me quickly share my screen. I've opened Google Colab, and you can work along in Colab as well. When I upload the fine-tuning video, you will need GPUs in your runtime, so later on you'll see me using a V100 or A100 GPU so that fine-tuning also happens quickly; that is the plan as we go ahead. Right now, just to create this RAG system, we will use open-source models. The idea is this: if I have a lot of PDF documents, I should be able to load them, index them, and query them through the Llama 2 model itself. I will be using Llama 2 from Hugging Face, so you'll also get an idea of how to call any model from Hugging Face.

The first step is to install pypdf, since I'm going to work with PDFs. I've already created a data folder and uploaded two PDFs: the "Attention Is All You Need" paper and the YOLO paper. We'll first extract the text from these documents, then index them, and later query them. From my personal experience, the accuracy you get from LlamaIndex when you query for information is quite amazing.

After pip install pypdf, the next thing is a set of libraries: transformers, einops, accelerate, langchain, and bitsandbytes. We need LangChain because, as I said, we're going to combine LlamaIndex and LangChain for this purpose. We need bitsandbytes because of a process called quantization: most of these open-source LLMs ship in 16-bit precision, but since I'm working in Google Colab I will quantize the weights down (to 8-bit here), and bitsandbytes is used internally for that. I need transformers because it lets me create a pipeline that can use the Llama 2 model with whatever input I give it, and accelerate speeds up loading that entire pipeline.
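A minimal sketch of those installs as Colab cells (pin versions if you need reproducibility):

```python
# Colab cells; the leading "!" runs a shell command.
!pip install pypdf
!pip install -q transformers einops accelerate langchain bitsandbytes
```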
With those installed, one more library I'll need is for embeddings, so I'll add an "embeddings" comment in the notebook. There are different libraries for this; the one I definitely want to show you is sentence-transformers, installed with pip install sentence_transformers. It helps you perform embeddings, and it considers the entire sentence for that purpose. Here you can see sentence-transformers 2.3.1 is already installed. So those are the main installations: pypdf for loading the PDF docs; transformers, langchain, and bitsandbytes (for the quantization process); and sentence-transformers. The reason I'm making all these videos is that many people in companies are using this, so it will definitely be beneficial for you. Please make sure that you hit like; let's keep the target at 1000 likes.

The next thing to install is llama-index itself, since LlamaIndex is the main library this tutorial is built around. The notebook will be provided in the description of this video. You may see some errors during installation, but don't worry, it installs successfully.

My next step is to load all the PDFs from the data folder; you can put any number of PDFs there based on your requirement. First I will import VectorStoreIndex and SimpleDirectoryReader from llama_index. VectorStoreIndex is for the indexing: once we read the entire text and get the documents, we need to do the embedding part, which I've already shown in my previous tutorial. Next is ServiceContext, which is super important because it will help you combine the Llama 2 model with the prompt template that we create. And since I want to call the Llama 2 model from Hugging Face, I will also import HuggingFaceLLM from llama_index.llms, which lets you interact with the Hugging Face Hub and load the entire Llama 2 model from there.
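A sketch of those installs and imports, using the pre-0.10 llama_index import paths that match what is shown on screen:

```python
!pip install sentence_transformers
!pip install llama-index

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
```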
Then, from llama_index.prompts.prompts, I'm going to import SimpleInputPrompt. I'll show you different types of prompts as we go ahead, but here we're specifically going to use SimpleInputPrompt. So those are the imports for this purpose.

Now that all the libraries are imported, it's time to use SimpleDirectoryReader: I give it the path of my data folder, /content/data, and load the data. In short, I get all my documents, and I can print them to check; all the contents of the PDFs are available there. Once you have the documents, the next step is to apply your prompt template and use the Llama 2 model for the further process, whether to get a summarization or to query anything you want.

So here I create my system prompt. Let's say I give this prompt: "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided." Whenever an instruction and context are provided, it is going to look at the index and give me a response.

Once we create the system prompt, the prompt needs to be given in a specific format whenever we use a Llama 2 model. We'll use a default format supported by Llama 2, wrapped in a SimpleInputPrompt; you can find the exact syntax on the documentation pages for using Llama 2. It looks like <|USER|> followed by the query string and then <|ASSISTANT|>, and you can make minor changes to it. The query string is the variable parameter; we'll talk about how it gets populated, but user, query string, assistant is the format we usually use with Llama 2.
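A sketch of the document loading and the two prompts (the wrapper tokens below follow the format shown on screen; exact tokens can vary by Llama 2 variant):

```python
from llama_index.prompts.prompts import SimpleInputPrompt

documents = SimpleDirectoryReader("/content/data").load_data()
print(documents)  # inspect the parsed PDF contents

system_prompt = (
    "You are a Q&A assistant. Your goal is to answer questions as "
    "accurately as possible based on the instructions and context provided."
)

# Default Llama 2-style wrapper; {query_str} is populated at query time.
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
```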
So I create the variable query_wrapper_prompt to hold this format and execute; now I have my default system prompt and this particular format for Llama 2.

Next I'm going to do the Hugging Face login with huggingface-cli login, since we're going to pull Llama 2 from the Hub. When I execute it, it asks for a login token, so I copy my token, paste it, and confirm. Here you can see the token has been saved to the cache and the login is successful. Don't worry about the prompt asking you to store it as a git credential; you don't need that since we're only working on this one use case.

Now the next step is how you can call the Llama 2 model from Hugging Face. For this we use torch, so you see import torch, and then HuggingFaceLLM. We set a context_window of 4096 (internally it sets up the context window size, and the overlap size is also there), max_new_tokens of 256, and the generate arguments; here I'm providing temperature, and I hope you understand its importance: temperature ranges between 0 and 1 and indicates how creative our LLM model should be. system_prompt is the system prompt we created, and query_wrapper_prompt is the wrapper we already created above. Then comes tokenizer_name: if you search for Meta's Llama 2, this is Llama 2 with 7 billion parameters, so that is the model we're using, which is why the name is written like this; the model_name is the same meta-llama model. device_map will be "auto", and there is a line you can uncomment if you're using CUDA and want to reduce memory usage: load_in_8bit=True means quantization is happening; the whole model is in 16-bit, but we quantize it to 8-bit, which is why the value is True. The data type for the model will be float16.

As soon as I execute this, it loads the model from Hugging Face and quantizes it, which will take some time; it hardly takes a couple of minutes, so we'll wait until it finishes. And now here you can see the entire model has been loaded and is available in the llm variable.
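A sketch of that login and model setup (keyword arguments follow the pre-0.10 HuggingFaceLLM signature; the 7B chat checkpoint id is an assumption, since only "Llama 2 7 billion" is named in the video):

```python
!huggingface-cli login  # paste your Hugging Face access token when prompted

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",  # assumed chat variant
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    # load_in_8bit quantizes the 16-bit weights so the model fits Colab memory
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True},
)
```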
Now the next step is super important: embeddings. I need to take the entire set of documents, embed them, and index them. So I'm going to import some libraries. Since we're using Hugging Face itself, from langchain.embeddings.huggingface I'm going to import HuggingFaceEmbeddings; with its help I will call a sentence-transformers model from Hugging Face. There are multiple embedding techniques you can use; I'll show you one, and we'll see other embedding techniques in upcoming videos. Then I will also import ServiceContext from llama_index, since we need to combine each and every piece, and LangchainEmbedding from llama_index.embeddings. Those are the three libraries for this purpose, and once we use them you'll see how we combine them, which is the most important thing.

I'll create my embed_model as a LangchainEmbedding wrapping a HuggingFaceEmbeddings instance, where the model_name I'm using for my embeddings is sentence-transformers/all-mpnet-base-v2. To understand what this does: it is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. As I said, Hugging Face has almost everything, so we'll use this one. Once I execute this, I get my embedding model; it also gets downloaded, just like we downloaded Llama 2.

Finally, I have my embedding model, my LLM, and my documents, and I'm going to combine all of them with the ServiceContext I mentioned. To give you an idea, if you click through to the ServiceContext docs, you'll understand why it's used: a ServiceContext is a bundle of commonly used resources during the indexing and querying stage of a LlamaIndex pipeline or application. So if you want to combine an embedding technique, an LLM, prompts, and documents for indexing and querying, it bundles them all together; that is the reason we use it here.
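A sketch of the embedding setup described above (again using the legacy import paths; the checkpoint is the standard all-mpnet-base-v2 model named in the video):

```python
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import ServiceContext
from llama_index.embeddings import LangchainEmbedding

# Maps sentences and paragraphs to a 768-dimensional dense vector space.
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
```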
So let me write service_context = ServiceContext.from_defaults(...); from_defaults means whatever value I don't mention keeps its default. Here I mention all the important information: first the chunk_size, the size of the chunks I want to create from the documents, will be 1024 (you can change it); then llm is initialized to my LLM model, and the next parameter, embed_model, is set to my embedding model. Once I give the embed model and execute, it creates the entire service context, and if I display it you can see the LLM predictor and the other info like the context window; everything is given there.

Now we're going to use VectorStoreIndex to convert our data into indexes. I'll write VectorStoreIndex.from_documents, give it my documents, and pass service_context set to the service context we created. So this looks good: the vector store is created, and the data gets converted into the index that I want.

With that done, I can convert this index into a query engine: here is my index, a VectorStoreIndex, and to convert it I write index.as_query_engine() and assign it to query_engine. As soon as I do this, you can ask any question you want. Let's say I write query_engine.query("What is attention is all you need?"); if I execute it, I get my entire response, and this is actually happening with the help of Llama 2 working with Hugging Face entirely. In the upcoming videos I'll show you Mistral and other open-source models like Falcon and many more, so that if your company uses these, you can call them directly through the Hugging Face API; all those things will come.

So let's execute this; it takes some time the first time, I think for loading. Here is my entire response. Now let me save it in a variable called response and print it; right now everything is displayed as a raw object, but when we print the response you'll be able to read it properly. Not only that, you can also run other queries, like "What is YOLO?", or whatever else you want to get from those PDFs. So this, in short, is a RAG (retrieval-augmented generation) system with the help of Llama 2. If I print my response you can see: "Attention is a powerful tool in NLP, but it is not the only thing you need to build a successful model"; everything comes from the document. And not only for this query: if you want to make this really fast, you really need to have a good GPU.
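A sketch of the indexing and querying steps described above:

```python
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    embed_model=embed_model,
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

response = query_engine.query("What is attention is all you need?")
print(response)
```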
The same goes for the response time: if I query "What is YOLO?" and print that response, again I'm telling you, if you really want to make it fast, then in upcoming videos when I change this runtime to a better GPU the responses will be much faster, because you need parallel processing to generate the entire output. If I print the response for "What is YOLO?", it's good enough; you can see "YOLO is a real-time object detection system that uses a single neural network...", and everything is there. And if there is no matching context in the documents, it will not display anything.

So I hope you got an idea of how this works; just go ahead and implement it. I'll give you one challenge: try to convert this into an end-to-end app with the help of Streamlit, running locally. Anyhow, I'll be uploading those videos as we go ahead. So yes, that was it from my side. If you liked this particular video, please make sure that you subscribe to the channel and press the bell notification icon; everything will be provided in the description of this video. I'll see you in the next video. Have a great day. Thank you all, take care, bye-bye.
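For the challenge above, here is one hypothetical shape a minimal Streamlit wrapper could take; the file name and layout are my own illustration, not from the video, and it assumes the llm and embed_model objects are built exactly as in the notebook:

```python
# streamlit_app.py -- hypothetical sketch, not from the video.
import streamlit as st
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

@st.cache_resource  # build the index once per session, not on every rerun
def get_query_engine():
    # llm and embed_model would be constructed here as in the notebook above
    documents = SimpleDirectoryReader("data").load_data()
    ctx = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model=embed_model)
    return VectorStoreIndex.from_documents(documents, service_context=ctx).as_query_engine()

st.title("RAG over PDFs with Llama 2")
question = st.text_input("Ask a question about your PDFs")
if question:
    st.write(str(get_query_engine().query(question)))
```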
Info
Channel: Krish Naik
Views: 25,774
Keywords: yt:cc=on, LLaMA2, LLaMAindex, RAG LLM, LLM App Development, Retrieval-Augmented Generation, LLaMA Tutorial, Advanced LLM Techniques, AI Programming, LLaMA2 Integration, LLaMAindex Tutorial
Id: f-AXdiCyiT8
Length: 24min 9sec (1449 seconds)
Published: Wed Jan 31 2024