Chat with Documents is Now Crazy Fast thanks to Groq API and Streamlit

Captions
So this whole thing took one-tenth of a second, which is insanely fast. In the last video I showed you how to get started with the Groq API; in this video I want to show you how to build a RAG pipeline using the Groq API. We will build this simple RAG pipeline, and later in the video I'll show you how to package everything within a Streamlit app. First I'll walk you through the code, then we're going to look at the Streamlit application, and we will also discuss some things that you need to consider while you're wrapping everything up in a Streamlit app.

First we need to install all the required packages. We're going to be using BeautifulSoup 4 to download data from a website, FAISS for the vector store, Ollama for serving embeddings, Streamlit for the UI, and Groq for accessing the Mixtral model. Everything is going to be integrated using LangChain, and we're going to use the python-dotenv package to store our secrets, which in this case is the Groq API key.

Next we need to import all the required packages. We are importing Streamlit, ChatGroq (that's from within LangChain), and the Ollama embeddings; apart from that, we are importing all the basic things that we will need to create our RAG pipeline.

Here's a quick overview of what the RAG pipeline is going to look like. We have a document; in this case it's going to be the content of a website. We split it into chunks, then for each chunk we compute the embeddings and create a vector store. Whenever the user asks a question, we calculate embeddings for that question, do a similarity search on the vector store, and get the relevant chunks. The relevant chunks plus the query go as context to the LLM, and we get back a response.

Now, in order to create this RAG pipeline, first and foremost we are going to load our secret API key. In the previous video I showed you how to get an API key from Groq, so if you haven't seen that video, I highly recommend watching it. In this case we are using load_dotenv to load the key and set it as an environment variable.
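Here is a minimal sketch of this setup. The package names on the install line and the exact import paths are my assumptions about the langchain-community / langchain-groq layout; they are not spelled out verbatim in the video:

```python
# Assumed install line (pin versions as needed):
#   pip install beautifulsoup4 faiss-cpu streamlit langchain langchain-community langchain-groq python-dotenv

import os

import streamlit as st                                       # UI
from dotenv import load_dotenv                               # python-dotenv: reads secrets from a .env file
from langchain_community.embeddings import OllamaEmbeddings  # embeddings served by a local Ollama instance
from langchain_groq import ChatGroq                          # chat models behind the Groq API (e.g. Mixtral)

# Load the Groq API key from .env and expose it as an environment variable.
load_dotenv()
groq_api_key = os.environ["GROQ_API_KEY"]
```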
Okay, once we have our API key, we need to get our data. In this case we are going to download one of the essays written by Paul Graham; this essay contains text that spans over 26 pages. We download the data using the WebBaseLoader and then load the documents from this loader.

Once we load our document, the next step is to chunk it, and for that we're going to be using the RecursiveCharacterTextSplitter. If you want to understand the internal workings of this splitter, I have a detailed video on how to split your documents into chunks. In this case we are defining a chunk size of 1,000 characters with an overlap of 200 characters. Now, this is a very naive approach; there are actual methods for figuring out the chunk size, and I am creating a whole advanced series on that, so if you are interested, make sure to sign up; the link is going to be in the video description. Once we have our text splitter created, we need to actually split the documents, so we run our documents through the splitter, and that gives us our chunks.

Once we have that, the next step is to calculate embeddings for all those chunks. In this case we're going to be using the Ollama embeddings, but you can use any embeddings you want; there are a whole bunch of open-source embeddings available, and you can even use the OpenAI embeddings with this model. With those we create our vector store.

Once the vector store is created, next we need to create our LLM, so we're going to be using ChatGroq, and we will provide our API key. Currently the API supports two different models: one is the Mixtral MoE and the other is the Llama 2 70 billion model. For this experiment we are going to be using the Mixtral model.

All right, once we have that, we need to define our prompt template. In this case the prompt template is: "Answer the following question based only on the provided context. Think step by step before providing a detailed answer. I will tip you $200 if the user finds the answer helpful." Then we provide our context, and it will expect a user input.

Once all this is set up, we need to create a chain. In this case we're using create_stuff_documents_chain; we provide our LLM, which is Mixtral through the Groq API, and the prompt. The next step is going to be to create a retriever based on the vector store; for this example I'm keeping everything in memory and not actually persisting it to disk. We take the retriever, then we create a retrieval chain, and in that case we need to provide the retriever and the document chain that we created before. Whenever you provide a user input and invoke the chain with it, you're going to get a response. Okay, so that was a quick overview of the code.
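Stitched together, the walkthrough above corresponds to roughly the following script. Treat it as a sketch: the essay URL, the Groq model identifier, and the exact prompt layout are my assumptions based on what the video describes, not code taken from it:

```python
import os

from dotenv import load_dotenv
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # makes GROQ_API_KEY available via os.environ

# 1. Download the essay; WebBaseLoader uses BeautifulSoup under the hood.
loader = WebBaseLoader("https://paulgraham.com/greatwork.html")  # assumed URL
docs = loader.load()

# 2. Chunk into 1,000-character pieces with a 200-character overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = splitter.split_documents(docs)

# 3. Embed every chunk with the locally served Ollama model and build an
#    in-memory FAISS vector store (nothing is persisted to disk).
embeddings = OllamaEmbeddings(model="llama2")
vector_store = FAISS.from_documents(documents, embeddings)

# 4. The LLM: Mixtral served through the Groq API.
llm = ChatGroq(
    groq_api_key=os.environ["GROQ_API_KEY"],
    model_name="mixtral-8x7b-32768",  # assumed model identifier
)

# 5. Prompt template along the lines described in the video.
prompt = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context.
Think step by step before providing a detailed answer.
I will tip you $200 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}"""
)

# 6. Stuff the retrieved chunks into the prompt, and put a retriever in front.
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(vector_store.as_retriever(), document_chain)

# Invoking the chain with a user question returns the answer plus the chunks used.
response = retrieval_chain.invoke({"input": "What does the author say about how to do great work?"})
print(response["answer"])
```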
Now let's see this in action. We need to do a couple of things: I put together a Python script that is going to be running the Streamlit app, but we actually need to start a server that runs the Ollama embedding model. In order to do that, I have already installed Ollama, so all I need to do in my terminal is type "ollama run" and then the Llama 2 7 billion model. Basically, this is going to be using the embeddings of the Llama 2 7B model. Right now the embedding server is running, so we can go back to our code.

Let me quickly walk you through the Streamlit part. Here we are loading the embedding model that we want to use. This is a single script that is going to run, so every time it runs it's going to load or download the data from the web page. Let me actually show you the contents of the web page. Here's the website that we are looking at; I'm specifically reading the essay "How to Do Great Work". It's written by Paul Graham and it's around 27 pages long, and it is going to be used as our knowledge base.

Now, if you just create embeddings the way I showed you before, it will recreate the embeddings for every question that you run, and that takes a long time. So instead we store our vector store in the session state. Basically, when we launch our app, it first checks if the vector store is in the Streamlit session state. If it's not there, that means we are launching it for the first time, so it will load the embedding model and then go through the whole process of downloading the data from that website, splitting it, creating the embeddings, and then creating a vector store and storing it there. This is going to be important; I'll actually show you the real time that it takes to create the index, and that's the majority of the time that is going to be spent when we are running the app.

Once we have that, we define our LLM and our prompt template as I showed you before, and we also create both our document chain as well as the retrieval chain. I think I could also put these inside session state so that each run, or each question, is not going to recreate them; that might be something that you want to experiment with. Then we get input from the user. After getting input from the user, I also wanted to check how long it takes to call the API and go through this whole retrieval chain. For that I start a timer: we record the start time, process the request, get a response from the API, then stop the timer and note down the elapsed time, so for each request I can actually see how long it takes. Once we get the response, we write it out, and we also show the context, that is, the chunks that were used to get that response.
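Wrapped in Streamlit, the script might look roughly like this; the same assumptions as above apply (essay URL, model identifiers), with the session-state check and the timer the walkthrough describes:

```python
import os
import time

import streamlit as st
from dotenv import load_dotenv
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

# Build the vector store only once per session: the first run downloads, splits,
# and embeds the essay; later runs (one per question) reuse the session state.
if "vector" not in st.session_state:
    embeddings = OllamaEmbeddings(model="llama2")
    docs = WebBaseLoader("https://paulgraham.com/greatwork.html").load()  # assumed URL
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    st.session_state.vector = FAISS.from_documents(chunks, embeddings)

# These are rebuilt on every rerun; as noted above, they could also be cached
# in session state so each question does not recreate them.
llm = ChatGroq(
    groq_api_key=os.environ["GROQ_API_KEY"],
    model_name="mixtral-8x7b-32768",  # assumed model identifier
)
prompt = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context.
Think step by step before providing a detailed answer.
I will tip you $200 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}"""
)
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(st.session_state.vector.as_retriever(), document_chain)

st.title("Chat with Docs - Groq Edition")
user_input = st.text_input("Input your question here")

if user_input:
    start = time.process_time()  # time the retrieval plus the Groq API round trip
    response = retrieval_chain.invoke({"input": user_input})
    st.write(f"Response time: {time.process_time() - start:.2f} seconds")
    st.write(response["answer"])

    # Show the chunks that were used as context for this answer.
    with st.expander("Document Similarity Search"):
        for doc in response["context"]:
            st.write(doc.page_content)
            st.write("---")
```

Start the embedding server first with "ollama run llama2", then launch the app with "streamlit run langchain_rag_app.py" (matching the script name mentioned in the video).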
Okay, so I created a new virtual environment called "youtube"; in the previous video I showed you how to do that, and all the requirements are going to be in the repo that I'm going to post. We just run "streamlit run" on the script, which is called langchain_rag_app. Now a couple of seconds have elapsed, and it is actually computing those embeddings. I wanted to see how long it takes to compute the embeddings and create the vector store for 27 pages; this is going to be the majority of the time spent by the script. One way you could handle it is to create the embeddings using a separate script and persist them to disk, and whenever you load the app you just load the embeddings. In this case I'm not doing that, but I have created other videos which show you how to do that, so I'm going to put a link to those videos.

Okay, our app is ready to interact with, so let's ask a simple question about the author's opinion on how to do great work. This is going to be real time: I hit enter, it's running, and here's the response that we got. It says that based on the provided context, the author suggests several things when it comes to doing great work: hard work, undirected thinking, avoiding distraction, cultivating taste. And here are the different chunks that the model is using. Now, here's the actual time it took: this is the CPU time between asking the question, retrieving the relevant chunks for it, passing those through the API, and receiving a response. This is insanely fast; this whole thing took one-tenth of a second.

Next we ask about the author's opinion on impressing people, and this time I think it was even faster than before. For some reason it was entered twice, and both times it took 0.07 seconds. This is an end-to-end RAG pipeline: retrieval using the embedding model, passing the retrieved chunks as context to the LLM, sending it to the Groq API, and getting a response, which is pretty crazy.

Okay, so this was pretty fast. Now, I'm not really concerned about the responses that we get; there are advanced techniques we can use to improve our RAG pipeline, and as I said, I am putting together a whole series on that, so make sure to sign up if you are interested; a link is going to be in the video description. If you're working with LLMs, building RAG pipelines, and looking for advice, I offer consulting services; if you are interested, details are in the description. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
Info
Channel: Prompt Engineering
Views: 19,305
Keywords: prompt engineering, Prompt Engineer, LLMs, AI, artificial Intelligence, Llama, GPT-4, fine-tuning LLMs, Groq, Groq AI
Id: _IcgfbXAAPM
Length: 12min 17sec (737 seconds)
Published: Sat Mar 02 2024