Chatbot with RAG, using LangChain, OpenAI, and Groq

Video Statistics and Information

Captions
Retrieval Augmented Generation, or RAG. In this video, let's see how to create a chatbot from scratch using this technique. I will assume that you have no prior knowledge of building a chatbot; still, by the end of this session, you will be able to create a chatbot using OpenAI's GPT models and the LangChain framework. This chatbot will be able to answer questions about any internal documentation of an organization.

To get started, let me take you through what we are actually going to build at a very high level. We will create a RAG pipeline that retrieves information and generates output. Typically, we input a query into an LLM to get an output. This works well for general knowledge, or knowledge the LLM has seen before, but not always. The problem is that many LLMs have not seen a lot of the information we would like to ask about. For example, if I ask what makes Mistral 7B special, most LLMs wouldn't have the answer at the time of recording. Mistral is a recent language model that wasn't included in most LLMs' training data, so many LLMs are not familiar with Mistral 7B; they usually provide incorrect information or make something up when asked about it. To avoid this, a Retrieval Augmented Generation pipeline is used, and today we will see how to create this pipeline. It is disappointing when LLMs that are supposed to be so knowledgeable cannot answer such questions correctly. This is because an LLM's knowledge is limited to what it learned during training, and it has no access to the outside world. As a result, the response we received was a hallucination.

So let's get started by building a chatbot that has this specific limitation; we will explore how to create it and see the limitation in action. To do this, we will follow a notebook I will link to in the video description. We need a few libraries, such as LangChain, OpenAI, Hugging Face Datasets, the Qdrant client, and tiktoken. In this case, we will be relying heavily on the LangChain library to use the GPT-3.5 Turbo or GPT-4 models from OpenAI. We import the ChatOpenAI object, which is essentially just a chat interface, an abstraction in LangChain. LangChain is a useful tool for creating more complex AI systems. While you can also use the OpenAI API directly to access various models, LangChain provides additional components that can be easily integrated into your system; for example, you can add a RAG pipeline efficiently.

To initialize our chat model, we format our messages into the structure typical of OpenAI chat models. A chat log is composed of a system prompt at the top, which gives instructions to the model, followed by alternating turns between the user and the assistant (the AI); this pattern repeats throughout the log. When using the OpenAI API, the chat log is represented as a list of dictionaries, where each dictionary contains a role and the corresponding text content. The LangChain format is slightly different but follows a similar structure: it has a system message, a human message, and an AI message. The system role corresponds to the system message, the user role corresponds to the human message, and the assistant role corresponds to the AI message. The content for each message is provided in the same way, making it easy to access and use. This is the LangChain version of what I have just demonstrated. Let's initialize it, pass all of it to our OpenAI chat object, run it, and observe the response generated.
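As a reference, here is a minimal sketch of what this setup might look like in code (import paths assume a recent LangChain release; the notebook linked in the description may differ slightly):

```python
# A minimal sketch of the chat setup described above, assuming the
# langchain-openai package and an OPENAI_API_KEY environment variable.
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

chat = ChatOpenAI(model="gpt-3.5-turbo")

# The LangChain equivalent of the OpenAI role/content dictionaries.
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great, thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand what machine learning is."),
]

response = chat.invoke(messages)
print(response.content)
```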
I receive a response about machine learning, which is what I asked about. I don't know offhand whether the response is fully accurate, and it could be assuming things, but I trust that it does know the answer. Let's print it out in a more readable format. This response is an AI message, which we can add to our chat log to continue the conversation, so I will append it to our messages list. Next, I will create a new prompt to ask a different question. I won't mention machine learning this time; instead, I will ask about the difference between supervised and unsupervised, without mentioning the word learning. In a chat model, we rely on the history of the previous conversation to guide the response. This is why we add the model's response to the messages, followed by the new prompt, before sending them off to GPT-3.5. You will notice that the model mentions the difference between supervised and unsupervised learning, showing that it keeps track of the conversational history. We now have a chatbot, and that was easy to put together; there is nothing complicated going on there.

Let's delve deeper into hallucinations and why they occur. LLMs hallucinate for various reasons, one of which is that they depend only on knowledge learned during training. This means that an LLM only knows what is in its training data; it doesn't learn by exploring the world. If something is not in its training data, it won't know it, and even if it is, it might not be accurate, leading to ambiguity. The purpose of an LLM is to compress the information in the training data into an internal model of the world as it existed in that data set. However, this approach has limitations because the model only has access to the training data. This is where RAG comes in to handle the problem. The middle component here can be a RAG pipeline, Google Search, or a link to an external source like a SQL database. It only connects to a subset of the external world, not the entire world.

We want to enable a capability that our current LLM doesn't have. Our LLM is trained on a specific data set and only understands things based on that data. This type of knowledge is called parametric knowledge because it is stored within the model's parameters. These parameters are only ever changed during training and are frozen afterward. Our LLM only uses parametric knowledge, which means it can't learn new things or adapt to new situations. But with RAG, we can add a memory component that can be modified and updated, giving us a long-term memory. This allows our system to learn and adapt to new situations beyond what it has already learned. In the case of RAG, an external knowledge base, or external memory, is used in the form of a vector database. Plugging the database into your LLM has the advantage that you can add, delete, and manage the memory and knowledge of your LLM. This is really cool because it's almost like managing or updating the information in a person's brain. It may sound a bit dystopian, but it is a good analogy for what we are doing with LLMs.

Here we are using a method called source knowledge. This differs from parametric knowledge because the knowledge is not stored in the model's parameters. Instead, source knowledge refers to anything we insert into the model via the prompt; any information that goes through the prompt is considered source knowledge. When adding source knowledge to our LLM, the prompt looks like this: at the top, we typically have some instructions; at the bottom, we have the user's query, which is a question; and in between, we insert the external information, or source knowledge. We can call this information a context, a document, or many other things, but here we will refer to it as context.
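As a rough sketch of that prompt layout (the exact instruction wording used in the notebook may differ; the text snippets and query below are purely illustrative):

```python
# Illustrative chunks of external text (source knowledge) and a user query.
documents = [
    "LLMChain is the most common type of chain in LangChain.",
    "An LLMChain combines a prompt template with an LLM.",
]
query = "What is LLMChain in LangChain?"

source_knowledge = "\n".join(documents)

# Instructions at the top, context in the middle, query at the bottom.
augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

print(augmented_prompt)
```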
Before building the pipeline for this task, let's manually insert some source knowledge into our model and observe its impact on the performance. We will ask what makes Mistral 7B special and analyze the model's output. Again, the model is not familiar with Mistral 7B. Let's try another one: I will ask about LLMChain in LangChain. This is another example where the model can't find any specific information, here on LLMChain in LangChain, even when I modify the question slightly; the structure of the question remains the same.

All right, let's try using the source knowledge approach. I got some information about LLMChain from the LangChain website; we can see the information is quite long. I extracted some relevant details about LangChain, chains, and LLMChain. The plan is to concatenate all this information to form our source knowledge. Then we can use the same prompt structure as before and see what happens. We have the prompt, the instructions, and the context. Now I will feed the query into our chatbot and see what happens. The response is: in the context of LangChain, LLMChain is described as the most common type of chain within the framework. It's almost identical to the information on the website, but given a more readable format and adapted to our specific question. There is a lot of information in there, and as far as I know, it is all accurate. We received a very good answer just by adding some text to the prompt. But doing this by hand every time is not practical. Our goal is to automate this process and apply it to many documents, which is where RAG comes in.

We need to focus on the retrieval component, which is not that complicated and can be easily done. We will download the data set from Hugging Face, which can be accessed from their website. The data set was created by scraping the Mistral 7B arXiv paper. Although it is not a large data set, it will be useful for our example. However, the data set is not very clean, so we will use the chunks of text I extracted from it to create our knowledge base. We will use Qdrant to build the knowledge base. To do this, we need to obtain an API key by visiting cloud.qdrant.io. If you don't have an account, you must create one, or log in if you already have one. Here, I'm using the free tier. You can create a collection from the dashboard, but we will create it in the notebook instead. We need to go to Data Access Control, where I will copy my API key and note my cluster URL. Now I will add this information to my .env file and paste my API key.

We will create our collection using an embedding model called text-embedding-3-small, provided by OpenAI. This model produces numerical representations of meaning, similar to how humans interpret text. The size of these vectors is known as the embedding dimension, which determines the size of the vectors that the model outputs and, consequently, the dimensionality of the collection that will store those vectors. We will then create embeddings for what we call documents. These documents are equivalent to the context mentioned earlier, essentially chunks of text that will serve as part of our knowledge base. We have two of these documents, or contexts, here, and when we embed them, we receive two embeddings, each with a 1,536-dimensional output.
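A minimal sketch of that embedding step (assuming the langchain-openai package and an OPENAI_API_KEY environment variable; the sample chunks are illustrative):

```python
from langchain_openai import OpenAIEmbeddings

# OpenAI's text-embedding-3-small outputs 1,536-dimensional vectors by default.
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    "Mistral 7B is a 7-billion-parameter language model.",
    "It uses grouped-query attention and sliding-window attention.",
]

vectors = embed_model.embed_documents(docs)
print(len(vectors), len(vectors[0]))  # 2 vectors, each with 1,536 dimensions
```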
That's how we will perform the embedding. We will now iterate over our complete data set, the Mistral 7B arXiv paper, and perform the embedding process, which will let us capture the essential information in each of the records. To begin with, we convert the data set into a pandas DataFrame that contains several columns. For this example, we only need two columns, chunk and source. After isolating the relevant columns, we use the DataFrameLoader module to load the records as LangChain documents, which simplifies the entire process. Now we add all of those into Qdrant. Here we have a specific import from LangChain for vector stores, and we only need to import Qdrant. To send the chunks to my collection, I use the from_documents() function, specifying my documents (that is, the chunks), the embedding model that will be used to transform the chunks into vectors, the URL of my cluster, and the name of my collection.

We are almost done with the project. All that's left is to finish the pipeline and connect the knowledge base to our LLM. We can test the connection with Qdrant and run a query to see the information that will be retrieved. Let's run it. Beautiful, it works. We asked, what is so special about Mistral 7B? We saw earlier why the model couldn't answer this, but now it's different. In this case, we pass the query to our vector database, or vector store, and we return the top 3 most semantically similar records. As you can see, we are getting these chunks from the Mistral paper. It's good, but a little tricky to read; we currently have three raw documents that are hard to read, so we will let our LLM handle them. I will set up a custom prompt function to retrieve the top 3 most relevant items from the vector store and use them to create our source knowledge. We insert this information into an augmented prompt and then produce the output. The prompt says: using the contexts below, answer the query. We have the context and the query, what is so special about Mistral 7B? This is now the modified query that we can input into our chatbot.

So let's try it. We will create a new human message as before, append it to our chat history, and feed that to the model. Let's run it; this may take a little time due to OpenAI's inference speed. The answer: Mistral 7B is a 7-billion-parameter language model that stands out for its superior performance and efficiency. Here are some highlights that make Mistral 7B special: it outperforms other models, its attention mechanism, model fine-tuning, ease of deployment, and so on. I hope the impact of RAG on this system, and how to implement it, is now quite clear. However, this implementation of RAG is naive, in that it assumes every query requires retrieval. This is what I call standard RAG, and it is not always necessary. For example, if a user greets the chatbot by saying, hi, how are you?, the chatbot does not need to consult an external knowledge base to answer. Also, the inference time can be slow, depending on the model and API usage. That is one of the downsides of using this approach, but there are many benefits: we can achieve better retrieval performance, obtain a wealth of information to help us answer questions more accurately, and provide citations for our sources. This method is also faster and more efficient than alternative approaches like using agents. Additionally, we can limit the number of tokens we feed into the model, making it even more convenient.
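Pulling those retrieval steps together, here is a rough sketch under a few assumptions: the dataset name is a placeholder, the columns are chunk and source as described above, and QDRANT_URL, QDRANT_API_KEY, and OPENAI_API_KEY are set in the environment. The notebook's exact code may differ.

```python
import os
from datasets import load_dataset
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

# Load the scraped Mistral 7B paper chunks (dataset name is a placeholder).
data = load_dataset("your-org/mistral-7b-arxiv-chunks", split="train")
df = data.to_pandas()[["chunk", "source"]]

# Turn each row into a LangChain Document, using the `chunk` column as content.
docs = DataFrameLoader(df, page_content_column="chunk").load()

embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed the chunks and upload them to a Qdrant collection in one call.
qdrant = Qdrant.from_documents(
    documents=docs,
    embedding=embed_model,
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    collection_name="mistral-7b-rag",
)

def augment_prompt(query: str, k: int = 3) -> str:
    """Retrieve the top-k most similar chunks and wrap them around the query."""
    results = qdrant.similarity_search(query, k=k)
    source_knowledge = "\n".join(doc.page_content for doc in results)
    return f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

print(augment_prompt("What is so special about Mistral 7B?"))
```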
OK, before someone screams in the comments about using OpenAI, saying that I could do this with another LLM, or about the inference time, which can sometimes be critical, I prepared another example that tries an alternative with fast inference, capable of generating almost 500 tokens per second. Let's try Groq! For our example, we only need the API key, which is currently free, and one package to install. First, we need to run pip install langchain-groq. I will also generate the API key and save it in my .env file. Before working with the API, let's quickly look at the playground. You can provide your system message here, where I will copy our prompt message and context, and then the user input, what is so special about Mistral 7B. You have two different options for the model; for our experiments, we will be using Mixtral 8x7B. Let's try it. It was very, very fast. Below, we get some metrics, including the inference time; it's just astonishing.

OK, now let's go back to the notebook. In this example, all I need to do is create a new chat instance that uses Groq. I will reuse the last human message we made, the one built with our custom prompt function for RAG. Let's run it. It was much faster than GPT, and changing the model and reusing our code was very easy. If you want to see more videos about Groq, leave a message in the comments. I hope this introduction to RAG and chatbots has helped you understand how these components work together. Thank you so much for watching, and I will see you in the next one.
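For reference, a minimal sketch of that Groq swap (assuming the langchain-groq package, a GROQ_API_KEY environment variable, and a model ID that may have changed since recording):

```python
from langchain_groq import ChatGroq
from langchain_core.messages import SystemMessage, HumanMessage

# ChatGroq reads GROQ_API_KEY from the environment; the model ID is an assumption.
chat_groq = ChatGroq(model_name="mixtral-8x7b-32768")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    # In the notebook, this would be the RAG-augmented prompt built earlier.
    HumanMessage(content="What is so special about Mistral 7B?"),
]

print(chat_groq.invoke(messages).content)
```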
Info
Channel: Qdrant
Views: 10,616
Keywords: vector search, vector search engine, langchain, openai, gpt-4, chatgpt, retrieval augmented generation tutorial, retrieval augmented generation, rag, rag tutorial, chatbot rag, chatbot tutorial, mistral, mistral 7b, Groq, qdrant, groq ai, groq api, groq speed, groq llm
Id: O60-KuZZeQA
Length: 20min 13sec (1213 seconds)
Published: Mon Mar 11 2024