Retrieval Augmented Generation,
RAG. Let's see how to
create a chatbot from scratch using this technique in this video. I will assume that
you have no prior knowledge of building a chatbot, but by the end of this session, you will be able to create one using OpenAI's GPT models and the LangChain framework. This chatbot will be
able to answer questions related to any internal
documentation of an organization. So to get started, let me take you through what we are
actually going to be building at a very high level. We will create a RAG pipeline that retrieves
information and generates output. Typically, we input a
query into an LLM to get an output. This works well for general knowledge, or knowledge the LLM has seen before, but not always. The problem is that many LLMs have not seen a lot of the information we would like them to know about. For example, if I ask
what makes Mistral 7B special, most LLMs wouldn't have the answer at the time of recording. Mistral 7B is a recent language model that wasn't included in most LLMs' training data. So many LLMs are
not familiar with Mistral 7B. They usually
provide incorrect information or make something up when asked about it. To avoid this, a
Retrieval Augmented Generation pipeline is used. Today, we will see
how to create this pipeline. It is disappointing when LLMs that are supposed to be
knowledgeable about programming cannot answer general
knowledge questions correctly. This is because an LLM's knowledge is limited to what
it learned during training and has no access to the outside world. As a result, the response
we received was a hallucination. So let's get started
by building a chatbot that has a specific limitation. We will explore how to
create it and see the limitations in action. To do this, we will follow a notebook I will link to in the video description. All right, we need
to use a few libraries, such as LangChain, OpenAI, Hugging Face Datasets, the Qdrant client, and tiktoken. In this case, we will be relying heavily on the LangChain library
to use GPT 3.5 Turbo or GPT 4 models from OpenAI. Basically, we
import the ChatOpenAI object, which is essentially just a
chat interface or an abstraction in LangChain. LangChain is a useful
tool for creating more complex AI systems. While you can also use the OpenAI API directly to access various models, LangChain provides additional components that can be easily
integrated into your system. For example, you can
add a RAG pipeline efficiently. To initialize our chat model, we format some objects to match the structure typical of OpenAI chat models. The chat log is composed of a system prompt at the top, which gives instructions to the model, followed by the user's query, then the chatbot's (the AI assistant's) reply, then the user again, and so on; this pattern repeats throughout the log. When using the OpenAI API, the chat log is represented as a list of dictionaries. Each dictionary contains a
role and the corresponding text content. So the LangChain
format is slightly different but follows a similar structure. It has a system message, a
human message, and an AI message. The system role
corresponds to the system message. The user role
corresponds to the human message. And the assistant role
corresponds to the AI message. The content for each
message is provided in the same way, making it easy to access and use. This is the LangChain version of what I have just demonstrated. Let's initialize it, pass it all to our OpenAI chat object, run it, and observe the generated response.
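Here is a minimal sketch of what this setup might look like in code. It is not the exact notebook code: import paths vary between LangChain versions, and the prompt wording is my assumption. It assumes OPENAI_API_KEY is set in your environment.

```python
# Rough sketch of the chat setup described above (not the exact notebook code).
# pip install langchain langchain-openai
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

chat = ChatOpenAI(model="gpt-3.5-turbo")

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Can you explain what machine learning is?"),
]

response = chat.invoke(messages)  # the model's reply comes back as an AIMessage
print(response.content)
```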
I receive a response about machine learning, which is what I asked about. I don't know if the response is fully accurate; it could be assuming things. However, I trust
that it does have the answer. So let's print it out
in a more readable format. This response is an AI message which we can add to our chat
log to continue the conversation. I will append it to our messages list. Next, I will create a new
prompt to ask a different question. I won't mention
machine learning this time. Instead, I will ask about the difference between supervised and unsupervised, without mentioning the word learning. Our chat model relies on the history of the previous conversation to guide its response. This is why we add the model's reply to the messages, followed by the new prompt, before sending them to GPT-3.5.
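Continuing the sketch above, this history step might look like the following (the follow-up wording is my assumption):

```python
# Append the model's reply, then ask a follow-up that relies on the history.
messages.append(response)  # the AIMessage returned earlier

messages.append(HumanMessage(
    content="What is the difference between supervised and unsupervised?"
))

new_response = chat.invoke(messages)
print(new_response.content)
```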
You will notice that the model mentions the difference between supervised and unsupervised learning, showing that the conversational history is working. We now have a chatbot,
and that was easy to put together. There is nothing
complicated going on there. Let's delve deeper into hallucinations and why they occur. LLMs hallucinate for various reasons, one of which is that they depend only on the knowledge learned during their training. This means that an LLM only knows what is in its training data. It doesn't learn by exploring the world. If something is not in its
training data, it won't know it. Even if it is, it might not
be accurate, leading to ambiguity. The purpose of an LLM
is to compress the information in the training data into
an internal model of the world as it existed in that data set. However, this approach has limitations because it only has
access to the training data. This is where RAG
comes in to handle the problem. The middle component here can be a RAG pipeline, a Google search, or a link to an external source like a SQL database. It only connects to a
subset of the external world, not the entire world. We want to enable a feature
that our current LLM doesn't have. Our LLM is trained on a specific data set and only understands
things based on that data. This type of knowledge
is called parametric knowledge because it's stored
within the model parameters. These parameters are
only ever changed during training and are frozen afterward. Our LLM only uses parametric knowledge, which means it can't learn new things or adapt to new situations. But with RAG, we
can add a memory component that can be modified and updated, giving us a long-term memory. This will allow our
system to learn and adapt to new situations
beyond what it has already learned. In the case of RAG,
an external knowledge base, or external memory, is used in the form of a vector database. Using this database as input to your LLM has the advantage of letting you add, delete, and manage the memory and knowledge of your LLM. This is really cool because it's almost like managing or updating the information in a person's brain. It may sound a bit dystopian, but it is a good analogy for what we are doing with LLMs. And yeah, we are using a
method called source knowledge. This differs from parametric knowledge because the knowledge is not stored in the model's parameters. Instead, the source
knowledge refers to anything we insert into the model via the prompt, and information
that goes through the prompt is considered source knowledge. When adding the
source knowledge to our LLM, it will look like this. At the top of our prompt, we typically have some instructions. At the bottom, we have the
user's query, which is a question. And in between, we
insert the external information or source knowledge. We can call this
information a context, document, or many other things. But here, we will
refer to it as a context. Before building
the pipeline for this task, let's manually
insert a message into our model and observe its
impact on the performance. We will add the
message what makes Mistral 7B special and analyze the model output. Again, the model is not
familiar with this specific reference to Mistral 7B. Let's try another one. I will ask about
LLMChain in LangChain. This is another example where the model slightly reinterprets the question because it couldn't find any specific information on LLMChain in LangChain. However, the structure of
the question remains the same. All right, let's try
using the source knowledge approach. I got some information about LLMChain from the LangChain website. We can see the information is quite long. I extracted some relevant
details about LangChain, chains, and LLMChain. The plan is to
concatenate all this information to form our source knowledge. Then we can use the same prompt structure as before and see what happens. We have the prompt instructions and the context.
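A sketch of this manual step, reusing the chat and messages objects from earlier. The entries in llmchain_information are placeholders standing in for the text copied from the LangChain website, and the prompt wording is an assumption:

```python
# Placeholder excerpts -- substitute the actual text copied from the LangChain docs.
llmchain_information = [
    "Chains allow us to combine multiple components together into a single application...",
    "LLMChain is the most common type of chain...",
]
source_knowledge = "\n".join(llmchain_information)

query = "Can you tell me about the LLMChain in LangChain?"

augmented_prompt = (
    "Using the contexts below, answer the query.\n\n"
    f"Contexts:\n{source_knowledge}\n\n"
    f"Query: {query}"
)

messages.append(HumanMessage(content=augmented_prompt))
print(chat.invoke(messages).content)
```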
Now I will feed the query into our chatbot and see what happens. The response is, in
the context of LangChain, LLMChain is described
as the most common type of chain within the framework. It's almost identical to
the information on the website, but in a more readable format and adapted to our specific question. There is a lot of information, and as far as I know, it is all accurate. We received a very
good answer by adding some text to the prompt. But doing this
every time is not practical. Our goal is to find a
way to automate this process and apply it to many documents, which is where RAG comes in. We need to focus on
the retrieval component, which is not that
complicated and can be easily done. We will download the
data set from Hugging Face, which can be accessed from their website. The data set was created by scraping the Mistral 7B arXiv paper. Although it is not a large data set, it will be useful for our example. However, the data set is not very clean, so we will use the chunks of text I extracted from it to create our knowledge base.
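The loading step might look roughly like this; the dataset ID below is a placeholder, not necessarily the one used in the video:

```python
# Hypothetical dataset ID -- replace with the one linked in the notebook.
from datasets import load_dataset

dataset = load_dataset("your-username/mistral-7b-arxiv-chunks", split="train")
print(dataset)                    # inspect the available columns
print(dataset[0]["chunk"][:200])  # assumes a "chunk" column, as described below
```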
We will use Qdrant to build our knowledge base. To do this, we need to obtain an API key by visiting cloud.qdrant.io. If you don't have an account, you must create one, or log in if you already have one. Here, I'm using the free tier. You could create a collection from this dashboard, but we will create it in the notebook instead. For now, we need to
go to Data Access Control, copy my API key, and note my cluster URL. I will then add this information to my .env file.
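A sketch of loading those credentials and checking the connection; the environment variable names are my assumption:

```python
# Assumes a .env file containing QDRANT_URL and QDRANT_API_KEY (names are illustrative).
import os
from dotenv import load_dotenv
from qdrant_client import QdrantClient

load_dotenv()

qdrant_client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
)
print(qdrant_client.get_collections())  # quick connectivity check
```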
We will create our collection using an embedding model called text-embedding-3-small, provided by OpenAI. This model produces
numerical representations of meaning similar to how humans interpret text. The size of these
vectors is known as the embedding dimension, which
determines the size of the vectors that the model
outputs and, consequently, the size of the
collection that will store those vectors. We will then create
embeddings for what we call documents. These documents are
equivalent to the context mentioned earlier,
essentially chunks of text that will serve as
part of our knowledge base. We have two of these documents, or contexts, here. When we embed them, we will receive two embeddings, each with a 1,536-dimensional output. That's how we will perform the embedding.
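As a sketch, the embedding step with LangChain's OpenAI embeddings wrapper might look like this (the two example strings are placeholders):

```python
# Embed two small example documents and check the vector dimensions.
from langchain_openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    "this is one chunk of text for the knowledge base",
    "and here is another chunk of text",
]

embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # expect: 2 vectors, 1536 dimensions each
```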
We will now iterate over our complete data set, the Mistral 7B arXiv paper, and perform the embedding process. This will capture the essential information from each of the records. To begin with, we convert the data set into a pandas DataFrame
that contains several columns. For this example, we
will only need two columns, chunk and source. After isolating the relevant columns, we will use the DataFrameLoader module to load the records
as LangChain documents. This will simplify the entire process. Now we add all of those into Qdrant. Here we have a specific import from LangChain for vector stores, and we only need to import Qdrant. To send the chunks to my collection, I use the from_documents() function, specifying my documents (that is, the chunks), the embedding model that will be used to transform the chunks into vectors, the URL of my cluster, and the name of my collection.
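Putting those two steps together might look like the sketch below, reusing embed_model and the Qdrant credentials from the earlier snippets; the collection name is illustrative:

```python
# Load the two relevant columns as LangChain documents, then push them to Qdrant.
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.vectorstores import Qdrant

df = dataset.to_pandas()[["chunk", "source"]]

# "chunk" becomes the document text; "source" is kept as metadata.
chunks = DataFrameLoader(df, page_content_column="chunk").load()

vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embed_model,
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    collection_name="mistral-7b-rag",  # illustrative name
)
```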
We are almost done with the project. All that's left is to finish the pipeline and connect the knowledge base to our LLM. We can test the connection with Qdrant and run a query to see the information that will be retrieved. Let's run it. Beautiful, it works. So we asked, what is so special about Mistral 7B? We saw earlier why the model couldn't answer this, but now it's different. In this case, we pass
it into our vector database, or vector store, and return the top 3 most semantically similar records.
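A quick sketch of that retrieval test, using the vectorstore object from the previous snippet:

```python
# Retrieve the 3 most similar chunks for a test query.
query = "What is so special about Mistral 7B?"

results = vectorstore.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:200], "...\n")
```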
As you can see, we are getting these chunks from the Mistral paper. It's good, but a little tricky to read; we currently have 3 documents that are hard to read. To avoid this, we will let our LLM handle them. I will set up a custom prompt function to retrieve the top
3 most relevant items from the vector store and use them to create our source knowledge, which we will then insert into an augmented prompt.
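A sketch of what that custom prompt function might look like; the exact prompt wording is an assumption:

```python
# Build an augmented prompt from the top-k retrieved chunks plus the user query.
def augment_prompt(query: str, k: int = 3) -> str:
    results = vectorstore.similarity_search(query, k=k)
    source_knowledge = "\n".join(doc.page_content for doc in results)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )

prompt = HumanMessage(content=augment_prompt("What is so special about Mistral 7B?"))
messages.append(prompt)
print(chat.invoke(messages).content)
```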
The augmented prompt asks the model to answer the question using the context below, followed by the context and the query: what is so special about Mistral 7B? This is now the modified query that we can input into our chatbot. So let's try it. We will create a new human message from it, append it to our chat history, and feed that to the model. Let's run it. This may take a little time
due to OpenAI's inference speed. So Mistral 7B is a 7-billion
parameter language model that stands out for its
superior performance and efficiency. Here are some
highlights that make Mistral 7B special: outperforms other models, attention mechanism, model fine-tuning, easy deployment, yada, yada, yada. I hope the impact of RAG on this system, and how to implement it, is now quite clear. However, this implementation of RAG is somewhat naive, since it assumes every query requires retrieval. This is what I call standard RAG, and it is not always necessary. For example, if a user
greets the chatbot by saying, hi, how are you? The chatbot does not necessarily need to refer to an
external knowledge base to answer. Also, the inference time can be slow, depending on the model and API usage. So that is one of the downsides of using this approach, but there are many benefits. We can achieve
better retrieval performance, obtain a wealth of
information to help us answer questions more accurately, and provide citations for our sources. This method is faster and more efficient than alternative approaches like using agents. Additionally, we can limit the number of tokens we pass in, making it even more convenient. OK, before someone
screams in the comments about using OpenAI
and that I could do something with another LLM, or even
about the inference time, which can sometimes be
critical, I prepared another example to try an
alternative with fast inference, capable of generating
almost 500 tokens per second. Let's try Groq! For our example, we will only need the API key, which is currently free, and to install one package. So first, we need to
do pip install langchain-groq. Also, I will generate the
API key and save it in my .env file. Before working
with the API, let's quickly look at the playground. You can provide your system message here, where I will copy our
prompt message and context. Then the user input, what
is so special about Mistral 7B. Here, you'll have two
different options for the model. For our experiments, we will be using Mixtral 8x7B. Let's try it. It was very, very fast. Below, we have some
metrics on inference time, and it's just shocking. OK, now let's go back to the notebook. In this example, all I need to do is create a new chat instance that uses Groq. I then reuse the last human message we made with the custom prompt function, our RAG-augmented prompt, run it, and it is much faster than GPT.
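A sketch of the Groq swap; the model name is illustrative (pick one available in your Groq account), and it assumes GROQ_API_KEY is set in the .env file loaded earlier:

```python
# pip install langchain-groq
from langchain_groq import ChatGroq

# Reuse the same RAG-augmented messages list; only the chat model changes.
groq_chat = ChatGroq(model_name="mixtral-8x7b-32768")

fast_answer = groq_chat.invoke(messages)
print(fast_answer.content)
```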
Also, changing the model and reusing our code was very easy. If you want to see more videos about Groq, leave a message in the comments. I hope this introduction to RAG and chatbots has helped you understand
how these components work together. Thank you so much for watching, and I will see you in the next one.