In this video, I'll show you how
to write your own app to process, summarize, and query PDF documents using large
language models that run privately, locally, on your own machine for free. This is Vincent
Codes Finance, a channel about coding for finance research. If you're interested in that topic,
please subscribe. In my previous video, in which I showed how to install a ChatGPT clone locally on your machine, there was a feature in Open WebUI that let you upload documents and query them. However, that feature only looks at part of the document at a time; it doesn't consider the whole document. Many of you have asked me in the comments if it was possible to summarize
documents using Open WebUI. I haven't found a way, so I decided instead to show you how to build
your own app to do it. So, this is what we'll be doing today. The app I'm building will be
fully running locally on my machine. For that, I'll be using Ollama to serve the models, and
then to build the app, I will be using LangChain, which is a library that provides a lot of
convenience tools to access and deal with large language models programmatically. I'm doing
everything in Python today, but this library is also available in JavaScript if you're more
comfortable with that. Finally, I'll be using Streamlit, which is a Python module that lets
you build interactive dashboards very easily. All the code for this tutorial is available
on GitHub, so look at the video description to get the link to the repository and to my blog,
which provides written instructions for that. The app that I'm building today is based on
examples from the LangChain documentation. This is a diagram from that documentation that explains
the two methods that we'll be looking at today. First, we'll use the stuffing method to produce
a summary, and then I'll use the map-reduce method not to produce a summary, although you
could; that's what they do in the example, but I'll use it to query the document. So,
the way this works is that you first have to load the documents; the diagram calls them multiple documents. In our case, when we load the PDF, each page will be its own document, so we'll have a list of documents, which are the pages from the PDF. We'll extract the text and we'll run our
query on that text using the stuffing method. What we'll do is we'll just put all of that text
within the context of our query. So, we'll just put all of that together into one big query
which will submit to our large language model, and then we'll get the result as our summary.
The map-reduce process works in multiple steps. We'll take each page and run a query on that page to find the relevant information in it and summarize it. Then we'll run a final query where we take the result from each page and aggregate them into our final result. First, we'll use Ollama to serve our models. In
order to install Ollama, just go on their website and click download. If you're using Homebrew on
Mac, you can just do "brew install ollama," and then you'll have to download a model in order
to work with that. So, you can have a look at their model library. I'll be using Mixtral for
my examples here. In order to install that model, all you have to do is run the following command:
"ollama pull mixtral," and that will download the model. It is quite big, so it's going to
take a few minutes for you to download it. So, for this project, you'll need to install the
following dependencies. Obviously, I'm using langchain, but in order to use all the features that I want from LangChain, I'll also need pypdf, which LangChain uses to load PDFs. Even though we won't be using OpenAI for this project, we'll need the langchain_openai module and the openai module because Ollama provides an OpenAI-compatible API, which we'll be using. That way, if you ever want to run your app against OpenAI instead of locally, there are only a few settings to tweak, and everything will work. We also need tiktoken to help us count the tokens in our queries. We'll use
python-dotenv to load our .env file, and we'll be using streamlit in order to build
our dashboard. The last dependency is rich. I'm not using it in this tutorial, but my repository also includes a CLI tool, a command-line tool, that does the same things, and that one uses Rich for its output. As a sample document, I am using one of
my research papers. The link is in the description if you ever want to read it. It is
open access, so you can download it for free, and this is what we'll be using, and we'll be
trying to summarize and query that document. So, first, we'll need to read our documents,
and for that, we'll use LangChain, which provides document loaders that help us load different types of files. It handles local files such as the PDFs we're using today, CSV files, and other formats, and it also has integrations for loading data from third-party providers. In our case, we'll be
using the PDF document loader, which uses pypdf, so that's why we also need that as a dependency
for our project. So, we'll be following roughly this example here. Note that it also supports
extracting images from PDFs, which I won't be doing in this tutorial. So, first, we'll read
the file. We'll import the PyPDF loader from LangChain, provide the file path, create our PDF loader, and then call load, which creates a list of documents, as sketched below. So, we
see we have one document for each page in my PDF.
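In code, that loading step looks roughly like this (a minimal sketch; the import path may vary slightly with your LangChain version, and the file name is just a placeholder):

```python
from langchain_community.document_loaders import PyPDFLoader

file_path = "paper.pdf"  # placeholder: path to the PDF you want to load

loader = PyPDFLoader(file_path)
docs = loader.load()  # one Document per page of the PDF

print(len(docs))  # number of pages loaded
print(docs[0].page_content[:200])  # start of the first page's text
```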
Next, we'll first summarize the document using the stuffing method. So, we'll just read the document, put it all in a query,
and ask the large language model to summarize our document. This is inspired by the example in
LangChain. I tweaked it a bit so that it provides slightly better results, in my opinion. But what
we have to do first is to create our prompt. So, you can play with that to see whether you get any
better results, but what I have as a prompt is, "Write a long summary of the following document, only including content that is part of the document, and do not include your own opinion or analysis." And then I provide the document. It is a prompt template from LangChain, so I can use curly braces to mark variables that LangChain will fill in at runtime; this is where the document goes. I finish with "Summary:" so that the large language model completes the prompt. So, I'll create my prompt from the template.
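The prompt construction looks roughly like this (a sketch; the wording follows the description above and can be tweaked, and the {document} variable name just has to match what we give the chain later):

```python
from langchain.prompts import PromptTemplate

# {document} is a template variable that LangChain fills in at runtime
summary_template = """Write a long summary of the following document.
Only include information that is part of the document.
Do not include your own opinion or analysis.

Document:
"{document}"
Summary:"""

summary_prompt = PromptTemplate.from_template(summary_template)
```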
Next, I'll define my large language model chain. We will be using Ollama, but through its OpenAI-compatible API, so that if you ever want to use GPT-4 in the future,
you can just change some parameters here. When creating the model, we'll set the temperature, which controls how much randomness there is in the output. If you set it to zero, the output is almost deterministic: you should always get the same output for the same prompt. Raising the temperature adds randomness, so you can play with that to see how it affects the results. Then we set the model name, so I'll
be using Mixtral, as I said. We have to provide an API key, even if it is local. I'm running
Ollama locally, so I'll just pass it anything; it doesn't really matter. And then I'll change
the base URL. Instead of using the OpenAI URL, I'm using my local machine. And now I'm creating
my large language model chain, which is a chain type from LangChain. And now what I'll do is I'll
finally create a full chain. I'll use a Stuff Documents chain from LangChain, which takes two things: an LLM chain, so we'll pass it the one we created previously, and the name of the variable that represents the document in our prompt. In our case, if we go back to our prompt, that variable is "document." This is what I need to provide in order to create my chain. And now that I've got my chain created, I
can actually invoke that chain. So, in order to invoke a Stuff Documents chain, I have to provide
the documents that I want the chain to run on. If I do that, I get the result and can print it so that we can see what we get. The whole setup is sketched below.
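Putting those pieces together looks roughly like this (a sketch using LangChain's classic LLMChain and StuffDocumentsChain classes; it assumes Ollama is running locally and exposing its OpenAI-compatible endpoint at http://localhost:11434/v1, and it reuses summary_prompt and docs from the snippets above):

```python
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain_openai import ChatOpenAI

# Ollama exposes an OpenAI-compatible API, so we point the OpenAI chat client
# at the local server. The API key is required but not checked by Ollama.
llm = ChatOpenAI(
    temperature=0.1,  # low randomness; 0 is close to deterministic
    model="mixtral",  # the model pulled with "ollama pull mixtral"
    api_key="not-needed",
    base_url="http://localhost:11434/v1",
)

llm_chain = LLMChain(llm=llm, prompt=summary_prompt)

# "document" must match the template variable used in the prompt above
stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_variable_name="document",
)

result = stuff_chain.invoke({"input_documents": docs})
print(result["output_text"])
```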
It took a while, obviously. The first time a model is called, Ollama has to load it into memory, so there's an overhead on the first query,
but it took a bit over 1 minute on my computer. I do have the output here. To look at it in a nicer way, I'll format it with textwrap, which is a built-in Python library. We then see there's a summary of the paper, but it
doesn't really have anything to do with the paper, right? It says the paper examines whether blah blah blah, but the paper doesn't actually talk about the PEAD, the post-earnings announcement drift, or anything like that. Where is it getting this information? Well, it's actually getting it
from the references, if we look at the end of our document, we do have a list of references that
do discuss these things, so that's where it's getting that information. Obviously, I don't
want that; it's not part of the article. What I can do to avoid this is simply select the pages that I want to use in my summary. For example, here, I can just drop the last two pages, which contain mostly references, by slicing my list of documents so that it stops before them, as sketched below.
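A minimal sketch of that page selection; the exact slice depends on where the references start in your own document, and here I simply assume dropping the last two pages is enough:

```python
# Keep everything except the last two pages, which are assumed to hold the references
docs_no_refs = docs[:-2]

result = stuff_chain.invoke({"input_documents": docs_no_refs})
print(result["output_text"])
```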
If I reinvoke my chain on these documents, excluding the reference pages, I get a summary that is much better. It's actually
really discussing what the paper is about, and it's completely different from the abstract,
so it's also quite interesting. By the way, if you've been enjoying the video so far, please like it so that others also get to discover it. Okay, so that was the first thing I wanted to show
you. This is how we can summarize a document by stuffing its full content into one query. It works; it summarizes, but maybe it doesn't provide the information I want to extract. Sometimes I want something other than a summary. Another option is to query the document, using a prompt that changes from query to query, and for that, we'll use the map-reduce approach. A common way to query a document
would be to use a RAG approach. That involves splitting a document into chunks, generating embeddings, which are vector representations of the text in those chunks, and storing them in a database that you can then query based on those embeddings. You would embed the user's query, do a semantic search over those chunks, get the most relevant ones, and then use those to build the answer. This is not what we'll be doing today. If you're interested in that approach
and you'd like me to make a video on that, I'll be happy to do so. Just let me know in the comments.
But here, what I'm going to do is I'm going to take a simpler but much more processing-intensive
approach. It's kind of a brute-force approach. So, I'll use a map-reduce to query the document.
The idea is that I'll be looking at every page in the document, asking the large language model whether that page is relevant to the query and, if so, to extract the information relevant to that query. After doing that on every page, I'll put the results back together and produce the final answer from them. So, I'm mapping, applying the first prompt to every page, and then I'm reducing, taking those outputs and aggregating them to produce a final answer. For this, I'll just use a sample query: "What's the data used in this
analysis?" For that, I'll need two parts. The first part is the map part, whose prompt will be applied to every document. It starts with "The following is a set of documents." It's actually not a set of documents; it's just a page, but it is still passed as a set of documents because we'll be using the Stuff Documents chain to produce that step, so, to be consistent, we'll keep that wording. The idea is that we ask the model, based on this list of documents, to identify the information that is most relevant to the following query, where we insert the user query, and if the document is not relevant, to write "Not relevant." Then I ask for a helpful answer. Using that, I'll create my prompt and partially fill it: I'll pass in the user query, and then I'll pass the prompt, now filled with the user query, to an LLM chain, which I'll call the map chain, as sketched below.
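Roughly, the map side looks like this (a sketch; the prompt wording follows the description above and can be tweaked, user_query is the sample question, and llm is the model object created earlier):

```python
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

user_query = "What is the data used in this analysis?"

# Prompt applied to each page during the map step
map_template = """The following is a set of documents:
{docs}
Based on this list of documents, please identify the information that is most
relevant to the following query:
{user_query}
If the documents are not relevant to the query, please write "Not relevant".
Helpful answer:"""

map_prompt = PromptTemplate.from_template(map_template)
# Fill in the user query now; {docs} is filled in by the chain at runtime
map_prompt = map_prompt.partial(user_query=user_query)

map_chain = LLMChain(llm=llm, prompt=map_prompt)
```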
That chain is the one used for mapping. After that, we'll need a reduce part. In order to
reduce, what we'll do is we'll use a different prompt. We'll use this one: "The following is a
set of partial answers to a user query," followed by the documents, "Take these and distill them into a final, consolidated answer to the following query," followed by the user query, and it ends with "Complete answer:". Then I'll again create my prompt from the template and partially fill it, setting the user query directly into the prompt so it's already integrated. And
then finally, I'll build my full chain. For that, I'll use a MapReduce Documents chain and a Reduce Documents chain. First, I need to create my LLM chain for the reduce part, still using the same LLM that we defined earlier, now with my reduce prompt. Here we'll build a few chains. The first one is
going to be a Stuff Documents chain that combines documents using our reduce chain. That's what's
going to reduce everything at the end. That chain is passed to a Reduce Documents chain, which takes all of the mapped outputs and reduces them together using the combine documents chain. As it combines the different outputs, it also makes sure not to exceed the maximum number of tokens: if that limit would be reached, it first collapses the intermediate results and uses those for the next reduce step. Finally, we combine all of that in a MapReduce Documents chain: we set the map LLM chain, the Reduce Documents chain we just defined, the document variable name, and we set it to not return the intermediate steps, since we're only interested in the final result. The whole construction is sketched below.
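Here is a sketch of the reduce side and the combined chain, following the classic LangChain map-reduce example; token_max is an illustrative value, and it reuses llm, map_chain, user_query, and docs_no_refs from the snippets above:

```python
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

# Prompt used to merge the per-page answers into one final answer
reduce_template = """The following is a set of partial answers to a user query:
{docs}
Take these and distill them into a final, consolidated answer to the following query:
{user_query}
Complete answer:"""

reduce_prompt = PromptTemplate.from_template(reduce_template).partial(user_query=user_query)
reduce_llm_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Stuffs the partial answers into the reduce prompt
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_llm_chain,
    document_variable_name="docs",
)

# Combines the mapped outputs, collapsing them in batches if they exceed token_max
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=4000,
)

# Runs the map prompt on every page, then reduces the answers into one
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

result = map_reduce_chain.invoke({"input_documents": docs_no_refs})
print(result["output_text"])
```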
As you can imagine, this generates a lot more queries than the stuffing approach. Okay, so it's completed. It took a bit over 5
minutes. That's a bit longer than in my earlier trials. I think it's mostly because I'm recording at the same time, which is maxing out my computer and making things a bit slower than before. Still, it takes a couple of minutes to run all of the steps because we are doing a lot of queries. But if we look at the answer
it provided, it is fairly decent. It does provide a good summary of the data that's being
used in that project. Okay, so that's all good, but if I want to summarize documents often, this is not a very convenient interface to use. I did build a CLI interface for myself; it is in the GitHub repository, so you can have a look at it and download it. But what I think is even nicer is to have a proper UI, so the next step will be to package all of this in a nice UI that I can reuse more often. We'll build that using Streamlit, which is a
Python module that lets you build interactive apps that run in your browser. You can also deploy
them online if you want, but given that here we're running local models, that wouldn't quite
work. But this is what we'll use in order to build our UI. So, if I look at the structure of my
project, I have a few files here that I added. I have my pyproject.toml that defines all the dependencies, and I've repackaged most of my code in a module that I call documents_llm. It has a few things in it. There's a helper to load a PDF and another to load a text file, but here we'll just be using the PDF one. It's a simple function: I give it a path or a file, plus the start and end pages I want, and it wraps the PyPDF loader that we've just seen. Then I have a summarize.py file that packages our summarize document function. It takes a list of documents as input along with the model parameters, creates the summarization chain, invokes it, and returns the output text. This is just what we've done before; roughly, it looks like the sketch below.
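A rough sketch of what that summarize.py function might look like (the function name and parameters are my illustration of the description above, not necessarily the exact code in the repository):

```python
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI


def summarize_document(
    docs: list[Document],
    model_name: str,
    temperature: float,
    base_url: str = "http://localhost:11434/v1",
    api_key: str = "not-needed",
) -> str:
    """Summarize a list of documents with the stuffing method."""
    llm = ChatOpenAI(
        temperature=temperature, model=model_name, api_key=api_key, base_url=base_url
    )
    prompt = PromptTemplate.from_template(
        "Write a long summary of the following document. "
        "Only include information that is part of the document.\n"
        "Document:\n{document}\nSummary:"
    )
    chain = StuffDocumentsChain(
        llm_chain=LLMChain(llm=llm, prompt=prompt),
        document_variable_name="document",
    )
    return chain.invoke({"input_documents": docs})["output_text"]
```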
In query.py, I did the same thing with our map-reduce chain. It takes documents, a user query, and the model parameters, builds the model and the map-reduce chain (I abstracted that part away into a helper), invokes the chain, and returns the output text. The helper that builds the map-reduce chain is just what we've done before; it packages all the steps that assemble our chain. And then
I also have some Streamlit helper functions that we'll come back to in a moment. My Streamlit
app is going to be in a file called doc_app.py. It is fairly straightforward. I've added some os and time imports in order to read environment variables and also time the query execution.
I'm importing Streamlit, then I'm using dotenv to load the .env file. You can use that file to set your defaults for the OpenAI URL, API key, and model name. If you want to use the actual OpenAI API, you can just change things here. From our Streamlit helpers, I'll also import run_query, which is our main function to run the query; we'll get back to that later. First, the app loads the .env file if it exists and then reads the parameters from the environment.
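That startup code looks roughly like this (a sketch; the environment variable names are examples of the kind of defaults described, not necessarily the exact names used in the repository):

```python
import os

import streamlit as st
from dotenv import load_dotenv

# Load defaults from a local .env file if it exists
load_dotenv()

# Example variable names: point these at OpenAI instead to use the real API
openai_url = os.getenv("OPENAI_URL", "http://localhost:11434/v1")
openai_api_key = os.getenv("OPENAI_API_KEY", "not-needed")
model_name_default = os.getenv("MODEL_NAME", "mixtral")
```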
After that, I'm building my app. If we go back to the app, I have a title here, which is my PDF Document Analyzer, with a little description. Then I have a sidebar with different widgets that I use to get the user input and configuration. So, how is that built? Well, it's fairly simple. Using Streamlit, I write the title and then my little description. And
then I create a sidebar using with sidebar, and then everything that is within that
with block will be part of the sidebar. So, I've got a header for the model. I'll use a text
input for the model name, but I'll get the default one from the environment variable if it exists.
Then I've got a slider for the temperature. The default value is 0.1, but you can adjust it as
you want. Next, we have the documents section, where you can upload a PDF file. I'm using a file uploader that accepts PDF files. I also have a selector for the page range: I'm creating two columns so that the inputs sit side by side. The first one will have the start page, which is a number input with a default value of zero and a minimum value of zero, and the second will have the end page, which has a default value of minus one, which in Python just
means you're taking the last element. So, it's going until the end, but you can actually change
it. This one will also accept negative values. Then I also have a radio button to select the query type, either summarize or query, so one of the two approaches we've been using. The sidebar construction looks roughly like the sketch below.
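Here is a sketch of that sidebar (the widget labels and default values are illustrative, and model_name_default comes from the environment-loading snippet above; the exact layout in the repository may differ slightly):

```python
st.title("PDF Document Analyzer")
st.write("Summarize or query a PDF document using a local large language model.")

with st.sidebar:
    st.header("Model")
    model_name = st.text_input("Model name", value=model_name_default)
    temperature = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.1)

    st.header("Document")
    uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

    # Two columns so the page inputs sit side by side
    col1, col2 = st.columns(2)
    with col1:
        start_page = st.number_input("Start page", value=0, min_value=0)
    with col2:
        # -1 means "up to the last page", as with Python slicing
        end_page = st.number_input("End page", value=-1)

    query_type = st.radio("Query type", ["Summarize", "Query"])
```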
Now we're outside of that with block. What I'm adding here is that if the radio button value is query, we also need to get the user query. In that case, I'll display a text area that asks for
the user query, and I'll provide a default query here. Finally, I'll add a button. The way you define a button in Streamlit is a bit unusual: there's an if statement where you put the name of your button, and whatever is under that if statement runs once the button is clicked. So, first, I'll initialize my result to None. I'll start my timer. I'll check
that there's a file. If there's no file, I'll write an error and say "Upload a file." And then
after that, if there is a file, what I'll do is I'll display a status. I'll run my query, passing
all the parameters to this big function. And then when I'm done, I'll update the status to "Done."
And if there's an error, I'll just catch it here. It's a catch-all, and I'll just display the error.
And that's it. After that, I have an if: if the result is not None, meaning a result was generated, I'll create a new container with a border that I'll label "Result," and display the result there. I'm wrapping it in a markdown display. Some LLMs will generate results
in kind of markdown-like format, so in that case, it will get rendered. Otherwise, it's just
displayed as text. And then I'll have a little info bar with the time it took to run the model.
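A sketch of the query input, the run button, and the result display, following the description above (run_query is the helper from the Streamlit helpers file discussed next; it needs a reasonably recent Streamlit for st.status and bordered containers, and the parameter names are illustrative):

```python
import time

user_query = ""
if query_type == "Query":
    user_query = st.text_area("User query", value="What is the data used in this analysis?")

if st.button("Run"):
    result = None
    start = time.time()
    if uploaded_file is None:
        st.error("Please upload a file first.")
    else:
        with st.status("Running...") as status:
            try:
                result = run_query(
                    uploaded_file=uploaded_file,
                    summarize=(query_type == "Summarize"),
                    user_query=user_query,
                    start_page=start_page,
                    end_page=end_page,
                    model_name=model_name,
                    temperature=temperature,
                )
                status.update(label="Done.", state="complete")
            except Exception as e:  # catch-all: just show the error
                st.error(f"An error occurred: {e}")

    if result is not None:
        with st.container(border=True):
            st.markdown("**Result**")
            st.markdown(result)  # rendered as markdown when the LLM returns it
        st.info(f"Took {time.time() - start:.1f} seconds to run.")
```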
And the last thing that we haven't looked at is how to run the query. So, let's look at that.
So, here in that Streamlit helpers file, I have two functions. One is save_uploaded_file. Why? Because the LangChain document loader needs an actual file on disk, whereas the Streamlit file uploader only keeps the file in RAM. So I take the uploaded file, write it to the temporary folder, and return the path to that temporary file. I'll use that in run_query. The run_query function uses the different functions that we've
created before. It takes the uploaded file, which comes directly from Streamlit, a Boolean that says whether to summarize (if not, it's a query), the user query (which can be empty when summarizing), the start and end pages, and the model parameters. It will first load the file and then either summarize or query. To load the file, I call my save_uploaded_file function, but before that, I write a status update. Now, what's
not too clear when we look at this is that our query is running inside a with st.status block, so anything that gets written within that block becomes part of the status widget. If we go back to our run_query function, we see that st.write calls end up inside that status widget; we're just providing more information about the progress. We're saving the uploaded file.
Once we've got that file saved on disk, we're loading the PDF. We'll use unlink to delete the
temporary file. And now, we'll either summarize, so if we summarize, we'll call our summarize
document function. If we're querying, we'll call our query document function, and we'll return the
result. And that's all there is to it; a sketch of those helpers follows.
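A rough sketch of those two helpers (the signatures, the load_pdf helper, and the summarize_document and query_document calls are my illustration of the description above, not necessarily the exact code in the repository):

```python
import tempfile
from pathlib import Path

import streamlit as st


def save_uploaded_file(uploaded_file) -> Path:
    """Write the in-memory upload to a temporary file and return its path."""
    temp_path = Path(tempfile.gettempdir()) / uploaded_file.name
    temp_path.write_bytes(uploaded_file.getbuffer())
    return temp_path


def run_query(uploaded_file, summarize, user_query, start_page, end_page,
              model_name, temperature):
    # These st.write calls show up inside the caller's st.status block
    st.write("Saving the uploaded file...")
    file_path = save_uploaded_file(uploaded_file)

    st.write("Loading the PDF...")
    docs = load_pdf(file_path, start_page=start_page, end_page=end_page)
    file_path.unlink()  # delete the temporary file

    if summarize:
        st.write("Summarizing the document...")
        return summarize_document(docs, model_name=model_name, temperature=temperature)

    st.write("Querying the document...")
    return query_document(docs, user_query, model_name=model_name, temperature=temperature)
```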
Now, if we run our Streamlit app with streamlit run doc_app.py, it loads. Here, I have the model name, loaded from my .env file, but I could change it if I want. I have the temperature. I'll upload my PDF file here; it says the file uploaded successfully. I can set the start and end page, say two or three, something like that. Then I can either query or summarize.
If I click "Query," the text box appears here. But what I'll do instead is run our simple summary. So, I'll click "Run." Now we see we're saving the uploaded file, we're loading
the documents, we're summarizing the document, and now we have it. It took a little bit over 1
minute. We have our result, which is the summary of the paper. So, now you have all the building
blocks. You can play with the prompts and try your own types of queries. Let me know how it works for you and what else I should try. You can also try different models. Here, I'm using Mixtral.
Obviously, Llama2 is much faster. It's not as powerful, but I have also had good results with
that one. So, let me know in the comments what you try and what works and what doesn't work. But this
is how you can summarize PDF documents locally, privately, on your own computer. That's it. If
you enjoyed the video, please like and subscribe.