Summarize PDFs with a Local AI (Private GPT) in Python

Captions
In this video, I'll show you how to write your own app to process, summarize, and query PDF documents using large language models that run privately and locally on your own machine, for free. This is Vincent Codes Finance, a channel about coding for finance research. If you're interested in that topic, please subscribe. In my previous video, where I showed how to install a ChatGPT clone locally on your machine, Open WebUI had a feature that let you upload documents and query them. However, that feature only looks at part of the document at a time; it doesn't consider the whole document. Many of you asked in the comments whether it was possible to summarize documents using Open WebUI. I haven't found a way, so I decided instead to show you how to build your own app to do it. That's what we'll be doing today. The app I'm building runs fully locally on my machine. I'll be using Ollama to serve the models, and to build the app I'll be using LangChain, a library that provides many convenience tools for working with large language models programmatically. I'm doing everything in Python today, but the library is also available in JavaScript if you're more comfortable with that. Finally, I'll be using Streamlit, a Python module that lets you build interactive dashboards very easily. All the code for this tutorial is available on GitHub; see the video description for the link to the repository and to my blog, which provides written instructions.

The app I'm building today is based on examples from the LangChain documentation. There is a diagram in that documentation that explains the two methods we'll be looking at. First, we'll use the stuffing method to produce a summary. Then I'll use the map-reduce method, not to produce a summary (although you could, and that's what the documentation example does), but to query the document. The way this works is that you first load the documents. When we load the PDF, each page becomes its own document, so we'll have a list of documents, which are the pages from the PDF. We extract the text and run our query on that text. With the stuffing method, we simply put all of that text into the context of our query, combine everything into one big prompt that we submit to the large language model, and the result is our summary. The map-reduce process works in multiple steps: we take each page and run a query on that page to find and summarize the information relevant to our question, and then we run a final query that takes the per-page results and aggregates them into the final answer.

First, we'll use Ollama to serve our models. To install Ollama, go to their website and click download, or, if you're using Homebrew on a Mac, run "brew install ollama". Then you'll need to download a model to work with; have a look at their model library. I'll be using Mixtral for my examples here. To install that model, run "ollama pull mixtral", and it will download the model. It is quite big, so it's going to take a few minutes to download.
For this project, you'll need to install the following dependencies. I'm obviously using langchain, but to use all the LangChain features I want, I'll also need pypdf, which LangChain uses to load PDFs. Even though we won't be using OpenAI for this project, we'll need the langchain_openai and openai modules, because Ollama exposes an OpenAI-compatible API and that's what we'll be talking to. That way, if you ever want to take your app and run it against OpenAI instead of locally, there are only a few settings to tweak and everything will work. We also need tiktoken to help count the tokens in our queries, python-dotenv to load our .env file, and streamlit to build our dashboard. Finally, I have one last dependency, rich. I'm not using it in this tutorial, but my repository also includes a CLI tool, a command-line tool that does the same things, and it uses Rich for its output. As a sample document, I'm using one of my research papers; the link is in the description if you ever want to read it. It is open access, so you can download it for free, and it's what we'll be summarizing and querying.

First, we need to read our documents. For that, we'll use LangChain, which provides document loaders for different types of files, whether they are local files such as the PDFs we're using today, CSVs, or other formats; it also has integrations for loading data from third-party providers. In our case, we'll use the PDF document loader, which relies on pypdf, which is why we need it as a dependency. We'll be following roughly the example from the documentation. Note that the loader also supports extracting images from PDFs, which I won't be doing in this tutorial. To read the file, we import the pypdf loader from LangChain, provide the file path, create our PDF loader, and call load, which produces a list of documents: one document for each page in my PDF.

Next, we'll summarize the document using the stuffing method: read the document, put it all in one query, and ask the large language model to summarize it. This is inspired by the example in LangChain; I tweaked it a bit so that it gives slightly better results, in my opinion. The first thing to do is create our prompt. You can play with it to see whether you get better results, but what I have is: "Write a long summary of the following document, only including information that is part of the document, and do not include your own opinion or analysis." Then I provide the document. It is a prompt template from LangChain, so I can add curly braces for variables that LangChain will fill in at runtime; this one is the document. I finish with "Summary:" and ask the large language model to complete the prompt. I create my prompt from the template, and next I define my large language model chain. We'll be using Ollama, but through Ollama's OpenAI-compatible API, so if you ever want to switch to GPT-4 in the future, you only have to change a few parameters.
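Before moving on to the model setup, here is a rough sketch of the loading step and the summary prompt described above. The file name is a placeholder, and the import paths assume a recent LangChain release where the community loaders live in langchain_community; older versions expose PyPDFLoader from langchain.document_loaders instead.

```python
# Sketch: load the PDF into one LangChain Document per page, then define the
# summary prompt. "paper.pdf" is a placeholder for your own file.
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import PromptTemplate

loader = PyPDFLoader("paper.pdf")
docs = loader.load()
print(f"Loaded {len(docs)} pages")

prompt = PromptTemplate.from_template(
    "Write a long summary of the following document, only including "
    "information that is part of the document, and do not include your "
    "own opinion or analysis.\n\n"
    "Document:\n{document}\n\n"
    "Summary:"
)
```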
For the model, we set a few parameters (sketched in code below). First, the temperature, which controls how much randomness there is in the model: if you set it to zero, the output is almost deterministic and you should always get the same output for the same prompt, whereas adding some temperature adds randomness, so you can play with it and see what it gives you in terms of results. Then we set the model name; I'll be using Mixtral, as I said. We have to provide an API key even though everything is local; since I'm running Ollama locally, I can pass anything, it doesn't really matter. And I change the base URL so that instead of the OpenAI endpoint it points at my local machine. With that, I create my large language model chain, an LLM chain from LangChain. Finally, I create the full chain using a Stuff Documents chain from LangChain, which takes two things: the LLM chain we just created, and the name of the variable that represents the document in our prompt, which in our case is "document" (if we go back to our prompt, that's the variable). That's all I need to create my chain. Once the chain is created, I can invoke it; to invoke a Stuff Documents chain, I provide the documents I want it to run on, get the results, and print them so we can see what we get.

It took a while, obviously. The first time you call a model, Ollama has to load it into memory, so there's an overhead on the first query, but it took a bit over a minute on my computer. To look at the output in a nicer way, I use textwrap, a built-in Python library, to format it. We do get a summary of the paper, but it doesn't really have anything to do with the paper: it claims the paper examines the post-earnings announcement drift (PEAD), when the paper doesn't talk about that at all. Where is it getting this information? From the references: at the end of the document there is a list of references that discuss these things, and that's where the information comes from. Obviously, I don't want that; it's not part of the article. To avoid this, I can select the pages I want to use in my summary, for example by dropping the last two pages, which contain mostly references. To do that, I slice my list of documents so that the last page I keep is the third from the end, excluding the last two pages. If I reinvoke my chain on these documents, I get a much better summary that actually discusses what the paper is about, and it's completely different from the abstract, which is also quite interesting. By the way, if you've been enjoying the video so far, please like it so that others also get to discover it. Okay, so that was the first thing I wanted to show you: this is how we can work with a document by simply stuffing its full content into one query.
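Here is a rough sketch of that stuffing pipeline, reusing the docs list and prompt from the earlier sketch. The model name, the dummy API key, and the localhost base URL mirror what the video describes for Ollama's OpenAI-compatible endpoint; LLMChain and StuffDocumentsChain are the classic LangChain chain classes, so exact import paths may differ slightly between versions.

```python
# Sketch: summarize with the "stuffing" method against a local Ollama server.
from langchain_openai import ChatOpenAI
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

llm = ChatOpenAI(
    temperature=0.1,                        # 0 is near-deterministic; raise for more randomness
    model="mixtral",                        # model served by Ollama
    api_key="not-needed",                   # Ollama ignores the key, but one must be provided
    base_url="http://localhost:11434/v1",   # Ollama's OpenAI-compatible endpoint
)

llm_chain = LLMChain(llm=llm, prompt=prompt)
stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain,
    document_variable_name="document",      # matches {document} in the prompt
)

# Drop the last two pages (mostly references) so they don't pollute the summary.
result = stuff_chain.invoke(docs[:-2])
print(result["output_text"])
```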
It's good; it summarizes, but it isn't necessarily extracting the information I want, and sometimes I want something other than a summary. Another option is to query the document with a prompt that changes from query to query, and for that we'll use the map-reduce approach. A common way to query a document would be a RAG approach: split the document into chunks, generate embeddings (vector representations of the text in those chunks), store them in a database that you can query by embedding, then embed the user's query, do a semantic search over the chunks, retrieve the most relevant ones, and use those to build the answer. That's not what we'll be doing today. If you're interested in that approach and would like me to make a video on it, I'll be happy to; just let me know in the comments. Here, I'm taking a simpler but much more processing-intensive, brute-force approach: map-reduce. The idea is to look at every page in the document, ask the large language model whether that page is relevant to the query and, if so, to extract the information relevant to it; after doing that on every page, I put all the results back together and produce the final answer from them. That's the mapping (applying the first prompt to every page) and the reducing (taking those outputs and aggregating them into a final answer). For this, I'll use a sample query: "What is the data used in this analysis?"

I need two parts. The first is the map part, which is applied to every document. The map prompt starts with "The following is a set of documents." It's actually not a set of documents, it's just a page, but we'll keep that wording because we'll be using the Stuff Documents chain to build it, so it keeps things consistent. Then we ask the model: based on this list of documents, please identify the information that is most relevant to the following query (the user query); if the document is not relevant, please write "Not relevant." And then I ask for a helpful answer. From that, I create my prompt and partially fill it: I set the user query directly in the prompt, and then pass the resulting map prompt, already filled with the user query, to an LLM chain, which I'll call the map chain. This one is used for mapping.

After that, we need a reduce part, which uses a different prompt: "The following is a set of partial answers to a user query" (these are my documents); "take these and distill them into a final, consolidated answer to the following query" (the user query); "Here is my complete answer." Again I create my prompt from the template and partially fill it, setting the user query directly so that it's already integrated. With both prompts in place (sketched below), I'll finally build the full chain.
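A sketch of the two prompt templates and the partial fill described above. The exact wording is paraphrased from the video; the {documents} variable name matches what the Stuff Documents chain will fill in, partial() is the standard LangChain way to pre-fill a template variable, and the llm object is the one defined in the earlier sketch.

```python
# Sketch: map and reduce prompts, each pre-filled with the user query.
from langchain_core.prompts import PromptTemplate
from langchain.chains.llm import LLMChain

user_query = "What is the data used in this analysis?"

map_template = (
    "The following is a set of documents:\n{documents}\n\n"
    "Based on this list of documents, please identify the information that is "
    "most relevant to the following query:\n{user_query}\n"
    "If the documents are not relevant, please write \"Not relevant\".\n\n"
    "Helpful answer:"
)
map_prompt = PromptTemplate.from_template(map_template).partial(user_query=user_query)
map_chain = LLMChain(llm=llm, prompt=map_prompt)   # applied to every page

reduce_template = (
    "The following is a set of partial answers to a user query:\n{documents}\n\n"
    "Take these and distill them into a final, consolidated answer to the "
    "following query:\n{user_query}\n\n"
    "Complete answer:"
)
reduce_prompt = PromptTemplate.from_template(reduce_template).partial(user_query=user_query)
```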
To build the full chain, I'll use a MapReduce Documents chain and a Reduce Documents chain. First, I need to create the LLM chain for the reduce part, still using the same LLM we defined earlier, together with my reduce prompt. Then we build a few chains. The first is a Stuff Documents chain that combines documents using our reduce chain; that's what reduces everything at the end. This chain is passed to a Reduce Documents chain, which takes all of the mapped documents and reduces them together using that combine-documents chain. As it combines the different outputs, it also makes sure not to exceed the maximum number of tokens; if the limit is reached, it produces an intermediate answer and uses that in a further reduce step. Finally, we put all of that together in a MapReduce Documents chain: we set the map LLM chain, the Reduce Documents chain we just defined, the document variable name, and we choose not to return the intermediate steps, since we're only interested in the final result (sketched in code below). As you can imagine, this generates a lot more queries than the first method.

Okay, it's completed. It took a bit over 5 minutes, which is a bit longer than my earlier trials; I think it's mostly because I'm recording at the same time and that's maxing out my computer, making things slower than before. Still, it takes a couple of minutes to run all the steps, because we are issuing a lot of queries. The answer it provided is fairly decent: it gives a good summary of the data used in that project. That's all good, but if I want to summarize documents often, this is not a very nice interface to use. I did build a CLI interface for myself; it's in the GitHub repository, and you can have a look at it and download it. What I think is even nicer, though, is a proper UI, so the next step is to package all of this in a nice UI that I can reuse more often. We'll build it with Streamlit, a Python module that lets you build interactive apps that run in your browser. You can also deploy Streamlit apps online, but since we're running local models here, that wouldn't quite work; we'll just use it to build our UI. Looking at the structure of my project, I've added a few files. There's a pyproject.toml that defines all the dependencies, and I've repackaged most of my code into a module I call documents_llm. It contains a helper to load a PDF (and also a text file, although here we'll only use the pypdf one): a simple function that takes either a path or a file, plus the start page and end page I want, and wraps the pypdf loader we've just seen. Then there's a summarize.py file that packages our summarize-document function: it takes a list of documents and some model parameters, creates the summarization chain, invokes it, and returns the output text. This is just what we've done before.
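For reference, here is a rough sketch of the map-reduce chain assembly described above, reusing the map chain and reduce prompt from the earlier sketch. The token_max value is an illustrative assumption, and the chain classes follow the classic LangChain API, so import paths may vary with your version.

```python
# Sketch: assemble the map-reduce pipeline from the map chain and reduce prompt.
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.combine_documents.reduce import ReduceDocumentsChain
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain

reduce_llm_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Stuffs the per-page answers into the reduce prompt to produce one answer.
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_llm_chain,
    document_variable_name="documents",
)

# Reduces all mapped outputs, collapsing in stages if they exceed token_max.
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=4000,   # illustrative limit
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,                        # applied to every page
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="documents",
    return_intermediate_steps=False,            # only keep the final answer
)

result = map_reduce_chain.invoke(docs[:-2])
print(result["output_text"])
```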
In query.py, I did the same thing with our map-reduce chain: it takes documents, a user query, and the model parameters, builds the model and the map-reduce chain (I abstracted that part away), invokes the chain, and returns the output text. The get_map_reduced() chain-building function is just what we've done before; it packages all the steps that combine our chain. I also have some Streamlit helper functions that we'll come back to in a moment. My Streamlit app lives in doc_app.py, and it is fairly straightforward. I've added os and time imports to load environment variables and to time the query execution. I import Streamlit, then use dotenv to load the .env file; you can use it to set your defaults for the OpenAI URL, API key, and model name, and if you want to use the actual OpenAI API you can just change things there. From the Streamlit helpers I also import run_query, our main function for running the query, which we'll get back to later. So, first, the app loads the .env file if it exists and reads the parameters from the environment, and after that I build the app itself. Going back to the app, there's a title, "PDF Document Analyzer", with a little description, and then a sidebar with different widgets for the user input and configuration.

How is that built? It's fairly simple. Using Streamlit, I write the title and then my little description. Then I create a sidebar using "with st.sidebar", and everything inside that with block becomes part of the sidebar. There's a header for the model; I use a text input for the model name, with the default taken from the environment variable if it exists. Then there's a slider for the temperature, with a default value of 0.1 that you can adjust as you like. Next comes the documents section, where you can upload a PDF file; I use a file uploader restricted to PDF files. I also add a selector for the page range: I create two columns so the inputs sit side by side, the first with the start page, a number input with a default value of zero and a minimum of zero, and the second with the end page, which has a default of minus one. In Python, minus one just means the last element, so by default it goes to the end, but you can change it, and it also accepts other negative values. Then there's a radio button to select the query type, either summarize or query, the two modes we've been using. Outside that block, I add one more thing: if the radio button value is query, we also need the user's query, so in that case I display a text area asking for it, with a default query provided. Finally, I add a button. The way you define a button in Streamlit is a bit unusual: you write an if statement with your button in the condition, and whatever is under that if statement runs once the button is clicked (a sketch of this layout follows below).
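A rough sketch of the sidebar and button layout just described. Widget labels, default values, and the MODEL_NAME environment variable name are approximations of what the video shows, and run_query stands in for the helper from the repository's documents_llm package.

```python
# Sketch: doc_app.py layout — sidebar configuration plus the run button.
import os
import time

import streamlit as st
from dotenv import load_dotenv

load_dotenv()  # defaults for base URL, API key, and model name come from .env

st.title("PDF Document Analyzer")
st.write("Summarize and query PDF documents with a locally served LLM.")

with st.sidebar:
    st.header("Model")
    model_name = st.text_input("Model name", value=os.getenv("MODEL_NAME", "mixtral"))
    temperature = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.1)

    st.header("Document")
    uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")
    col1, col2 = st.columns(2)
    start_page = col1.number_input("Start page", value=0, min_value=0)
    end_page = col2.number_input("End page", value=-1)  # -1 means "until the last page"

    query_type = st.radio("Query type", ["Summarize", "Query"])

if query_type == "Query":
    user_query = st.text_area("User query", value="What is the data used in this analysis?")
else:
    user_query = ""

if st.button("Run"):  # everything under this if runs when the button is clicked
    result = None
    start = time.time()
    ...
```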
Inside the button block, I first initialize my result to None, start my timer, and check that there's a file. If there's no file, I write an error saying "Upload a file." If there is a file, I display a status, run my query by passing all the parameters to that big function, and when it's done I update the status to "Done." If there's an error, I catch it with a catch-all and just display the error. After that, there's an if on the result: if the result is not None, meaning a result was generated, I create a new container with a border labeled "Result" and display the result, wrapping it in a markdown display. Some LLMs generate results in a markdown-like format, so in that case it gets rendered; otherwise it's just shown as text. Then there's a little info bar with the time it took to run the model.

The last thing we haven't looked at is how the query is run, so let's look at that. In the Streamlit helpers file, I have two functions. One is save_uploaded_file. Why? Because the LangChain document loader needs an actual file on disk, whereas the Streamlit file uploader only keeps the file in RAM. So I take that file, write it to the temporary folder, and return the path to that temporary file. I use it inside run_query, which relies on the different functions we created before. It takes the uploaded file, straight from Streamlit, then a Boolean saying whether to summarize (if not, it's a query), the user query, which can be empty when summarizing, the start page, the end page, and the model parameters. It first loads the file and then either summarizes or queries. To load the file, I call my save_uploaded_file function, but before that I write a status update. What isn't obvious at first glance is that our query runs inside a "with status" block, so anything written within that block becomes part of the status widget; if we go back to run_query, the st.write calls end up inside that status bar, just providing more information about progress. We save the uploaded file, and once it's on disk we load the PDF and use unlink to delete the temporary file. Then we either summarize, calling our summarize-document function, or query, calling our query-document function, and return the result. And that's all there is to it (sketched below).

If we run our Streamlit app with "streamlit run doc_app.py", it loads the page. Here I have the model name, loaded from my .env file, which I could change if I wanted, and the temperature. I load my PDF file, and it says "File uploaded successfully." I can set the start and end page, say two or three, something like that, and then either query or summarize. If I click "Query," the text box appears, but here I'll just do a simple summary, so I click "Run." We see the status updates: saving the uploaded file, loading the documents, summarizing the document, and there it is. It took a little over a minute, and we have our result, the summary of the paper.
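For reference, a rough sketch of the save_uploaded_file helper and the run_query flow described above. The function signatures, the load_pdf, summarize_document, and query_document names, and the status messages are approximations based on the walkthrough, not the repository's exact code.

```python
# Sketch: Streamlit helpers — persist the upload to disk, then run the query
# inside a status block so st.write() calls show up as progress messages.
from pathlib import Path
import tempfile

import streamlit as st


def save_uploaded_file(uploaded_file) -> Path:
    # The LangChain loader needs a real file; the Streamlit upload lives in RAM.
    temp_path = Path(tempfile.gettempdir()) / uploaded_file.name
    temp_path.write_bytes(uploaded_file.getvalue())
    return temp_path


def run_query(uploaded_file, summarize: bool, user_query: str,
              start_page: int, end_page: int, model_name: str, temperature: float) -> str:
    st.write("Saving the uploaded file...")
    file_path = save_uploaded_file(uploaded_file)

    st.write("Loading the documents...")
    docs = load_pdf(file_path, start_page, end_page)   # hypothetical helper from documents_llm
    file_path.unlink()                                  # delete the temporary file

    if summarize:
        st.write("Summarizing the document...")
        return summarize_document(docs, model_name, temperature)      # hypothetical helper
    st.write("Querying the document...")
    return query_document(docs, user_query, model_name, temperature)  # hypothetical helper


# In doc_app.py, the button handler wraps run_query in a status block:
# with st.status("Analyzing the document...") as status:
#     result = run_query(uploaded_file, query_type == "Summarize", user_query,
#                        start_page, end_page, model_name, temperature)
#     status.update(label="Done", state="complete")
```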
So, now you have all the building blocks. You can play with the prompts and try your own types of queries; let me know how it works for you and what I should try. You can also try different models: here I'm using Mixtral, and Llama 2 is obviously much faster; it's not as powerful, but I have also had good results with it. Let me know in the comments what you try, what works, and what doesn't. This is how you can summarize PDF documents locally, privately, on your own computer. That's it. If you enjoyed the video, please like and subscribe.
Info
Channel: Vincent Codes Finance
Views: 2,232
Keywords: researchtips, research, professor, datascience, dataanalytics, dataanalysis, bigdata, data science, python pandas, big data, chatgpt, gpt, ollama, artificial intelligence, chat gpt, machine learning, uncensored, opensourceai, llama2, mistral, private, privacy, opensource, javascript, code, programming, python, langchain, streamlit
Id: Tnu_ykn1HmI
Length: 28min 57sec (1737 seconds)
Published: Sat Mar 30 2024