Marker: This Open-Source Tool will make your PDFs LLM Ready

Captions
The availability of good data can really make or break your LLM application. Most text data is available in PDF format. This is true for enterprises as well as for personal documents. But working with PDFs for LLMs is extremely hard. PDF is essentially a broken format. PDFs usually have a complex structure, with nested elements of different data types. There is absolutely no standard layout, which makes it very cumbersome to extract data from them. And different encodings, fonts, formatting, tables, and images add to the headache.

Now, to make PDFs LLM-ready, there are a number of approaches that people have been exploring. There are approaches that convert PDFs to plain text for easier parsing, then use machine learning models to detect the layout of the PDF, then use optical character recognition (OCR) models to detect text on the pages. So it's actually a cumbersome task, which is prone to errors.

However, working with Markdown is very easy when it comes to LLMs, because you can easily convert it to plain text. Markdown can retain the original formatting, so you can have titles, headers, images, and tables in there, and LLMs can effectively process Markdown's structured elements. So the goal of this video is to show you an open-source tool that you can use to convert your complex PDF files into well-structured Markdown.

If you want to convert your PDF files into Markdown, you have some paid options like Mathpix, which will convert your PDFs into Markdown or extract readable text from them. If you're looking for open-source options, then you have Nougat, which is an open-source project from Meta, but it is mainly focused on academic documents. I personally find this new project called Marker to be extremely helpful. It will let you convert PDFs to Markdown quickly and accurately. And here is how it performs compared to Nougat. So it's much faster. One page of text
will take about a hundred seconds, compared to about 400 seconds when you try to do that using Nougat. And the accuracy is almost double that of Nougat as well.

So here's a quick example of what the difference in performance between Nougat and Marker can look like. This is a book called Think Python, and the authors actually tried to convert it using both Nougat and Marker. Here's the output that you get from Nougat. It basically completely ignored the first few pages along with the table of contents, but Marker is able to preserve everything. So here are the first couple of pages. And then, at the end, we have the table of contents, which was accurately extracted, and after that we have the first chapter. Nougat actually treats the headers and footers as part of the first chapter, and also brings in the preface right in the middle of the first chapter. But for this specific document, Marker seems to be able to preserve the structure of the book.

So let's look at some features of Marker before I show you how to get started with it. It supports a wide variety of documents. It's optimized for books and scientific papers, but I have tested it on things like resumes and it works pretty great. Now, it says it supports all languages. I'm not sure what exactly the authors mean by all languages. It removes headers, footers, and other artifacts, which is evident when you look at that Think Python book. It formats tables and code blocks. It extracts and saves images along with the Markdown, so it will actually extract images and store them separately. It will convert most equations to LaTeX, depending on how complex the equations are. And the great thing is that it runs on GPU, CPU, or MPS if you have Apple silicon. If needed, it will do OCR on your text as well. It uses Surya, which is another package created by the same person who created Marker.
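To see which of those three backends (GPU, MPS, or CPU) a given machine would offer, a small probe like the one below can help. This is only a sketch and not Marker's own device-selection logic: `pick_device` and `detect_device` are hypothetical helper names, and the only library calls assumed are PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple silicon (MPS), then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends exist; report "cpu" if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Keeping the preference order in a pure function makes it easy to check without any particular hardware.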
Now, there are some limitations as well. Because PDF is a tricky format, Marker will not convert 100 percent of equations to LaTeX, which is understandable, and tables are not always formatted 100 percent correctly. I have actually noticed this, and I'll show you a couple of examples of where it fails, so you want to pay close attention to that. Whitespace is not always respected, and not all line spans will be joined properly. So there are definitely some limitations, but in my tests it seems to work on most PDF files.

It's an open-source tool, but there are some restrictions around usage. If your organization is making under 5 million dollars in gross revenue in the most recent 12-month period and has under 5 million dollars in lifetime VC funding, then you can use this in your commercial projects. If it's more than that, then you need to get a license, which is completely understandable, because these open-source projects take effort, time, and compute to build.

Okay, so enough talking. Let me show you how you can start installing this and converting your PDF files into structured Markdown. First we will create a new conda environment, and I'm going to call it marker. I already have a virtual environment by the same name, so it's going to ask me whether I want to remove the existing environment, and I'm going to say yes. This will basically delete the existing environment and then create the new virtual environment for me. Now, one thing you want to do when you create a new virtual environment is install PyTorch as well.

Okay, so we have our new virtual environment created. I am going to activate it using conda activate marker. And as I said, you want to install PyTorch. The command depends on your operating system; I'll put a link to this page. You need to select your operating system. I am currently using a Mac, so we're going to be installing using pip, and we have Python.
So this is the command that I'm going to be using. If you're on Linux, here's the command that you need to use. On Windows, here is the command. The only difference is that I'm not going to be using pip3; I will just use pip directly. So: pip install torch torchvision torchaudio. This will download PyTorch onto my machine and install it.

Okay, so to install Marker, we're going to use the pip install marker-pdf command. If you want to do OCR on top of it, you can install ocrmypdf. It's an optional package, and here are the instructions on how to do that. In my case, I'm not going to install it. I think it will use the Surya package by default.

Once the package is installed, you can either convert a single PDF file to Markdown or convert multiple files. There are two different commands for these two cases. We're going to start with converting a single file. You will use the marker_single command in your terminal. Then you need to provide the path of the file that you want to convert, then where you want to store the output. And there are some optional parameters, for example the batch multiplier and the maximum number of pages you want to convert. And since it supports multiple languages, you can actually provide the language the document is in. This will help improve the OCR process if it has to run. But all the documents that I'm testing are in English, so I'm not going to try this.

For my experiments, we're going to be using three different files. Here is a scientific paper. It is laid out in two columns, and there are some images in there. Now, this is a relatively simple document because there are just two columns. The tables are well structured, so nothing weird in these tables. And there are some images as well. And we also have some equations in LaTeX format. So let me show you how to convert this PDF file into structured Markdown. Okay, so here we go.
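The marker_single call described above (input file, output folder, then optional parameters) can be sketched as an argument list built in Python. Treat this as an illustration under assumptions: `build_marker_single_command` is a hypothetical helper, `paper.pdf` is a made-up file name, and the flag spellings `--batch_multiplier`, `--max_pages`, and `--langs` are assumed from the Marker release current at the time of the video; they may differ in other versions, so check the tool's own help output before relying on them.

```python
import shlex
from typing import List, Optional


def build_marker_single_command(
    pdf_path: str,
    output_dir: str,
    batch_multiplier: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a single-file conversion.

    The two positional arguments follow the order given in the video:
    the PDF to convert, then the folder to write the output to.
    Flag names are assumptions (see the note above).
    """
    cmd = ["marker_single", pdf_path, output_dir]
    if batch_multiplier is not None:
        cmd += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs is not None:
        cmd += ["--langs", langs]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is converted by accident.
    print(shlex.join(build_marker_single_command("pdf_files/paper.pdf", "output", max_pages=10)))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it; building a list rather than a single string avoids quoting problems with paths that contain spaces.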
My file is in this pdf_files folder. So we're going to use the marker_single command, then the path of the file, and then the output folder. I created an output folder, which is currently empty, so it's going to create the Markdown and store it there. Now, when we run this, it will first download the OCR model, which is Surya, if it's not already downloaded. Then it will try to figure out the layout of your PDF file. It draws bounding boxes across the pages to detect the different elements within the file. That's how it's going to extract text and also ensure that the text is in the proper position. So basically, it first figures out the bounding boxes and, as I said, the structure of the whole document, and after that it extracts the text from within those boxes. This will take a couple of minutes, depending on the length of your document.

So it took a few minutes and created a new folder inside the output folder. Let's see what it created. Inside the folder, we have all the images that were extracted from the document. There are actually five images in the original document, and it's able to extract all of them, so this is pretty neat. Next, we have a JSON file with all the metadata. It detected English as the language, the file type is PDF, and there are a total of 10 pages, which is accurate. Then I think it looks at the number of tables and the number of equations: it was able to parse seven different equations, and there are four different tables. So you also get this pretty neat JSON file, and you can parse it and check whether the information in it is actually correct.

Now, here is the actual Markdown that you get. Here's the title of the paper, then the authors as well as their affiliations. It also correctly extracted the abstract. Overall, it looks pretty neat.
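To check an output folder like the one just described without clicking through it, a short script can list the Markdown and image files and load whatever JSON metadata is present. This is a sketch under assumptions: `summarize_output` is a hypothetical helper, the image extensions are a guess, and no particular key names inside the metadata file are assumed, so the JSON is returned as loaded for you to inspect.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Assumed image extensions; adjust to whatever the converter actually writes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def summarize_output(folder: str) -> Dict[str, Any]:
    """List Markdown and image files under `folder` and load any JSON files."""
    root = Path(folder)
    files = sorted(p for p in root.rglob("*") if p.is_file())

    metadata: Dict[str, Any] = {}
    for path in files:
        if path.suffix.lower() == ".json":
            with path.open(encoding="utf-8") as handle:
                metadata[path.name] = json.load(handle)

    return {
        "markdown": [p.name for p in files if p.suffix.lower() == ".md"],
        "images": [p.name for p in files if p.suffix.lower() in IMAGE_SUFFIXES],
        "metadata": metadata,
    }
```

Comparing the reported image count and the metadata values against the original PDF (page count, number of tables and equations) is the kind of sanity check mentioned above.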
Now, for images, it's basically providing a reference to where the image is stored. So this is pretty neat. And I think it is able to extract the equations as well, so that's pretty good. The tables in this case are nicely preserved. We can actually look at a preview of this, so I'll right-click on this and click on Open Preview. Here is a preview of the paper in Markdown format, which looks pretty well structured compared to the original document. So here's the original document, and I have the preview open on the side. I think it's able to preserve the relative locations. Here's the related work: you have the related work on top and then that image right in the middle. Let me have a look at the equation. Yeah, the equation seems to be correct. So this is pretty awesome.

I have another document, which is the CV of Andrew Ng, who is a co-founder of Coursera. So this is his CV, and I ran this whole folder through Marker. There's another paper called Orca. In this case, the paper has a table of contents, and it's arranged in a single column. And here are the outputs for both of them. The CV has a total of 21 pages. There are no tables or equations in there, and no images. And here's what the output looks like from Surya, which is pretty great. Here's a preview of the same thing.

So here is the JSON file of the Orca paper. Now, challenges with existing models, key contributions: I think these are different sections that the model is able to extract from the paper, which is pretty neat. One thing I noticed was that it broke this image into two parts. So you might want to have a secondary post-processing step on top of this Markdown to make sure that the images as well as the tables that it extracts are accurate. But as a first pass, it's pretty awesome that you can do this with an open-source model.
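One simple form such a post-processing step could take is checking that every image the Markdown references actually exists on disk. The sketch below does only that: `missing_images` is a hypothetical helper, it assumes images are referenced with standard Markdown `![alt](path)` syntax relative to the Markdown file, and it would not detect an image that was split into two parts, which still needs a visual check.

```python
import re
from pathlib import Path
from typing import List

# Matches ![alt](target) and ![alt](target "title"); captures only the target.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def missing_images(markdown_path: str) -> List[str]:
    """Return image targets referenced in the Markdown file that are not on disk."""
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")

    missing: List[str] = []
    for target in IMAGE_REF.findall(text):
        # Remote images cannot be verified locally, so skip them.
        if target.startswith(("http://", "https://")):
            continue
        if not (md_file.parent / target).is_file():
            missing.append(target)
    return missing
```

An empty result means every local image reference resolves; anything returned is a reference worth inspecting by hand.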
Now, here's the Markdown of the original Llama 2 paper, and it does a pretty decent job here as well.

So this was a quick video on how to get started with Marker, which will help you convert your PDF files into structured Markdown. It's an open-source project, pretty amazing, and it does a pretty decent job. And as we all know, data is an important aspect when it comes to LLMs, so I'll be creating some more videos around it. I'm planning on creating a video on how to scrape data from web pages. So if you're interested in these topics, make sure to subscribe to the channel so you don't miss any of the new videos. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
Info
Channel: Prompt Engineering
Views: 35,297
Keywords: prompt engineering, Prompt Engineer, LLMs, AI, artificial Intelligence, Llama, GPT-4, fine-tuning LLMs
Id: mdLBr9IMmgI
Length: 14min 10sec (850 seconds)
Published: Fri May 31 2024