The availability of good data can really make or break your LLM application. Most text data is available in PDF format, and this is true both for enterprises and for personal documents. But working with PDFs for LLMs is extremely hard. PDF is essentially a broken format: documents usually have a complex structure, with nested elements of different data types and absolutely no standard layout, which makes it very cumbersome to extract data from them. Different encodings, fonts, formatting, tables, and images add to this headache.
Now, to make PDFs LLM-ready, there are a number of approaches that people have been exploring: convert PDFs to plain text for easier parsing, use machine learning models to detect the layout of your PDF, and use optical character recognition (OCR) models to detect the text on the pages. It's a cumbersome process, and it's prone to errors. Working with Markdown, on the other hand, is very easy when it comes to LLMs because you can easily convert it to plain text. Markdown retains the original formatting, so you can have titles, headers, images, and tables in there, and LLMs can effectively process Markdown's structured elements.
So the goal of this video is to show you an open source tool that you can use to convert your complex PDF files into well-structured Markdown. If you want to convert PDFs into Markdown, there are some paid options like Mathpix, which will convert your PDFs into Markdown or extract readable text from them. If you're looking for open source options, there is Nougat, an open source project from Meta, but it's mainly focused on academic documents.
I personally find this new project called Marker to be extremely helpful. It lets you convert PDFs to Markdown quickly and accurately. And here is how it performs compared to Nougat: it's much faster. A page of text will take about a hundred seconds, compared to about 400 seconds when you try to do that using Nougat, and the accuracy is almost double that of Nougat as well.
So here's a quick example of what the difference in performance between Nougat and Marker can look like. This is a book called Think Python, and the authors tried to convert it using both Nougat and Marker. Here's the output that you get from Nougat: it completely ignored the first few pages along with the table of contents, but Marker is able to preserve everything. Here are the first couple of pages, then at the end we have the table of contents, which was accurately extracted, and after that we have the first chapter. In the case of Nougat, it confuses the headers and footers as part of the first chapter, and it also brings in the preface right in the middle of the first chapter. But for this specific document, Marker seems to be able to preserve the structure of the book.
So let's look at some features of Marker before I show you how to get started with it. It supports a wide variety of documents. It's optimized for books and scientific papers, but I have tested it on things like resumes and it works pretty great. It says it supports all languages, though I'm not sure what exactly the authors mean by all languages. It removes headers, footers, and other artifacts, which is evident when you look at that Think Python book. It formats tables and code blocks. It extracts and saves images along with the Markdown, so it will actually extract the images and store them separately. It will convert most equations to LaTeX, depending on how complex the equations are. And the great thing is that it runs on GPU, CPU, or MPS if you have Apple silicon. Now, if needed, it will do OCR on your text as well; it uses Surya, which is another package created by the same person who created Marker.
Now, there are some limitations as well, because PDF is a tricky format. Marker will not convert 100 percent of equations to LaTeX, which is understandable, and tables are not always formatted 100 percent correctly. I have actually noticed this, and I'll show you a couple of examples of where it fails, so you want to pay close attention to that. White spaces are not always respected, and not all line spans will be joined properly. So there are definitely some limitations, but in my tests it seems to work on most PDF files.
And it's an open source tool, but there are some limitations around usage. If your organization makes under 5 million dollars in gross revenue in the most recent 12-month period and has under 5 million in lifetime VC funding, you can use it in your commercial projects. If it's more than that, then you need to get a license, which is completely understandable because these open source projects actually take effort, time, and compute costs to build.
Okay, so enough talking. Let me show you how you can get started by installing this and converting your PDF files into structured Markdown. All right, so first we will create a new conda environment, and I'm going to call it marker. I already have a virtual environment by the same name, so it's going to ask me whether I want to remove the existing environment, and I'm going to say yes. This will delete the existing environment and then recreate a new virtual environment for me. Now, one thing you want to do when you create a new virtual environment is also install PyTorch.
Okay, so we have our new virtual environment created. I am going to activate it using conda activate marker. And as I said, you want to install PyTorch. This depends on your operating system; I'll put a link to the PyTorch page, where you need to select your operating system. I am currently on a Mac, so we're going to be installing using pip, and we have Python. So this is the command that I'm going to be using. If you're on Linux, here's the command you need to use, and on Windows, here is the command. The only difference is that I'm not going to use pip3; I will just use pip directly. So: pip install torch, torchvision, torchaudio. This will download PyTorch on my machine and install it.
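If you want to follow along in your terminal, here's a rough sketch of those setup commands. The Python version is just my assumption, and you should grab the exact PyTorch command for your platform from the PyTorch website.

```bash
# Create and activate a fresh conda environment for Marker.
# Python 3.10 is an assumption here; any recent Python 3 should be fine.
conda create -n marker python=3.10 -y
conda activate marker

# Install PyTorch. This is the Mac/CPU command; on Linux or Windows,
# grab the matching command from the PyTorch "Get Started" page.
pip install torch torchvision torchaudio
```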
Okay, so to install Marker, we're going to use the pip install marker-pdf command. If you want to do OCR on top of it, you can install ocrmypdf. It's an optional package, and here are the instructions on how to do that. In my case, I'm not going to install it; I think it will use the Surya package by default.
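The install step itself is just a couple of pip commands, something like this; keep in mind that ocrmypdf also needs system packages like Tesseract and Ghostscript, so check its own instructions if you go that route.

```bash
# Install Marker itself.
pip install marker-pdf

# Optional: install ocrmypdf if you want that OCR backend.
# Note: ocrmypdf also needs system packages such as Tesseract and Ghostscript.
pip install ocrmypdf
```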
Okay, so once the package is installed, you can either convert a single PDF file to Markdown or convert multiple files, and in those cases you are going to be using two different commands. We're going to start with converting a single file. For that you use the marker_single command in your terminal. You need to provide the path of the file that you want to convert and then where you want to store the output. There are also some optional parameters, for example the batch multiplier, the maximum number of pages you want to convert, and, since it supports multiple languages, the language the document is in, which will help improve the OCR process if it has to run. But all the documents that I'm testing are in English, so I'm not going to try this.
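So, roughly, the single-file command looks like the sketch below. I'm taking the flag names from the project's README, so double-check them against marker_single --help for the version you installed.

```bash
# Convert a single PDF to Markdown.
# Flag names may differ between Marker versions; check `marker_single --help`.
# --batch_multiplier trades extra VRAM for speed, --max_pages limits how many
# pages are converted, and --langs gives the OCR step a language hint.
marker_single /path/to/document.pdf /path/to/output/folder \
  --batch_multiplier 2 --max_pages 10 --langs English
```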
So for my experiments, we're going to be using three different files. Here is a scientific paper. It's laid out in two columns, and there are some images in there. Now, this is a relatively simple document: there are just two columns, the tables are well structured, so nothing weird in there, there are some images, and we also have some equations in LaTeX format. So let me show you how to convert this PDF file into structured Markdown.
Okay, so here, my file is in this pdf_files folder. We're going to use the marker_single command, then the path of the file, and then the output folder. I created an output folder which is currently empty, so it's going to create the Markdown and store it there.
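So the actual invocation for this run looks something like this; the PDF file name is just a placeholder for whatever you drop into that folder.

```bash
# Convert the paper sitting in pdf_files/ and write the result into output/.
# "paper.pdf" is a placeholder for whatever file you put in that folder.
marker_single pdf_files/paper.pdf output/
```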
Now, when we run this, it will first download the OCR model, which is Surya, if it's not already downloaded. Then it will try to figure out the layout of your PDF file: it runs a bounding-box detection over the PDF to find the different elements within the file, and that's how it's going to extract the text and also ensure that the text ends up in the proper position. So basically it first figures out the bounding boxes and, as I said, the structure of the whole document, and after that it extracts the text from them. This will take a couple of minutes depending on the length of your document.
So it took a few minutes and created a new folder inside the output folder. Let's see what it created. Inside the folder, we actually have all the images that were extracted from the document, which is pretty nice. There are five images in the original document and it was able to extract all of them, so this is pretty neat. Next, we have a JSON file with all the metadata. It detected the English language, the file type is PDF, and there are a total of 10 pages, which is accurate. Then I think it looks at the number of tables and the number of equations: it was able to parse seven different equations, and there are four different tables. So you also get this pretty neat JSON file, and you can parse it and check whether the information in there is actually correct.
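If you want to poke around the output yourself, something like this works; the exact file names depend on the input PDF's name, so treat the paths below as placeholders.

```bash
# See everything Marker produced: the Markdown file, the extracted images,
# and the metadata JSON.
ls -R output/

# Peek at the metadata. The file name follows the input PDF's name,
# so this path is a placeholder.
cat output/paper/paper_meta.json
```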
Now, here is the actual Markdown that you get. Here's the title of the paper, then the authors as well as their affiliations. It also correctly extracted the abstract, and overall it looks pretty neat. For images, it's basically providing a reference to where the image is stored, which is pretty neat, and I think it is also able to extract the equations as well, so that's pretty good. The tables in this case are nicely preserved. We can actually look at a preview of this: if I right-click on this and click on Open Preview, here is a preview of the paper in Markdown format, which looks well structured compared to the original document.
Okay, so here's the original document, and I have the preview open on the side. Now, I think it's able to preserve the relative locations: here's the related work section, with the related work text on top and that image right in the middle. Let me have a look at the equation. Yeah, the equation seems to be correct, so this is pretty awesome. I have another document, which is the CV of Andrew Ng, the co-founder of Coursera, and there's another paper called Orca. So I ran this whole folder through Marker.
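Running a whole folder uses the batch command instead of marker_single; here's a rough sketch of what that looks like, and the worker flag is something you should verify against your version's help output.

```bash
# Convert every PDF in pdf_files/ in one go with the batch command.
# The --workers flag name may differ between versions; check `marker --help`.
marker pdf_files/ output/ --workers 4
```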
Now, the Orca paper has a table of contents and it's arranged in a single column. And here are the outputs for both of them. The CV has a total of 21 pages, with no tables, equations, or images in there, and here's what the output looks like, which is pretty great; here's a preview of the same thing. And here is the JSON file for the Orca paper. Challenges with existing models, key contributions: I think these are the different sections that the model was able to extract from the paper, which is pretty neat. One thing I noticed was that it broke this image into two parts, so you might want to have a secondary post-processing step on top of this Markdown to make sure that the images as well as the tables it extracts are accurate. But as a first pass, it's pretty awesome that you can do this with an open source model. Now, here's the Markdown of the original Llama 2 paper, and it does a pretty decent job here as well.
So this was a quick video on how to get started with Marker, which will help you convert your PDF files into structured Markdown. It's an open source project, pretty amazing, and it does a pretty decent job. As we all know, data is an important aspect when it comes to LLMs, so I'll be creating some more videos around it; I'm planning a video on how to scrape data from web pages. If you're interested in these topics, make sure to subscribe to the channel so you don't miss any of the new videos. I hope you found this video useful. Thanks for watching and, as always, see you in the next one.