LlamaParse: Convert PDF (with tables) to Markdown

Captions
Good morning everyone, how's it going today? Welcome back to the channel. In today's video I am going to show you how to parse a PDF file and convert it into a Markdown file. This is very interesting because we're not going to be using a regular method like OCR or the regular PDF loaders from other libraries. Instead, we're going to be using an API from LlamaIndex, and it is going to allow us to parse the PDF, including the more complex parts of it, like tables.

If you have been building RAG applications for a while, you know that parsing tables can be a headache if you're using simple methods like OCR to parse your PDF documents. The reason is that when you use OCR on this kind of document, you get a simple line of text for each row of your table. In a table, of course, each element carries information related not only to its row but also to its column, and simple OCR is not going to preserve that information. If you send just the row text to your language model, the language model is most likely not going to be able to interpret it.

So the idea here is that we're going to send this document to the LlamaParse API, and in return we will get a Markdown file that looks something like this. As you can see, it has the tables in Markdown format. This is the kind of format that you can send to your language model, and it is going to be able to interpret it, which then allows you to build a chat-with-your-PDF application that works with tabular data. Okay, so let's actually do this with LlamaParse. [Music]

All right, so just a quick explanation of what LlamaParse is. It is an API by LlamaIndex: you send your files to this API and it returns the structured data from the file. Something pretty interesting about this ingestion method is that they use 
generative AI during the ingestion process. In regular RAG frameworks or RAG workflows, the generative AI comes at the end of the process, when you send the data to your language model and get the result back. What they do is use generative AI during the ingestion process as well, so that the document is understood a little bit better during parsing. They support several types of files: PDF, PowerPoint, Word documents, etc. I'm probably going to be making more videos about this if you're interested.

But yeah, the idea here is that it is an API, and you can basically just parse documents without having to create your own data transformation pipeline. You just send your document to their API and you get the structured data in return. Of course they have a paid plan, but they also offer a super generous free plan of a thousand pages a day, which is super cool. So I really encourage you to use this in your projects: if you're parsing fewer than a thousand pages a day, it's completely free. So let's get right into parsing this document.

All right, so we're at our Google Colab file, just making sure that you see everything that's going on here. What we're going to do is first install LlamaParse, and then download the file that we have right here. So the first thing to do is install LlamaParse, and to do this all you have to do is pip install llama-parse, and there you go. Of course, if you're creating your own application, don't forget to create your virtual environment before doing this. Once that is done, we can download the file that I showed you just a second ago, and to do this I'm just going to use wget, and there you go. So, just to be 
sure that you understand what's going on here, wget just downloads the file that I have right here, and I say that I want to download it into the apple_10k.pdf file. As you can see, I have it right here already, so there you go.

Now the next thing to do in order to set up this project is to initialize nest_asyncio. You're only going to have to do this if you're in a Jupyter notebook or a Colab notebook, because LlamaParse, which is the API that we're going to be using, uses async methods, and async methods are not going to work in a Colab notebook out of the box. So this is only for this to work in Colab; it is not necessarily related to the API itself.

Now let's initialize the API key that we just created, and we're going to name it LLAMA_CLOUD_API_KEY. It is important to name it that way because that is the name that LlamaParse is going to look for in your environment variables. Here you have my API key, which you can of course copy, but it's going to be useless because I will have already disabled it by the time this video is up. To be clear, this is the place where you put the API key that you can get from here. So you go to cloud.llamaindex. 
You create an account, go to API keys (let me just zoom in a little bit), and click on generate new key. This is where you create your API key, and then you paste it right here. Once that is done, we can actually start parsing our document, so let's do that right now.

Great. So in order to parse our file, the only thing that we need to do is import LlamaParse itself from llama_parse, like this. Then we can literally just call it, and it is going to return what we want. I'm going to call the result document, and for this one we just call LlamaParse like this. The first argument is the target format that you want your structured document to be in. It is result_type, and in our case it is going to be markdown. Then we do .load_data, and here we pass our actual PDF file. So I'm going to come right here, click on copy path for this file, and paste it right here, like that. I'm going to execute this, and it's probably going to take a little bit of time because... oh well, that was pretty fast.

Let's see what document looks like. As you can see, it is quite long. It is a list, and apparently it contains a Document with an ID, embedding None, metadata (you can of course add metadata as well within this method), and then we have the text property, and this is the text that is contained in the entire PDF. As you can see, we have tabular data right here, so that's looking pretty good. Let me show you real quick what it looks like. I am going to come right here and print 
document, and as you saw, this is an array or a list, and the first element is the only document that we have here, so I'm going to take index zero. As I showed you, the contents are within text, so I'm going to show you text, just the first thousand characters. Let's see what that looks like. And there you go: here you have the Markdown version of your PDF, with the tabular data in Markdown. That was pretty easy.

Okay, so good job so far. Now I'm going to show you something super cool, which is that you can add a prompt to LlamaParse to tell it what the document is about and what you want the parser to do with it, because remember that LlamaParse uses generative AI during the parsing process. You can even ask it to summarize the document if you want. I'm going to show you that in a moment.

Let me first export the Markdown file so that you can see what's actually going on. Here we named the result document. So, just to be clear about what is going on here: I am creating a new file called apple_10k.md and I am writing into it the contents of document[0].
text, which is basically the content that was extracted from the PDF. I'm just going to run this, and as you can see, I have the file right here. Let me download it and open it to see what it looks like. I'm just going to open a new window here, and the downloaded file is this one right here. Let me zoom in a little bit. As you can see, this is pretty much the same file as the original, and all the tabular data is in Markdown. So this is going to be super useful if you want to create a RAG application that performs chat with your PDF and your PDF has a lot of tabular data.

So there you go, that's good. Now let's focus on creating a parser that takes a prompt, and that's actually very simple. Let's go back here, and what we're going to do is add another element. We're going to call it documents_with_instruction, and we're going to call LlamaParse again, just like before. Let me just copy it, it's going to be easier like that. The result type is going to be markdown and we're going to load the data from the same file, but apart from this we're going to pass in another parameter, parsing_instruction. This is basically just a string, and you can pass in any kind of instruction you want: you can ask it to summarize the document, you can ask it to make a list of your tables, you can ask it whatever you want. In this case I'm just going to say "This is the Apple annual report". I'm going to run this, and it's probably going to take a little bit more time. In my experience, when you use a prompt it takes a little bit longer. Let me just pause the video and show you when it is done, and 
there we go: our documents with the instruction have been parsed. Let's take a look at them and see what they look like. I'm going to do pretty much the same thing as I did up here and export the result, but I'm going to export documents_with_instruction this time, and I'm going to save it into the Apple 10K instruction file. Now we should be able to come here and download it. Let's go to code, and here is our file, so let's see what it looks like. As you can see, it updated the title to reflect what I told the language model the document was about, so you can see that it's titled annual report. It didn't change many other things other than that.

But this can be very useful, especially if you have more complex data and you want, for example, a summary of it. Instead of doing all of the ingestion yourself and trying to summarize your own document, you can just use LlamaParse and ask it to summarize a given part, or the tables, etc., and this is probably going to give you better results than if you were doing it yourself. Let me just show you what it looks like in actual Markdown. So there we go: you can see that you have the tabular data right here, so everything seems to be working correctly. Pretty good.

So there you go, that was how to parse a PDF document using LlamaParse. I hope that you found it useful. Let me know if you're interested in more videos about LlamaIndex and LlamaParse in general, and I'll see you next time. [Music]
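
To make the table problem from the start of the video concrete, here is a small sketch in plain Python. The rows and numbers are made up for illustration (they are not taken from the Apple report), and the two helper functions are hypothetical names, not part of LlamaParse. The first mimics what a naive line-by-line extraction gives you, and the second renders the same data as a Markdown table that keeps the column labels.

```python
# Illustrative data only: these values are invented, not from the Apple 10-K.
header = ["Segment", "2022", "2023"]
rows = [
    ["Products", "100", "120"],
    ["Services", "50", "65"],
]


def flat_ocr_lines(rows):
    """Mimic a naive extraction: one line of text per table row, no header."""
    return [" ".join(row) for row in rows]


def to_markdown_table(header, rows):
    """Render the same data as a Markdown table that keeps the column labels."""
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in rows)
    return "\n".join(lines)


# The flat lines no longer say which number belongs to which year.
print(flat_ocr_lines(rows))
# The Markdown table keeps every value attached to its column.
print(to_markdown_table(header, rows))
```

In the flat version, "Products 100 120" gives a language model no way to tell which year each number refers to, which is the headache described in the video.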
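
The setup and parsing steps from the video can be pulled together roughly like this. It is a sketch, not the exact notebook: the function names parser_kwargs and parse_pdf are my own, the file path in the usage note is an assumed name, and the parameters result_type and parsing_instruction are the ones used in the video, so newer releases of llama-parse may name things differently. It assumes you have run pip install llama-parse nest_asyncio and that LLAMA_CLOUD_API_KEY is set in your environment.

```python
import os


def parser_kwargs(instruction=None):
    """Build the LlamaParse keyword arguments mentioned in the video."""
    kwargs = {"result_type": "markdown"}
    if instruction:
        # Optional prompt telling the parser what the document is about.
        kwargs["parsing_instruction"] = instruction
    return kwargs


def parse_pdf(pdf_path, instruction=None):
    """Send a PDF to LlamaParse and return the list of parsed documents."""
    if not os.environ.get("LLAMA_CLOUD_API_KEY"):
        raise RuntimeError("Set LLAMA_CLOUD_API_KEY before parsing.")

    # Imported lazily so this file still loads where the packages are missing.
    import nest_asyncio
    from llama_parse import LlamaParse

    # Only really needed inside Jupyter or Colab, where a loop already runs.
    nest_asyncio.apply()
    return LlamaParse(**parser_kwargs(instruction)).load_data(pdf_path)
```

With a valid key you would then call something like parse_pdf("./apple_10k.pdf") for the plain version, or parse_pdf("./apple_10k.pdf", "This is the Apple annual report") for the version with an instruction.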
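
The export step (take the first document, preview the first thousand characters, write the text to a Markdown file) might look like the sketch below. So that it runs without an API key, it uses a stand-in object with a text attribute instead of a real parsed document. The stand-in content, the export_markdown helper, and the file name are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def export_markdown(documents, out_path):
    """Write the text of the first parsed document and return a short preview."""
    text = documents[0].text
    Path(out_path).write_text(text, encoding="utf-8")
    return text[:1000]  # first thousand characters, as in the video


# Stand-in for the list returned by the parser (invented content).
fake_documents = [
    SimpleNamespace(
        text="# Annual report\n\n| Item | Value |\n| --- | --- |\n| Net sales | 1 |\n"
    )
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "apple_10k.md"
    preview = export_markdown(fake_documents, out_file)
    written = out_file.read_text(encoding="utf-8")

print(preview)
```

With real output you would pass the list returned by the parser instead of fake_documents and a path in your working directory instead of the temporary one.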
Info
Channel: Alejandro AO - Software & Ai
Views: 5,133
Keywords: llamaindex, rag, chat with pdf, chat with pdf gpt, research and publication, streamlit, rag chatbot, chat with pdf langchain, langchain, convert pdf to markdown, pdf to markdown, pdf with tables to markdown, llama2, retrieval augmented generation
Id: 7DJzHncUlpI
Length: 15min 55sec (955 seconds)
Published: Wed Jun 05 2024