Extract Image & Image Info From PDF & Use LlaVa via Ollama To Explain Image | LangChain

Captions
Hello guys, I'm Sudarshan Koirala, and welcome back to Data Science Basics. In this video, let's see how we can extract images and image information from a PDF using unstructured. I have been creating videos on unstructured: the first one covered what unstructured is at a high level, and the second one showed how to extract table data from a PDF and summarize it with Llama 3 via Ollama. Many of you mentioned in the comment section that you want to see a video where we extract image information from a PDF.

This is how the video will proceed. First, we will pass the PDF into partition_pdf from unstructured and extract the image information. When we do this initial part, not all of the information is extracted, so we will modify the code and also extract the images themselves, separately, along with the image information. At last, as a bonus, I will show you how to explain an image using LLaVA via Ollama and LangChain, meaning that we are going multimodal here: we will pass the image into the LLaVA LLM, and it will explain what the image is all about. Let's get started.

Before going into the code: here is the youtube-stuffs repository, and inside it there is a data cleaning folder, which contains this unstructured image extraction from PDF notebook. You can clone this repository and then continue watching with me. I will open the terminal here. As you can see, I have already cloned the repository, and I have created a virtual environment and activated it so that the packages are isolated to this particular environment, which is really good practice if you are using Python. Then I'm going inside the data cleaning folder, as I showed you on the GitHub page, and these are the things inside it. Now I will open this in VS Code. Let me get VS Code from another window. So this is VS Code, and we are going to use this unstructured image
extraction notebook, so I will make this bigger. Here are some of the links I have been mentioning in all the videos; please go there if you want more information about unstructured.

In the setup, we first install the unstructured library. If I press Shift+Enter, this will take some time. You might be wondering what this %%capture is. I explained it in my previous video as well: it captures all the output that would otherwise be shown below the cell, so our notebook looks clean. The watermark package is here to print which versions of unstructured and the other packages we are using in this project. That makes things easier to track: if something doesn't work later, we can come back and see which version of a package was in use at the time. It is now installing all the necessary packages. After this there is another thing called warning control, because I don't want any warnings to be shown here; it is optional to run.

It is taking some time to install the necessary packages. Once they are installed, we can do the imports here, and these are the other imports we need. I have explained this before, but mainly we are using partition_pdf to explore different things. These other imports are for the unstructured client; I'm not going to go through that in this video, but if you are using the unstructured API, this is what you would use. All the things are installed now, so I can run this, import the necessary classes and functions from here, and load the watermark. Let me first check that everything is imported. I'm showing all of this step by step because some of you mentioned that it is quite difficult to follow each and every step when the cells have already been run. So I'm going to run each cell as I demonstrate it, so that if there are some
errors, I can explain what they are all about. This is loading the watermark. I can run this, and you can see that json version 2.0.9 is shown and unstructured_client is shown, but we don't see unstructured itself. So you need to be careful. For example, if I go here and import unstructured, and then run this again, it shows unstructured as well. The way watermark works is that a package needs to be imported already before its version can be reported, and sometimes it does not show unstructured. So if you want to show everything at once, import the main package first and it will be shown here. I hope that is clear now.

So the setup part is done, and we can go to the initial exploration. I will not go through this cell, because it is just printing the package contents. Next, as you can see, we have the data inside this data folder. In the data folder I have this GPT4All PDF paper; I have been using the same PDF for all of these videos. I will close this. I'm importing partition_pdf, which I have already done above, but showing it cell by cell is easier to explain. The file name is the GPT4All PDF under data/, we pass that file in here, and everything else is taken care of by partition_pdf.

What happens when we do this? I have explained it in my previous videos, but just to make sure for all of you: if I look at the length of the elements, 134 elements have been extracted. I can also see the contents of those elements. I'm printing elements here, and if you click this text editor link it will show you all of them, because the output is truncated in the cell. If you want to see what the unique element types being
extracted are, here it is just narrative text, title, and uncategorized text. But we want the information from the images as well. If you use the normal partition_pdf implementation without any strategy set, it will not return all the elements that are in the PDF. So how do we achieve that? I have written it here, and you can read it: we don't see image or table information. So I can say here that image information is not extracted.

We need to use a different strategy, the hi_res strategy. There are different strategies; you can go and look at them here. One thing to mention: using a different strategy means we want to extract more information from the PDF, which means that under the hood unstructured needs additional packages beyond the previous ones. So you might face some package installation issues here. It may complain that this package is not installed, that package is not installed, and so on, depending on the machine. I'm using a Mac, but you might be using Linux or Windows. Here are two links that I find helpful. This one is from the unstructured website itself, which recommends installing some packages on your system so that it is able to extract this information. If you still face issues, there is a Stack Overflow link that I found helpful when I set up my machine, so you can go through that too. If you don't get any issues, that's great.

Here you can see I am importing partition_pdf from unstructured.partition, and here is the hi_res strategy. If we run this now, it says: cannot import name partition_pdf from unstructured.partition. The reason it is not importing is this line, from unstructured.
partition import. Let me see what is happening here. What was the first one we used? From unstructured.partition.pdf, right. You can even use auto, but I need to use pdf here. So now I'm providing the hi_res strategy. When we use hi_res, as you can see, it reads the PDF file, and many different packages are used under the hood to extract more information. You can see there is some logging shown here as well, and it is using OCR with Tesseract to extract the information from this PDF. How long it takes depends on your PDF: how big it is and how many different elements it has. Some PDFs contain images, figures, tables, and so on.

Now it has been extracted. If I print the unique element types: before, there was just narrative text, title, and uncategorized text. Now, just by using hi_res, we get image, footer, header, table, uncategorized text, list item, figure caption, and so on. All the information in this PDF is extracted with the hi_res strategy.

We are focusing on the images. If we look at the images here, it says none for the first one, but there are images: you can see there are six image elements. You can even print the length. Nothing is shown at index zero because there is nothing in it. I have already run this once, so I know where we can get some information: not index six but index five, because indexing starts from zero in Python. There you can see "GitHub repo growth" and other text extracted from this image. If you go to the image in the GPT4All paper, you can see where this comes from: GitHub repo growth, and there is GPT4All, LLaMA,
Alpaca, and so on. So here you can see GitHub repo growth, Alpaca, and so on: it is getting the information from that figure or graph, whatever you want to call it. That is with just the hi_res strategy.

Now I will show you a slightly more advanced way. If you want to know the arguments or parameters that go into partition_pdf, write partition_pdf followed by two question marks and you can see the signature, with all the parameters you can pass into this function. Going down here: before, we had the image information from the PDF, but now I also want to extract the images themselves. I set path to images. Before this, I will remove the images folder, because I have already run this once. So now there is no images folder; I will show it side by side. I'm using images because I'm working in this directory, but you can provide any path where you want the images to be stored.

I'm calling this raw PDF elements, using the same partition_pdf. I'm passing the file name; extract_images_in_pdf is set to True so that it extracts the images; strategy is hi_res, and remember that we need to use the hi_res strategy; and infer_table_structure is True because we also want to extract the table information, but you can set it to False if you want. This one is only applicable when strategy equals hi_res. Then extract_image_block_output_dir equals path. To use this argument you need the strategy to be hi_res, otherwise it will not work. This is the information I said you can get from the signature, but if you want to go in depth and see what you need to pass, you can go to this link. If
you click this one, you can see the source code, and if you scroll down a little, it explains all the parameters: what filename is, what file is, what strategy is, which languages it supports, what the metadata options are, and so on. This is where you find notes such as "only applicable if strategy equals hi_res" for various parameters. So you need to go step by step, understanding what the code is doing under the hood, and then apply that in your own code.

Once I run this, the same thing happens as above, with all the same processing. But now the images folder will be created, and all the images from the PDF will be dumped into that folder. You can see the images folder has been created, and inside it are the figures from this PDF.

Now let's display them. With this code, I'm saying: go to the images folder and display all the images. I will run it, and you can see it has extracted one, two, three, four, five, six images. This is the first image it extracted, this is the second, and so on, one at a time, down to the last one. So now we have all the images used in this PDF, and we also have the image information extracted from the PDF.

Now let's go multimodal, meaning that we will pass an image to the LLaVA model and ask it to explain what the image is about. For that, we can use LLaVA from Ollama. If you are new to Ollama, I'm not going to explain what Ollama is here, but I have the
playlist here, and you can go through it. You need to install Ollama first and make sure it is running on your machine. We also need to install the necessary packages. I'm installing LangChain because LLaVA is going to be used via Ollama, and Ollama is going to be used via LangChain. Once this is installed, I can import Ollama from here. Again, here is the double question mark, which is the best way to see what is inside this class. It says Ollama, and you can see all the parameters or arguments, whatever you want to call them. You can see that by default it uses llama2, and this is localhost:11434, where Ollama is listening.

I want to show this because many of you mentioned in the comment section, "why is this error showing on my machine?" Whatever the video, the question is the same: it is not listening on the port. Why? Because you are not running Ollama. So first run Ollama, or open Ollama on your machine, and make sure it is listening on port 11434. If it is not, uninstall it, install it again, and make sure it is listening on this port first.

Now I can say llm equals Ollama with model llava:7b. I can use that because I have already pulled that model with Ollama. If I run ollama list here, you can see that I have llava:7b, which is 4.7 GB in size. It is already installed, and I have Ollama installed, meaning that I can use LLaVA here. So llm equals Ollama, passing the model as LLaVA; I will run this one. You also need to have PIL installed. I'm showing you this because when you run the next line of code, it will show an error if PIL is not installed. I am using version 10.3.0. Now the next thing
I'm doing here is convert_to_base64: I'm converting PIL images to base64-encoded strings, and then there is the plot image base64 helper. If I run this, you can see that I'm just plotting the image with this code. I'm using this particular image because the other images, if you look at them, don't contain much information; they are just random images. This one has some information in it, so the LLaVA model can give us some useful output from it. We want LLaVA to explain what this image is all about.

I have already run this here, so you can see llm_with_image_context. I will first run this, so we have the image here, and this is the output from when I ran this code before. What I'm doing is llm_with_image_context equals llm.bind: I'm binding the images and passing image_b64, which is the image converted to base64. I think it's easier to explain from here. This is the file path: I'm providing the images folder and this 4-6 image. The PIL image is Image.open of the file path. Now that we have this PIL image, I pass it into convert_to_base64, so it is converted, and this image_b64 can be passed into llm.
bind, and then we can invoke that. I'm saying here, "explain this image." It shows: the image shows a line chart representing the growth of GitHub repos over time. If you run this again, it may show you somewhat different information, because that is how LLMs work: they do not give you the same output each and every time. Sometimes you get an okay answer, sometimes you get a great answer, and that is where prompt optimization techniques come in, so that you know which kinds of questions give you the right answer. It takes some time to go and extract the information from this image. Here it is: the image is a graph showing the growth of GitHub repositories over time, and so on. You can go and read it here. So this is how you can extract the images.

Let me check whether I explained everything, and then let's summarize what we did. We had the PDF and passed it into partition_pdf. We went through three different approaches: without hi_res, with hi_res, and with hi_res while also extracting the images and saving them into a folder. Then we passed an image to LLaVA via Ollama and LangChain to explain that image. I hope this answers your questions about how to extract images and image information from a PDF.

Thank you for watching. If you are new, subscribe; I have more educational videos in the pipeline, and I will still be creating some videos with unstructured. Thanks for watching, and see you in the next video.
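
For readers following along without the notebook, the steps described in the video can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the notebook's actual code: the helper names (unique_element_types, extract_pdf_elements, image_file_to_base64, explain_image) are hypothetical, the parameter names are the ones mentioned in the video and may differ in other unstructured versions, and the base64 step reads the saved image file directly instead of going through a PIL image as the video does. It also assumes unstructured (with its hi_res dependencies) and langchain-community are installed, and that Ollama is running locally with the llava:7b model pulled.

```python
import base64
from pathlib import Path


def unique_element_types(elements) -> set[str]:
    """Return the distinct element class names (e.g. Title, Image, Table)."""
    return {type(el).__name__ for el in elements}


def extract_pdf_elements(pdf_path: str, image_dir: str = "images"):
    """Partition a PDF with the hi_res strategy and save its images to image_dir."""
    # Imported lazily so the other helpers work without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=pdf_path,
        strategy="hi_res",                         # needed for image/table elements
        extract_images_in_pdf=True,                # only applies with hi_res
        infer_table_structure=True,                # optional; also extracts tables
        extract_image_block_output_dir=image_dir,  # only applies with hi_res
    )


def image_file_to_base64(image_path: str) -> str:
    """Encode an image file's raw bytes as a base64 string.

    Assumption: encoding the saved file directly stands in for the video's
    PIL-based conversion helper.
    """
    return base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")


def explain_image(image_path: str, model: str = "llava:7b") -> str:
    """Ask a LLaVA model served by a local Ollama instance to explain an image."""
    # Imported lazily; requires langchain-community and a running Ollama server.
    from langchain_community.llms import Ollama

    llm = Ollama(model=model)
    llm_with_image_context = llm.bind(images=[image_file_to_base64(image_path)])
    return llm_with_image_context.invoke("Explain this image.")
```

A typical run would call extract_pdf_elements on the PDF, inspect unique_element_types on the result to confirm that Image elements now appear, and then pass one of the files saved in the images folder to explain_image. As noted in the video, the model's explanation can differ from run to run.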
Info
Channel: Data Science Basics
Views: 4,215
Keywords: unstructured.io, etl for llms, large language models, get your data rag ready, what is unstructured, unstructured api, langchain unstructured, clean pdf, python code unstrucured, unstructured data processing, unstructured.io api, unstructured.io tutorial, unstructured langchain, unstructured.io pdf, unstructured llm, unstructred.io excel, extract image data from pdf using unstructured, unstructured pdf, llama3, llava via ollama, multi-modal, llava langchain, image pdf ollama
Id: Ad-87wzJouk
Length: 21min 8sec (1268 seconds)
Published: Fri May 10 2024