Okay. In the last video I talked about
some of the updates around Ollama. And one of the things I talked
about was that you can build a lot of little apps that actually do things on your local computer. And one of the ones I mentioned was about screenshots. People asked about that, so in this video I'm going to go through and show you how to do it. I'm going to talk about the
basics of what's going on. It's very simple. And at the end, I'll talk about
a different version that I'm using personally, which is a more advanced version of this. Okay. So the whole idea is that we've got a folder where our screenshots go, right? Most of us have a folder on our computers where screenshots are automatically saved. And over time that folder gets very big, or at least if you're me, it gets very big. And often I want a screenshot that I took a while back at a certain point in time. But the only way to find it is to actually go through and look at each of the screenshots. And while that works, what I'm going to show you here is that you can get one of the vision language models to automatically annotate or write captions for the images, which will make it much easier to find a particular image later on. So in this video, I'm just going
to show you a basic version of how to create the annotations using
Ollama, using the LLaVA 1.6 model. And then perhaps in another video,
I'll show you how to sort of integrate this with a RAG system. So that you can actually just use a
Q&A to basically query your screenshots and get the answers back that way. All right, so let's jump in and actually look at a diagram of how this all works. It's actually very simple. We've got a folder where we've got our screenshots. So the first bit of code is just something that goes to the folder, gets a list of the files out, and then sorts those, so that we've got them in some kind of order to use. The next thing is the main logic of this. This is really just going to load up the file. In my case, it's loading up PNGs; you could set it to load up different things. But because they were PNGs, I found that Ollama didn't seem to process the PNGs directly; it seemed more used to JPEGs. What you can do is just convert the file to bytes, and then it can handle that quite easily. So basically I load up the file, I convert it to bytes, and I then send it to the LLaVA 1.6 model. Now here, there's a few different
models that you can use. You can use the basic 7 billion parameter model, and you'll get okay results out of that. You can go up to the 13 billion
parameter model, which is the one I'm going to show you here. And then you could also go up to
the 34 billion parameter model. So in my experimenting, the thing
that I've found is that the 7 billion parameter model will often miss
things that are really obvious. And it can do a really good job on some
images and not as great on other images. The 13 billion parameter model
is definitely better at sort of having a bit more understanding of the image and stuff like that. And if you're looking for something specific in an image, you could actually put that in the prompt that goes along with it, and probably get decent results out of that. Obviously the 34 billion parameter model
is going to be the best one for this. But for a lot of people, either you're
not going to be able to run that, or it's just going to be insanely slow. For me, on this machine that I'm using, I've got 32 gig of RAM and I can run it, so it's not like you need a supercomputer to do this. But it's definitely slower than the 13 billion parameter model that I'm going to be using here. Now, when you send the image, you
want to send it along with a prompt. And you really want to customize the
prompt for your particular use case. So if you're just trying to index things, maybe the prompt is simply about describing what's in the image. I'll show you the prompt that I'm using when I go through the code. We then bring the results back from the LLaVA 1.6 model, and I basically just add them to a dataframe. So one of the things that I do early on is check if there is a CSV file in the folder. If there is, I just load it up and then check, okay, has this file already been processed? If not, we'll process it. If it has been processed, we just leave it and go on to the next image. Once we get the results back from LLaVA 1.6, we put that into the dataframe. And then finally, we just save the dataframe out to a CSV file. So I'm just showing you a really simple
sort of standalone version of this. You could obviously
save this to a database. You could save this to a vector
store, which is one of the things I'll talk about at the end. You've got a whole wide variety of
things that you could do in here. And then you end up finally with your
CSV file, which you could load into excel or Google sheets Or use it wherever
you want to use this kind of thing. All right. Let's jump into the code
and have a look at how this. actually works. Okay, so go through the code. It's pretty simple in here. We've got our imports up first. So I'm just bringing in Ollama
and then I'm going to bring in generate which we're going to use
for actually generating the return. We're then going to use glob to
basically get a list of the files. I'm going to use pandas
to make the dataframe. I'm going to use PIL to bring in an image. And then I'm going to convert it to bytes. So that just shows you
what I get on in here. So first up, we basically try
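Just as a rough sketch, the imports being described here look something like this (the exact layout is my reconstruction rather than the video's notebook):

```python
# A minimal sketch of the imports described above.
from io import BytesIO        # to hold an image's contents as bytes in memory
import glob                   # to list the screenshot files in the folder
import pandas as pd           # to build and save the dataframe
from PIL import Image         # to open the PNG screenshots
from ollama import generate   # the Ollama call we'll stream the description from
```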
So first up, we basically try to load a CSV file with the file name that we've got here; I'm calling it image_descriptions.csv. If that exists, we just load it into a dataframe so that we can add anything that's new, and it will be saved back out at the end. If it doesn't exist, we just make a new pandas dataframe and give it two columns: one is going to be the image file, and one is going to be the description. So when we run that, we've got our dataframe.
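A minimal sketch of that load-or-create step, assuming the file name and column names mentioned here:

```python
import pandas as pd

CSV_NAME = "image_descriptions.csv"

try:
    # If we've run this before, pick up the existing results
    df = pd.read_csv(CSV_NAME)
except FileNotFoundError:
    # Otherwise start fresh with the two columns described above
    df = pd.DataFrame(columns=["image_file", "description"])
```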
All right, so next we need to get a list of the files from the folder we're going to get them from. Basically here, we're just doing a glob of the folder path. In this case, I'm going for PNG files, just because I know that folder has nothing but PNG files in it. But if you were using JPEGs or something like that, you could change this, or you could use *.* to get everything and then put in a check to make sure each file is an image, that kind of thing. Alright, so we run that and we've got the list out, and I'm just going to sort it. Then, just for some debugging, I print out the first three images it gets, and we print out the head of the dataframe if we've got one. If we don't have one, we'll obviously just see an empty dataframe there.
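Roughly, that file-listing and debugging step could look like this (the folder path is just a placeholder, and df is the dataframe from the sketch above):

```python
import glob

# Placeholder path -- point this at your own screenshots folder
folder_path = "/Users/me/Screenshots"

# Grab the PNGs and sort them; swap the pattern (or add an image check)
# if your folder has other file types in it
image_files = sorted(glob.glob(f"{folder_path}/*.png"))

# A little debugging output: the first three files and the head of the dataframe
print(image_files[:3])
print(df.head())
```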
All right. So now we come down to the main part of this. What we're going to do is have a loop that goes through each of the image files; in this case, I'm just taking the first five. We check whether this image file is already in the dataframe, and if it is, we just skip it. If it's not in the dataframe, we process the image.
So this is the main function here, the one that processes the images. What it does is take in the path name of the file. It just prints that out to the console so we can see it; obviously you could turn these print statements off really easily. And then we're going to load up that image and convert it to a bytes format, so that we can just pass that in.
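A small sketch of that load-and-convert step using PIL (the function name here is just something I've made up for illustration):

```python
from io import BytesIO
from PIL import Image

def image_to_bytes(path: str) -> bytes:
    """Open a screenshot with PIL and return its contents as PNG bytes."""
    with Image.open(path) as img:
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        return buffer.getvalue()
```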
We then pass that into generate. You'll see this full response string; I'm just setting that up there. We're telling generate which model we want. Now, in this case, I'm using the LLaVA 13B 1.6 model, but I've actually got all three of the different sizes here. And if I was just running this as, say, a cron job or some sort of job that got run in the middle of the night, I could just go for the really big model and not worry too much about it taking longer. It doesn't take a huge amount of time anyway; it's more a question of whether your system has enough RAM to actually run it or not. And then I've got the prompt. So the prompt that I'm passing in here
is: Describe this image and make sure to include anything notable about it. And then in brackets, I've got: include text in the image. So the idea here is that if it sees some text, then that's probably going to be one of the most important things. Now, the challenge is that it often won't get that text right, especially with the smaller models. So this is something to be aware of, but you can, and really should, play with the prompt here for your particular use case. Sometimes I've got it looking for something in particular; perhaps you're trying to get it to do a not-safe-for-work check or something like that; play around with the prompt for those kinds of things. And be aware that the smaller models are just not going to be very good at certain things. All right. So that's my prompt. I then pass in the image here. This parameter just takes a list, so I'm passing in the image
bytes that we got in there. And I'm going to stream the response out here, just so that I can see the text coming through; I print out each chunk of the response as it arrives and add each of those streamed pieces into the full response. Then, at the end of this function, I'm just adding the image file name and the full response that we got back to the dataframe, as a new row. And finally, at the end of it all, we save this back out to a CSV file with the same name we started with. So if there was already a CSV file, we'll just be updating it; if there was no CSV file to start out with, then we're creating one here.
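Putting those pieces together, the processing function and the loop might look roughly like this. The model, prompt wording, and the first-five limit follow what's described in the video, but the exact code (and names like process_image, image_to_bytes, and the model tag) are my reconstruction rather than the author's actual script:

```python
from ollama import generate

MODEL = "llava:13b-v1.6"   # assumed tag -- check `ollama list` for the one you've pulled

PROMPT = ("Describe this image and make sure to include anything notable about it "
          "(include text in the image)")

def process_image(image_file: str) -> None:
    print(f"Processing {image_file} ...")
    image_bytes = image_to_bytes(image_file)   # from the earlier sketch

    full_response = ""
    # Stream the response so we can watch the description come through
    for chunk in generate(model=MODEL, prompt=PROMPT,
                          images=[image_bytes], stream=True):
        print(chunk["response"], end="", flush=True)
        full_response += chunk["response"]

    # Add the file name and its description as a new row in the dataframe
    df.loc[len(df)] = [image_file, full_response]

# Only the first five files while testing; drop the slice to do the whole folder
for image_file in image_files[:5]:
    if image_file in df["image_file"].values:
        continue                # already processed on a previous run
    process_image(image_file)

# Save back out, creating the CSV if it didn't exist before
df.to_csv(CSV_NAME, index=False)
```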
All right, so let's run it and see roughly how long it actually takes. You can see that there was no CSV file in this case, and that at the start it's processing the first image, so it's actually loading up the LLaVA model. And then sure enough, we can see that it's generating text out here. Now, this is what I mean by the smaller models not always getting the text right. That particular first image is for Gemini; it's a screenshot about Gemini Advanced, and you can see that it's actually written "Gemni". So they won't always get perfect
OCR or something like that. Obviously the bigger models will do better at some of this stuff. You can get some things that come out with quite nice results here. If we look at this one about the Google logo, it's done quite a nice job of interpreting it: a stylized version of a Google logo that features an abstract dragon-like creature with Chinese elements. So here it's done a really nice job of capturing what's in there, and actually on this one, probably a lot of OCR systems wouldn't pick up the "Google" very well. Now looking at the results, we
can see that, okay, it's basically done five different images here, and it had no CSV file to load. It's done five images because I had that set to five there. If I come out now, save it after changing it to run through all the images, and run it again, we can see that the CSV file already has those images done, so it doesn't need to do them again. And you'll notice that this time it was actually quicker to get going, because the model was already loaded in this case. You can run into some issues where you're trying to load the model twice and you've got it half loaded or something like that. In any of those cases, you can just quit out of Ollama and come back in and it should work fine, or you can go through and kill all the Ollama processes manually. But it will often be in the middle of something and then restart a new process. So you can see for a bunch
of these, it really has got the whole idea; there are a number of images in here of this sort of design, and it really does understand that this is like a CAD design, going through and working that out. Which will make it quite easy for us to find this, either by just doing a keyword search, or by actually using some kind of RAG as well.
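As a tiny example of that keyword-search idea, once the CSV exists you could simply filter it with pandas (column names as assumed in the sketches above):

```python
import pandas as pd

df = pd.read_csv("image_descriptions.csv")

# Find every screenshot whose description mentions "CAD", case-insensitively
matches = df[df["description"].str.contains("CAD", case=False, na=False)]
print(matches[["image_file", "description"]])
```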
And you can see at this point it's generating pretty quickly. Like I mentioned before, I'm not using a super fast Mac here, I'm just using a Mac mini, and it's going through and generating these descriptions. Just to show you some of these
images: one of them was a Walmart receipt that I got from online, and it's written that this image appears to be a receipt from Walmart, a large retail store, and that the receipt lists several items. Okay, so that one has done a pretty good job. What about the flying cat one? This image shows an orange tabby cat in mid jump; the front paws are extended out. So you can see that it is actually getting some of these quite nicely. And like I mentioned, it doesn't need to get them all perfectly for you to be able to put this into some kind of search or some kind of RAG. So how could you extend this and
make an advanced version of this? So one of the things that I've done to make a more advanced version is to also get things like the file modification or creation date and store those alongside the description. Then, when you put these into a RAG system, you can use them as metadata to do searches: say, I want this image, it was from December '23, and it's able to hone in on it. Especially if you've got a lot of images that are going to be very similar from time to time, adding in anything that can give you metadata can be really useful. So things like getting the modification date, or getting the owner or the username of the person who saved it; those kinds of things can be useful if it's not just for yourself as well.
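For example, a small, purely illustrative sketch of collecting that extra metadata per file might look like this; which fields you keep, and how you store them, is up to you:

```python
import os
from datetime import datetime
from pathlib import Path

def file_metadata(path: str) -> dict:
    """Collect a few bits of metadata worth storing alongside the description."""
    stat = os.stat(path)
    return {
        "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
        # On macOS/Linux, st_ctime is really the metadata-change time,
        # so treat "created" loosely here
        "created": datetime.fromtimestamp(stat.st_ctime).isoformat(),
        "owner": Path(path).owner(),   # username of whoever saved it (POSIX only)
    }
```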
But anyway, this project gives you a simple example of how you could do this, and hopefully you can see that it can be really useful. Okay, so in one of the next videos, I
think I'll look at how to add in a custom RAG with a fully open source local model, so that we can try that as well. As always, if you've got any questions or comments, please put them in the comments below. If you found the video useful, please click and subscribe. I'm going to be doing a
bunch more things like this. I'm currently working on a number of things to try and show people how to do function calling with some of the open models, and looking at the different results that you get from different open models for doing things like that. All right, I'll see you in the next video. Bye for now.