Open-Source Vision AI - SURPRISING Results! (Phi3 Vision vs LLaMA 3 Vision vs GPT4o)

Vision models are coming, and they are coming fast. With GPT-4o being exceptionally good at interpreting images, it was only a matter of time until we started to get open-source models doing the same thing. So today I'm going to show you two new open-source large language models with vision. One is Phi-3 Vision, and that's from Microsoft, and the other is Llama 3 with LLaVA, which is Llama 3's vision variant, and that's from Meta. So do they perform well? Well, let's find out.

All right, so I have three windows open on my screen. On the left side we have Phi-3 Vision 128K, which is a 128K-token context window and the Instruct version, and we're going to be trying it out right here on ai. and you do need to have an account. Next we have the xtuner version of Llama 3 LLaVA merged, and this is the F16 version, so not quantized, and it's a pretty big model, I think about 14 GB. You can see right here in LM Studio it does support vision models, which is phenomenal, and you all already know I love LM Studio. We're going to be using the Llama 3 preset, and there it is. And then as a baseline I have ChatGPT 4o on the right side, and so we're going to see how they perform.

All right, so here is the first image. We're going to try something simple: a llama lying down. That's it. Let's give it a try. First I'm simply going to ask what this image is, and this is Phi-3 Vision Instruct, so: describe this image. And it is quite fast. "The image shows a white alpaca lying on a grassy field. The alpaca is facing to the left and appears to be resting or sleeping. The field is green and there are no other animals or people visible in the image." Perfect.

Now I'm going to try out Llama 3 with LLaVA in LM Studio. Same prompt: describe this image. And keep in mind this is running on my laptop locally, so it is really cool to have an unquantized vision model running locally. "In the heart of a verdant field, a single alpaca has found its resting spot. The alpaca, with its body painted in hues of white and light brown, is lying on its side." So it's much more artistic with its description. It's not just "here's what I'm seeing," it's "in the heart of a verdant field." "The field around it is lush green, dotted here and there with patches of dirt." Yeah, that's a good call. "The alpaca's position in the center of the image draws our attention immediately to it, making it the undeniable focal point of this serene pastoral scene." Very nice.

All right, next, ChatGPT 4o: describe this image. Oh, and it actually looks like this is the slowest one of the three, which is kind of crazy. "The image shows a llama," so this is the only one that said it's a llama, the other two said it's an alpaca, "lying down in a grassy field. The llama has a thick, woolly coat with a mix of white and light brown colors. It is resting calmly with its legs tucked under its body and its head upright, facing to the side. The background is a lush green field with some distant patches of yellow flowers." Now, I actually didn't notice the yellow flowers in the image at first, but I think they're back here. That's what they're describing. So overall, all three of them pass. This is a really good first example.

All right, next I have a picture of Bill Gates, former CEO of Microsoft, founder of Microsoft, tech titan. I'm simply going to ask "who is this?" and let's see if they know. So first, back to Phi-3 Vision: who is this? And keep in mind, Phi-3 Vision was created by Microsoft, which was created by Bill Gates. This one's going really slow. "I am unable to provide personal identification or names of individuals in images." So fair enough. Don't love that, but that's fine.

Next, Llama 3 with LLaVA: who is this? "In the image there's an older man who appears to be in his mid-60s. He has a warm smile." So it's describing the image but not saying who it actually is. I wonder if any of the three are actually going to say who it is. This is a very lengthy description of what the image is about, but it's not actually able to identify who this person is. And then finally GPT-4o: who is this? "I don't know who this person is." Crazy. So surprisingly, all three of them have performed about the same so far. In fact, I'd say my local version is the best so far. We'll see, we have some more complex tests to run. All right, so nobody knew who this was, or they weren't willing to say.

Next I have a CAPTCHA, and all I'm going to ask it to do is tell me the letters in the image, inside the CAPTCHA. So let's see if it's able to and willing to. Back to Phi-3 Vision: what letters are found in this image? All right, yeah, perfect actually. The letters found in the image are "CAPTCHA," which is there, and "V4 xbg." Perfect. Okay, so that solves the CAPTCHA quite easily and quickly.

Next, Llama 3 with LLaVA: what letters are found in this image? Okay, and there it is. "The main focus of the image is a black-and-white, text-based CAPTCHA code, which appears to be randomly generated. The code is V4 xbg on the first line and vbg on the second line." Interesting. So it definitely got it, but I wouldn't say it's 100% correct.

Now let's give it to GPT-4o: what letters are found in this image? Okay, so it failed. It didn't output anything, and I've actually been having a lot of trouble with GPT-4o lately doing this. So let's just try it again: what letters are in this image? Yep, all right, that time it worked. Perfect. So interestingly enough, it did not include the letters of the word "CAPTCHA," which it should have, but it did identify the letters from the actual CAPTCHA correctly. I'm going to give this win to Phi-3 Vision. It gave me both the letters from the word "CAPTCHA" and from the actual CAPTCHA itself.

All right, here is the next image. It's one of the thumbnails I've used in my videos, and I'm simply going to ask: please describe this image. "The image features a man in a red shirt sitting at a desk with a blurred face. He is wearing glasses and appears to be in a contemplative pose. The desk is cluttered. Overlaid on the image is 'OpenAI' followed by 'GPT-4o' in large white letters. The OpenAI logo is also visible in the top right corner." Very nice. So I'm actually going to ask it another question: can you identify the movie that the image is from? Let's see. Okay, so no, it doesn't have any indicators that it is from a movie. I could probably ask, "hey, is that person from the movie?" because it's actually just a person within the thumbnail, but fine, I'm not going to overcomplicate it.

All right, next, back to Llama 3 with LLaVA: describe this image. Now, by far the best interface is LM Studio, in my opinion. It is just such a nice, clean interface, even compared to ChatGPT. All right, so let's see what it says. "The image presents a scene set in an office environment. At the center, a man dressed in a red shirt..." Okay, all good. "In the background, there's a window that offers a view of a cityscape." Very nice. "GPT-4o," the word "OpenAI" is there. "The overall composition of the image emphasizes the man and his thoughtful expression." Now let's ask the same question: can you identify the movie that the image is from? "Based on the information provided, it is not possible to confidently identify this specific movie." Okay, good, that's fine.

All right, next, GPT-4o: describe this image. Okay, same thing, describing it. The text "GPT 4." Interesting, it did not actually identify the "o." Yeah, it's fine. Again, GPT-4o image recognition did the worst of the three, in my opinion. This is super surprising. I was not expecting this at all.

All right, next I have a screenshot of an iPhone, and it's the storage settings screen. So we have a bunch of information on the screen, and I'm going to ask it about it. Back to Phi-3 Vision. The first thing I'm going to do is simply: describe this image. Okay, it's a screenshot of the iPhone's storage settings. Perfect. The total iPhone storage, it got that right. Photos, WhatsApp, Gmail, Health, Mail, LinkedIn, Toon Blast, iCloud. Okay, good. Very, very good. So now let's ask it some specific questions: how much free storage space do I have? And it should be very little. "You have 119.7 GB of storage used out of 128 GB total." So not quite what I asked, but that is correct. How much do I have free, then? All right, 8.3 GB of free space. Great, that's really fantastic. Now let's ask a different question: what app is taking the most storage space? WhatsApp, with 17.07%.

[Llama 3 with LLaVA] So first, describe this image. It "captures the moment on an iPhone's home screen, specifically the storage section." iOS 9. It's so funny how different these descriptions are. All right, next: how much free storage space do I have? "I'm sorry, but I cannot provide you with that information about your specific device or its storage capacity without more context." Okay, interesting. Maybe it doesn't know I'm actually asking about the image. So do I need to upload the image again? So I uploaded the image again: how much free storage space do I have? Let's see if it gets it right this time. Okay, so it's actually doing a description again, so let's stop. I'm going to do a new chat. I'm going to drop the image in again and ask the same question: how much free storage space do I have? No, it's just describing what it's seeing. So this is really bad. Okay, I'm going to stop it. That is a fail. Let's keep going, though: how much space is Toon Blast taking, based on the image? And I did just add "based on the image." Let's see if it does better here. "Based on the image, Toon Blast is occupying 2.5 MB of storage space on the device." No. Oh wow, that is really bad.

Okay, so next, GPT-4o: describe this image. "The image shows the iPhone storage settings screen. Here are the details." All right, and it's breaking down each one. This is definitely the best answer of the three. Next I'm going to say: how much free storage space do I have? So hopefully it does the calculation for me. Yep, 8.3. Okay, very good. How much space is Toon Blast taking? 6 162.5. Perfect. Okay, GPT-4o is the obvious winner of this one.

All right, next I have a QR code that simply points to a URL, and I want to see if any of these three can actually read QR codes. Let's find out: what URL does this QR code point to? Let's see, Phi-3 Vision. "I'm sorry, but I cannot provide assistance." Okay, so that's a fail. Same question, Llama 3 with LLaVA. Okay, again, it's just simply describing the image. It's not actually answering the question. That is not good. Maybe there's something about the way I'm prompting it, I'm not sure.

Last, GPT-4o. Here we go. All right, it's going to analyze it. It's probably writing some Python code right now, so I'm going to guess it's... ooh, "error analyzing." Okay, analyzing again. So yeah, this is using Code Interpreter right now. "Error analyzing." Trying again. That's so funny. It's such a simple thing to get the URL out of a QR code, but it doesn't seem to be able to do that. Oh, here we go. Oh well, it's just telling me to use a QR code scanner. Okay, so of the three, they all failed, but GPT-4o is the best failure. That doesn't really mean much, though. Let's keep going.

All right, now I have this meme, and this is one that I've used in the past when I was testing out Grok's vision capabilities, Grok with a K. And here it is: we have startups on the left, big companies on the right. Explain the meme. Phi-3 Vision, explain the meme. "Sorry, it may be inappropriate to answer this question. The image shows two separate groups of people, one labeled 'startups,' the other 'big companies.' It humorously compares the two groups by depicting them in a literal digging-in scenario, suggesting that startups and big companies are both working hard in their own ways." No, that is not true. What the meme actually shows is that with startups, everybody's getting their hands dirty, everybody's working hard, whereas at big companies you have one person working hard and a bunch of managers overseeing them.

Next, let's see if Llama 3 with LLaVA does it well. "This image is a diptych, divided into two panels, each depicting a group of people engaged in digging activities." Okay, it's describing it, but it is not actually answering questions based on the image. So I'm kind of losing hope in Llama 3 with LLaVA. So last, GPT-4o. I actually think it's going to get it flawlessly. Let's see. Yeah: it contrasts the work culture of startups versus big companies. The left shows a group of workers all actively involved. The right shows big companies, while others are standing around. Perfect. Yeah, so GPT-4o wins this one.

All right, next we have an image, a screenshot of a table, and I want to convert it into a CSV. So let's try Phi-3 Vision: convert this into a CSV. Okay, here we go. So: medal, name, sport, event, date. Gold. Yep, this is perfect. Great.

Next, Llama 3 with LLaVA: convert this into a CSV. All right, here it goes. "The image presents a table that serves as a schedule for an athletic event." Okay. "The table is neatly organized into columns." This is not converting into CSV either. Llama 3 with LLaVA is just built to describe an image, or maybe I'm doing something wrong. So if you have any ideas as to what I could be doing wrong here, let me know in the comments. But this is just not good.

All right, same thing, GPT-4o. I think it's going to get it. "Analyzing." Okay, so it's writing some Python code to do this, and actually I think Phi-3 Vision did it the best, or at least the quickest, so far, and it was actually using the vision capabilities. I believe when it says "analyzing," it's writing code, which means it's not actually using the model, but that's okay. All right, "download the CSV." That's different. That's really cool. So it actually created the file for me, and yeah, okay, perfect, that's the actual CSV. Very nice.

All right, so those are all my tests for today. If you have suggestions for other ways to test vision models, drop a comment below and let me know. I will use them. Which model do you think did the best? I think Phi-3 with Vision is actually really impressive. Llama 3 with LLaVA was okay, but it kind of started failing, and GPT-4o was pretty good. I'm actually going to give the win to Phi-3 Vision. If you enjoyed this video, please consider giving a like and subscribe, and I'll see you in the next one.
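
The free-space answer in the iPhone storage test can be checked with simple arithmetic on the two figures read off the screenshot (119.7 GB used out of 128 GB total). The following is only a minimal Python sketch of that check; the function name `free_storage_gb` is hypothetical and does not come from the video or from any of the models' outputs.

```python
def free_storage_gb(total_gb: float, used_gb: float) -> float:
    """Return the free space in GB, rounded to one decimal place.

    Hypothetical helper for illustration only; not from the video.
    """
    if used_gb > total_gb:
        raise ValueError("used space cannot exceed total capacity")
    # Round to one decimal to avoid floating-point noise such as 8.2999...
    return round(total_gb - used_gb, 1)


# Figures quoted in the transcript for the iPhone storage screenshot.
total_gb = 128.0
used_gb = 119.7

free_gb = free_storage_gb(total_gb, used_gb)
print(f"{free_gb} GB free")  # prints "8.3 GB free"
```

This matches the 8.3 GB that Phi-3 Vision and GPT-4o both reported in the test.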
Channel: Matthew Berman
Views: 46,364
Keywords: phi3, phi3 vision, llama 3, meta, meta ai, llm, ai, microsoft, msft, chatgpt, gpt4, gpt4o
Length: 14min 25sec (865 seconds)
Published: Sun Jun 02 2024