Apple's New Multimodal AI BEATS GPT-4 Vision (New APPLE AI)

Video Statistics and Information

Captions
So we finally have some news from Apple regarding their machine learning and LLM work, in terms of what they've actually been developing. Apple has introduced a multimodal AI system that is pretty impressive, because it genuinely exceeds GPT-4's capabilities in some regards, and this might be the kind of scenario people have in mind when they say GPT-4 is no longer king. So let's take a look at exactly what Apple has introduced and how good this new multimodal AI system really is.

The system is called Ferret, built by Apple researchers, and it's mainly a vision model. First, it uses CLIP ViT-L/14 to understand what's in the picture and turn it into a form the computer can work with. Second, it takes the words you give it and converts them into a format it can understand. It then identifies areas in the image: if you talk about a specific part of the picture, like a cat in the bottom-left corner, the model uses spatial coordinates to find exactly where that is in the image. It is also smart about dealing with different shapes in the picture, not just simple boxes: for a free-form region, it samples many points in the area you're talking about and captures the features and location of each point. Finally, it fuses all of this information together to accurately find and describe the specific part of the picture you're referring to.
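To make the free-form region idea more concrete, here is a minimal PyTorch sketch of sampling points inside an arbitrary region mask and pooling image features at those points. This is an illustration of the point-sampling idea only, not Apple's implementation: the real model uses the CLIP ViT-L/14 encoder and a learned spatial-aware visual sampler, while the function name, shapes, and random features below are invented for the example.

```python
# Minimal sketch of pooling features from a free-form region:
# sample points inside a binary mask, look up image features at
# those points, and average them into one region embedding.
# Illustrative only; not the Ferret paper's actual sampler.
import torch
import torch.nn.functional as F

def pool_region_features(feat_map, region_mask, num_points=32):
    """feat_map: (C, H, W) image features; region_mask: (H, W) bool."""
    ys, xs = torch.nonzero(region_mask, as_tuple=True)   # pixels inside the region
    idx = torch.randint(len(xs), (num_points,))          # random points in the region
    H, W = region_mask.shape
    # normalize to [-1, 1] for grid_sample (x first, then y)
    grid = torch.stack([xs[idx] / (W - 1), ys[idx] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, 1, num_points, 2).float()
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return sampled.view(feat_map.shape[0], num_points).mean(dim=1)  # (C,) embedding

# toy usage: a 512-channel feature map and a roughly circular region
feats = torch.randn(512, 24, 24)
yy, xx = torch.meshgrid(torch.arange(24), torch.arange(24), indexing="ij")
mask = (yy - 6) ** 2 + (xx - 6) ** 2 < 16   # e.g. "the cat in the top-left"
region_embedding = pool_region_features(feats, mask)
print(region_embedding.shape)               # torch.Size([512])
```

The point of this design, as described in the paper, is that a handful of sampled points can represent any shape (a scribble, a polygon, a segmentation mask) where a single bounding box would also cover lots of irrelevant background.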
Essentially, what we have here is a really impressive, advanced image-grounding model, and on certain benchmarks it is compared directly against GPT-4; I also tested it myself to make sure it actually does exceed GPT-4's vision capabilities. First of all, there are some benchmarks you may want to look at. On the benchmarks for the Ferret model, we can see that Ferret supports all of the input types, which are point, box, and free-form. It also has very good output grounding, which essentially means it can state exactly where objects are in the image and how they relate to each other in the real physical world. Then, on data construction, we have GPT-generated data and robustness, and of course the quantitative evaluation of referring/grounding with chat.

This is actually very interesting, because in this section of the paper they didn't compare it to GPT-4 with Vision; they compared it to GPT4RoI (later in the paper they do compare it to GPT-4 with Vision, which I will show you). If we look at GPT4RoI, its paper describes "instruction tuning large language model on region-of-interest", and essentially GPT4RoI is a specifically fine-tuned model. In the benchmarks of the PDF, I'm guessing the researchers tested against GPT4RoI instead of GPT-4 Vision because GPT4RoI is specifically designed for understanding and interacting with regions of interest in images, which is a more advanced and specialized task than what GPT-4 Vision might be designed for. GPT4RoI's ability to combine language with detailed image analysis, especially focusing on specific areas within images, makes it a more suitable benchmark for testing the Ferret model's capabilities in fine-grained multimodal understanding and interaction. This comparison helps to highlight the advancement and specific strengths of the Ferret model in handling complex vision tasks.

We can also see the GPT4RoI benchmark here, so once you see what GPT4RoI is and why it was compared against the Ferret model, you can see why the comparison is meaningful. We have seen some of these things before (we actually did a video on Visual ChatGPT), and GPT4RoI handles image, region, and multi-region inputs, yet even after all of that, the Ferret model still surpasses it.

Here is where we need to look at one of the examples where they compare it to GPT-4. They ask, "What is the purpose of the object in region 0?", where region 0 is the highlighted yellow region you can see here. The ground truth says: the object is a shock absorber on the bike; its purpose is to absorb or dampen shock impulses. It does this by converting the kinetic energy of the shock into another form of energy, which is then dissipated; in the context of the motorcycle, it is particularly useful in providing comfort and stability for the rider, especially when traveling over uneven or rough terrain. You can see that LLaVA, the vision model, didn't get it right; Kosmos-2, Microsoft's multimodal model, didn't get it right; Shikra, another model, completely missed it; and then the Ferret model gets it 100% correct, saying the object is a shock absorber, and so on. This shows just how effective it is.

Now, like I said, I did actually put this image into ChatGPT and asked, "What is the purpose of the highlighted region on the bike?", and it answered that the highlighted region on the motorcycle is where the exhaust pipes and muffler are located, which is completely wrong. I do want to state that I tried this many, many times, with many different phrasings, and ChatGPT didn't get it right at all. Maybe you could prompt it better than I did, but in this zero-shot setting it's simply not that effective at answering questions about specific regions.

The paper also includes further comparisons between Ferret and GPT-4, in the section on referring and grounding. For example, they ask Ferret, "What is region 0 used for?", and it says the object is a pipe used for transporting exhaust gases from a motorcycle, which is correct. For the second region, region 1, it says the object is a shock absorber, which is also correct. They then tried GPT-4, prompting it in two ways, much like I did. When asked about the red object, it did get the exhaust pipe or muffler correct, but when asked about the object in the red circle, it didn't get it right, which is something I also encountered. They also tried GPT-4 Vision with explicit coordinates, since that might be more accurate, and they did get a more accurate answer on the first part of the question, but on the second example it once again completely fails. I would like to see many more examples, because a single region on a motorcycle isn't exactly a comprehensive test.
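To give a feel for what "prompting with coordinates" means in comparisons like this, here is a small Python sketch of building a coordinate-based referring prompt. The tag format, function name, and pixel values are invented for illustration; the actual token schemes used by Ferret, GPT4RoI, and the paper's GPT-4V prompts differ from this.

```python
# Hypothetical sketch of a coordinate-based referring prompt.
# The <region>...</region> syntax is made up for illustration;
# real systems use their own learned region tokens or prompt formats.

def referring_prompt(question: str, box: tuple, image_size: tuple) -> str:
    """box is (x1, y1, x2, y2) in pixels; image_size is (width, height)."""
    w, h = image_size
    x1, y1, x2, y2 = box
    # normalize pixel coordinates to [0, 1] so the prompt is resolution-independent
    norm = [round(x1 / w, 3), round(y1 / h, 3), round(x2 / w, 3), round(y2 / h, 3)]
    return f"{question} <region>{norm}</region>"

# e.g. asking about the shock-absorber region on the motorcycle image
print(referring_prompt(
    "What is the purpose of the object in this region?",
    box=(410, 520, 540, 690),      # made-up pixel coordinates
    image_size=(1024, 768),
))
```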
What was also fascinating was the grounding. Here we can see a popular CAPTCHA, the kind we all frequently see when signing up for websites or signing in to applications. The prompt says to detect all the traffic lights, and Ferret actually shows us where the traffic lights are. ChatGPT then tries to detect the traffic lights and gets it wrong: the traffic lights ChatGPT highlights are in places where there aren't any traffic lights at all. I think this is remarkable, because in terms of referring, GPT-4 Vision falls short in understanding relatively small regions, and similarly for grounding, GPT-4 Vision fails to localize relatively small objects in complex scenes and specific regions.

For grounding, the authors follow the Yang et al. prompt and ask the model to localize objects in the image using bounding boxes, given the image size (width and height). As they observed, GPT-4V is able to understand referring to a certain extent, whether via a colored region in the image or coordinates in the text, but compared to Ferret it falls short in precisely understanding really small regions. On the other hand, the paper notes that GPT-4 Vision is more knowledgeable in common sense: for example, it can further point out that the exhaust pipe can reduce noise, and GPT-4's linguistic capabilities are much more advanced. As for the grounding we see at the bottom here, Ferret excels at identifying most traffic lights even in cluttered scenes. As the paper puts it, Ferret shines especially when precise bounding boxes for grounding are needed, catering to applications that require pinpoint accuracy in smaller regions, and this is precisely where Ferret steps in to fill the gap.

Overall, if we compare GPT-4 Vision to Apple's new multimodal Ferret model, it's clear that Ferret excels at accurately identifying small, specific regions in images, particularly in complex scenarios. GPT-4 Vision can recognize areas outlined in red or specified in text but tends to struggle with smaller regions; it is, however, knowledgeable and effective at general-knowledge question answering about image regions. Ferret stands out for its precision in pinpointing small areas, filling a crucial gap in detailed image analysis.
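Here is a hedged sketch of the grounding setup just described: tell the model the image size, ask for bounding boxes, then convert any normalized boxes it returns back to pixels for drawing. The prompt wording below is an assumption loosely in the spirit of the "Yang et al." style mentioned in the paper, not a verbatim quote of it.

```python
# Illustrative grounding prompt and box conversion; the exact prompt
# used in the paper's evaluation may differ.

def grounding_prompt(target: str, width: int, height: int) -> str:
    return (
        f"The image is {width} pixels wide and {height} pixels tall. "
        f"Detect all instances of '{target}' and answer with one bounding "
        f"box per line as [x1, y1, x2, y2] in normalized [0, 1] coordinates."
    )

def to_pixels(norm_box, width, height):
    """Convert a normalized [x1, y1, x2, y2] box back to pixel coordinates."""
    x1, y1, x2, y2 = norm_box
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))

print(grounding_prompt("traffic light", 600, 600))
print(to_pixels([0.12, 0.05, 0.21, 0.18], 600, 600))  # -> (72, 30, 126, 108)
```

The reason the image size goes into the prompt is exactly the failure mode discussed above: without a consistent coordinate frame, a model's box predictions for small objects in cluttered scenes drift badly.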
Now we can talk about some of the implications of this, because if this approach is very effective, and it might well be, we could end up with vision models that really do help with many tasks they weren't explicitly trained for. For example, there was a paper on early explorations of visual-language models for autonomous driving, which discussed how you could potentially use GPT-4's vision capabilities for driving on the road. Of course, there are already dedicated AI systems used for self-driving, and although we're not there yet, maybe GPT-4 could help, because it's essentially a kind of mini-AGI system that can interpret out-of-context scenarios. You can see here that it's able to identify certain things and describe the image and exactly what was going on. Essentially, they tried to get it to understand the traffic lights, and they also asked it, based on the image it was seeing, what its next action would be, and sometimes it got it right: red highlights the wrong understanding, green highlights the right understanding.

If we do get an image model that is really effective, these kinds of models could become even more effective than some of the AI systems we currently have in cars, and thus give us full self-driving capability. We know that just being able to identify scenarios isn't good enough, because what a lot of these car companies are facing is that not every scenario is the same: many training examples come from dry, very simple road conditions, whereas when things are out of context, when there is snow, all the judgments you need call for something like a mini-AGI system, which is exactly what Elon Musk has said, and which arguably can't be done with narrow AI systems alone.
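For a sense of what "probing a general vision-language model on a driving frame" looks like in practice, here is a hedged Python sketch using the OpenAI client's image-input message format. This is not the paper's evaluation code: the prompt wording is invented, the model name is a placeholder for whatever vision-capable model you have access to, and the image URL is hypothetical.

```python
# Sketch of asking a general VLM about a driving scene, in the spirit
# of the "early explorations" paper described above. Assumes an OpenAI
# API key in the environment; prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()

def describe_driving_scene(image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are assisting a driver. Describe the traffic lights "
                    "and any hazards in this scene, then state what the "
                    "safest next action would be."
                )},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# e.g. print(describe_driving_scene("https://example.com/dashcam_frame.jpg"))
```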
This could mean that maybe we're about to get some kind of huge update from Apple. I'm not entirely sure what they're working on, but this brings us to one of the big questions: where is Apple, anyway? They've got Siri, and they've been sitting on it for quite some time, and you might be wondering what on earth they are going to release, or whether they will ever release any kind of AI model or AI system. But I've got to be honest with you: Apple is a company that tends to wait, and this is the one time I think Apple waiting might actually be a terrible idea, because this is not a traditional kind of technology. This technology moves very quickly, and if you don't keep up, you can be left behind. Apple traditionally doesn't care much about what Samsung does; usually Samsung has the best features first, but Apple's customers are loyal and will wait for the features even if they're three years behind. In the usual Android-versus-Apple debate, people will say Android had it first, but Apple's die-hard supporters will simply say it doesn't matter. It will be intriguing to see what Apple actually does, because as we know, anything is truly possible, and ladies and gentlemen, Apple has finally decided to make its entrance into the generative AI space.

Apple recently announced something called Apple GPT. Apple GPT is an artificial intelligence language model rumored to be in development by Apple. It is expected to be similar to OpenAI's GPT-3 and aims to enhance Siri's virtual-assistant capabilities and other AI-powered features in Apple's products. The informal name "Apple GPT" suggests it could use a generative pre-trained Transformer model, the same kind of model that ChatGPT uses. Apple GPT started as an experiment by a small team of Apple engineers in 2022 and is currently limited to internal use, assisting with prototyping future features. So it's clear that Apple has realized the market is moving very quickly, and they have an entirely new set of platforms on which to deploy generative AI features: from the new Apple Vision Pro to the new iPhones, Apple has a variety of applications where they could deploy Apple GPT, and as stated, Siri seems to be getting a major upgrade.

There are some predicted features worth talking about. The most anticipated features of Apple GPT include, number one, better natural language understanding, which essentially means that when we talk to Siri and Siri talks back to us, the conversations are going to be a lot better than the monotone ones we currently get; this is something Apple hasn't really improved since the original release of Siri. Number two, improved text generation: as you know, when you're typing on your keyboard you get a bunch of suggested words, and if a generative pre-trained Transformer powers those suggestions, writing messages in iMessage is going to get a whole lot easier, and I'm pretty sure this Apple GPT will probably assist you in many other Apple applications as well, such as Notes, iMessage, WhatsApp, and any word-processing software. Number three, enhanced conversational ability, which could mean we might be able to customize our own versions of Siri, unique to each of us; that would be really interesting and a unique spin on what we already have with generative pre-trained Transformers. These features are expected to improve Siri's contextual understanding, provide more accurate responses, and enable more realistic conversations with users.

In comparison to other AI tools, Apple GPT is said to be quite similar to ChatGPT and Google Bard in performance and functionality, according to some sources. However, it's not publicly available yet; it's only accessible through a web interface for a select group of Apple employees, and according to several outlets, such as Bloomberg, Apple is expected to make a major announcement about its AI efforts in 2024. So Apple GPT is a language model rumored to be in development by Apple, and it seems that in 2024 we're going to get a major overhaul. We aren't sure when in 2024 this groundbreaking announcement is supposed to come, but like many Apple unveilings, it will probably happen at one of the live-streamed events Apple hosts throughout the year, either when revealing its latest products or during a standard keynote.

Now, essentially, what Apple has already done is upgrade autocorrect to the point where it uses a transformer language model. Before, Apple used an older, more basic form of machine learning to predict text, but now, building on the Transformer architecture that Google pioneered, the same thing that makes ChatGPT so effective and that OpenAI built its chatbot around, Apple is using a transformer for autocorrect word prediction: as you type, the model predicts likely next words and suggests them (a rough sketch of this mechanism follows below).
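Here is a minimal sketch of transformer-based next-word suggestion, the same basic mechanism described above for the upgraded autocorrect. It uses the small open GPT-2 model via Hugging Face purely as a stand-in; Apple's on-device model is different and its details are not public, and the printed suggestions are illustrative.

```python
# Next-word suggestion with a causal transformer language model.
# GPT-2 is a stand-in here; Apple's autocorrect model is not public.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def suggest_next_words(text: str, k: int = 3) -> list[str]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits       # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]         # scores for the next token
    top = torch.topk(next_token_logits, k).indices
    return [tokenizer.decode(int(t)).strip() for t in top]

print(suggest_next_words("I'll see you at the"))  # e.g. suggestions like 'end', 'door'
```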
Although this wasn't a major announcement on its own, it goes to show that Apple, as big a company as it is, is seriously paying attention to what is going on in the space; I mean, how could you not pay attention to the rapid rise of AI? There was also another small AI announcement that many people missed: Apple's new Journal feature. Journal is essentially a feature that lets you write down journal entries, but it is going to be powered by on-device AI; the term Apple actually used was "on-device machine learning". Essentially, your iPhone can create personalized suggestions of moments to inspire your writing. Apple also stated that suggestions will be intelligently curated from information on your iPhone, like your photos, location, music, workouts, and more, and you can control which sources the suggestions pull from. So essentially what we have here is an AI tool that is going to help you write more effectively by drawing on the data on your phone, such as your photos and many other sources.

One thing I found very interesting about this talk from Apple was that they refused to mention the term artificial intelligence, or AI. When you look at the transcript, you can see that AI isn't mentioned, but machine learning is mentioned seven times. Then there is this tweet from a user named Ethan Mik, and it makes a very valid point: Apple didn't address the dead end that is Siri in the age of AI. If you don't know what Siri is, for those who don't use Apple products, it's a voice assistant you can invoke by saying "Hey Siri", after which your phone wakes up and asks what you'd like to do (the voice can be male or female); it's quite like Amazon's Alexa, but for the iPhone. The problem is that when you ask Siri for a restaurant recommendation, which is exactly what he did, Siri's response is far weaker than what Microsoft Bing can do with the same prompt. Bing isn't voice-activated, of course, but it raises the question: in the age of AI, why is Apple declining to share any news or advancements? I do have an answer for that, and it's simply: autonomous products.

Apple has been actively acquiring a range of artificial intelligence companies in recent years, with the aim of enhancing the AI and machine learning capabilities of its products and services. The list of companies acquired by Apple includes Emotient, a startup that uses AI to read people's emotions by analyzing facial expressions; Turi, a small Seattle-based startup specializing in machine learning and artificial intelligence; RealFace, a cyber-technology startup whose facial-recognition technology can be used to authenticate users; AI Music, a startup that uses AI to generate personalized soundtracks and adaptive music; and WaveOne, a California-based startup that was developing AI algorithms for video compression. To name some others, Apple also acquired Shazam, SensoMotoric Instruments, Silk Labs, Drive.ai, Laserlike, Spectral Edge, Voysis, Xnor.ai, and many more.

These acquisitions have allowed Apple to tap into the expertise and technology of these companies to develop advanced AI and machine learning capabilities for a range of applications. For example, the acquisition of Turi in 2016 gave Apple access to the company's expertise in developing machine learning tools and platforms, while the acquisition of Xnor.ai in 2020 provided Apple with low-power, edge-based AI technology for its products. By investing in a wide range of AI companies, Apple has been able to stay at the forefront of the AI race and to drive innovation in the technology industry. The company has introduced a range of AI-powered features in recent years, such as facial recognition in the iPhone X and Siri's improved natural language processing, and it has continued to invest heavily in AI research and development. Overall, Apple's acquisitions in the AI space demonstrate the company's commitment to staying ahead of the curve: by leveraging the expertise and technology of the companies it has acquired, Apple has been able to enhance the AI and machine learning capabilities of its products and services, driving improvements in user experience, efficiency, and productivity.
Apple's extensive research into machine learning is a key part of the company's strategy for staying ahead of the curve in the technology industry. With a dedicated department focused on machine learning, Apple is able to invest heavily in research and development, driving innovation and pushing the boundaries of what's possible with this technology. One way Apple demonstrates its commitment to machine learning is by regularly publishing research papers that highlight the company's innovative work in the field. These papers cover a wide range of topics, from computer vision and natural language processing to autonomous systems and data privacy. One recent example of Apple's innovative work in machine learning is a program called FaceLit, which uses machine learning algorithms to create photorealistic 3D renders of a person's face using just two photos. This technology has significant applications in fields such as virtual reality, gaming, and film production, and it demonstrates the potential of machine learning to drive advances across a wide range of industries. Overall, Apple's heavy focus on machine learning is a testament to the company's commitment to staying at the forefront of the technology industry: by investing heavily in research and development, sharing its work with the wider scientific community, and pushing the boundaries of what's possible with machine learning, Apple ensures that it remains a major player in the AI race.
Info
Channel: TheAIGRID
Views: 159,883
Id: utTtrwW9GpM
Length: 22min 25sec (1345 seconds)
Published: Fri Dec 29 2023