I've got to say, guys, truthfully, OpenAI blew my mind on Monday. I don't know about you, but their real-time companion, their "Her" clone, shocked me, to say the least.

"I want to introduce you to somebody." "Hello there, cutie. What's your name, little fluff ball?" "This is Bowser." "Well hello, Bowser! Aren't you just the most adorable little thing?"

I did do a full video recapping the event, but as it turns out there is a lot more to uncover here than first meets the eye. For example, did you know that this model can somehow generate images? And gosh, they're the best AI-generated images I've ever seen, point blank, period. What's going on? There are also quite a few other capabilities that OpenAI just kind of kept under wraps.

So let's start out with what we do know. First of all, we know that the model powering everything under the hood, this insane real-time AI assistant, is called GPT-4o. The "o" stands for "omni", and the reason they called it omni is that it's OpenAI's first truly multimodal model. In simple terms, actually brought to you by GPT-4 itself: multimodal just means that the AI can understand and generate more than one type of data instead of just working with text. For example, GPT-4o can process images, it can understand audio natively, and it can even sort of interpret video.

The old GPT-4 Turbo was split into two or three separate models; I'm not precisely sure. It might have been taking images in natively, or it might have been using a separate model to parse those images into text. I don't really know. Either way, we know for a fact that it did not natively support audio. Yes, the old GPT-4 app did let you talk to it with your voice, but that used a separate model called Whisper V3 that would just take your audio and transcribe it into text. Don't get me wrong, it was great at taking your voice and transcribing it into text, but that is all it did. It can't hear the sound of birds, for example. It can't hear your dog barking. It can't hear your
tone of voice. This new model, for example, can understand your breathing patterns and even more, which we'll get into later.

"Just take a deep breath." "I like that suggestion. Let me try a couple of deep breaths. Can you give me feedback on my breaths? Okay, here I go." "Whoa, slow down a bit there, Mark, you're not a vacuum cleaner. Breathe in for a count of four." "Okay, let me try again. So I'm going to breathe in deeply and then breathe out for four." "And then exhale slowly." "Okay, I'll try again. Breathing in, and breathe out." "That's it. How do you feel?" "I feel a lot better."

And of course it can also understand the emotions that you put behind your words, which is possibly the most important part of this. It will react differently when you're sad, when you're excited, and when you're yelling and screaming at it. Very human indeed. This is uncharted territory.

The first mind-blowing capability I want to show you is the text generation. Models have been doing this for years, so you might think: so what, it generates text. Even the benchmarks were just about as good as the other leading models; it's not like it's leaps and bounds better. Even the context length is the same size. It's not a bad context length at 128,000 tokens, but it's no better. So what's the big deal?

Well, here's the rub on text generation with GPT-4 Omni: this model is lightning fast. And when I say lightning fast, I mean this thing generates something like two paragraphs a second, and the outputs are just as good as leading models while being multiple times faster. This opens up entirely new branches of what is actually possible with text generation, so let's dive into a few of them.

A bunch of these examples come from a Twitter thread by Min Choi that's linked down below. I always link Twitter threads down below if you want to check them out. I highly recommend following this guy, by the way; phenomenal AI account. And follow me on Twitter as well, because I am always
reposting great stuff.

So first up, this is Sawyer Hood's ultimate LLM test: ask it to make a Facebook Messenger as a single HTML file. GPT-4o does this in 6 seconds flat. Again, not only fast text generation but high quality: it actually works. You open up Facebook Messenger as a single HTML file. I mean, that's just absolutely insane, right?

GPT-4 Omni can also generate full-blown charts and statistical analysis from spreadsheets with a single prompt in less than 30 seconds. Zay here points out that this stuff used to take absolute ages in Excel, but it can now all be done automatically by your AI. And yes, the old GPT-4 Turbo could absolutely do this, but it couldn't do it this quickly, and it wasn't able to do it this accurately either. You start getting charts in about 6 seconds from an actual shoe company sales CSV file, and these charts aren't bad either. They're what I would consider usable in a real company meeting, and they're diverse, even giving you a summary with key insights. It's like an entire breakdown in 20 seconds. Fast, high-quality generation; this is leaps and bounds ahead.

Oh, and folks, you thought we were done there? Well, it gets even crazier. This is from tailin on Twitter: Pokémon Red gameplay. Essentially this is a custom prompt to make GPT-4 Omni play Pokémon Red as a text-based game. Watch this. As you can see, it essentially boots up Pokémon Red. Look at this: New Game, Continue, or Options. It's a text-based game. It even does its best to include pictures by using emojis, and it's so fast that you can essentially play the game in real time. We select A, and then it says, "To some people, Pokémon are pets. Others use them for fights." It's literally the Pokémon Red game, and you just keep entering your choice. Then you can actually put your name in; we're just going to use a custom name in this example. And it's like, okay, yep, following along here. The whole Pokémon Red game is converted into a text-based adventure game
like that inside of the LLM, and it's running in real time. Like, what the what is going on here? It even has Route 1 all laid out correctly, with the houses, Oak's lab, the beach. This is a very, very impressive example. You can see it even has Fight or Use Item, and you have the HP. You can essentially play an entire Pokémon Red game converted to text inside of an AI with just a little bit of prompting, which is absolutely mind-blowing.

This is more or less what's possible with the API. I'm sure you could get ChatGPT to do this with a special prompt or a custom GPT, but this one was done using the API instead. And I think that's what you guys have to realize here: this is more than just ChatGPT. People are going to be able to build some insane things. Imagine a new, from-the-ground-up game that lets you take a photo of your dog and then use your dog as the Pokémon, and the AI comes up with all of its abilities on the fly. The possibilities are endless. And by the way, guys, this is merely the beginning. How good will these models be in a year? Imagine when the text generation isn't just way faster and just as good, but way better and also way faster. The era of rapid AI development is upon us.

Oh, and speaking of the API, the new GPT-4 Omni is not only fast and just as good, it's also half the price of GPT-4 Turbo, which was itself cheaper than the original GPT-4. So we're seeing a rapid decrease in how much it costs to actually run these powerful models.

And folks, that's just text. Let's get into the audio generation capabilities of GPT-4 Omni. Now we're dipping our toes into the multimodal landscape; again, uncharted territory for sure. As we saw in the demo, it produces remarkably high-quality, human-sounding audio. The model is able to generate voice in a variety of different emotive styles.

"Hey ChatGPT, how are you doing?" "I'm doing fantastic, thanks for asking. How about you?" And I want you to tell
him a bedtime story about robots and love. "Once upon a time, in a world not too different from ours..." "But I want a little bit more emotion in your voice, a little bit more drama." "Once upon a time, in a world not too different from ours, there was a robot named Byte." "I really want maximal emotion, like maximal expression, much more than you were doing before." "Once upon a time, in a world not too different from ours, there was a robot..." "Do this in a robotic voice now." "Initiating dramatic robotic voice."

It's a way more natural way to interact with a ChatGPT-style model, but there's even more that OpenAI kind of kept under wraps. As Smoke-away points out, GPT-4o will be able to generate audio for any image you input, bringing your images to life: hear the sounds of a scenic landscape, hear the noises of a bustling cyberpunk city. The possibilities are endless. I'd like to note that it does seem a little hopeful that you'll just be able to speak to it and say, "Hey, can you generate this audio for me?" The model will probably try its best, but right now it seems to be more fine-tuned for voice. That doesn't mean it can't be fine-tuned for sound effects in the future. It's native audio generation, not just some robotic text-to-speech. It might even be able to generate music in the future as well.

Not only this: if we dive a little deeper, we'll note that on the OpenAI GPT-4o announcement page, under "Explorations of capabilities", they have meeting notes with multiple speakers. So we have a one-minute meeting:

"Okay, good morning, here's our first team meeting." "Morning." "Morning." "I'll be your project manager for today, for this project. My name is Mark, will be giving this presentation to kick the project off. During this project... the marketing expert... designer... I'm going to look at the technical design..."

And that's some bad audio, to be honest. I can barely differentiate the voices; it's not very clear. We basically just ask it
how many speakers are in this audio and what happened, and the output is actually able to determine it. GPT-4o says there are four speakers in the audio, and that it sounds like a project meeting where the project manager, Mark, is introducing himself and asking the team members to introduce themselves, and so on and so forth. We then go further and say, "Can you transcribe it with speaker names?" And yes, it's able to differentiate all those speakers.

So not only will it be able to understand your voice and your tone of voice in a very natural way, it will actually be able to understand what you sound like and differentiate you from other people, which is really big. That means you can have those multi-speaker conversations like we saw in the demo. I think a lot of people who saw that didn't really realize what was going on there, but it is indeed differentiating this person from the next person, probably based on the differences in how they speak. There are a lot of nuances there to uncover, and you don't really realize it all at first.

We've also got another sample, a lecture summarization, which is something AI has been doing for a long time. But this is quite a long lecture, 45 minutes of audio, and I've got to say it does a pretty darn good job giving the entire breakdown of the presentation. I really would have loved it if in this demonstration they had shown Whisper trying to do the same thing. Wrapping it all in one model allows it to reason about the audio where Whisper just can't, and that gives you this ability to recreate the presentation laid out right in front of you.

Furthermore, I want to think about when we actually get access to this thing. I'm going to try things like having it listen to a dog barking and saying, "Can you try to recreate that for me?" Because we can all try to bark like a dog, right? Will it sound like a human trying to bark like a dog? Will it actually bark like a dog? Will it be able to
hear when my dog is barking in the background? Will it be able to hear when a car goes by? Can it hear fire alarms and wake up someone who's deaf and be like, "Hey, you've got to get moving"? These are the questions we have, and I can't wait to get deeper access to this thing. It truly is so much more than meets the eye, so much more than what they actually showed off in that original demo video, and a lot of people unfortunately missed that. I wish they had gone into just a little more detail in their presentation.

So, as I mentioned at the beginning of the video, this thing can also mysteriously generate images. The folks at OpenAI absolutely do not call this DALL-E 4. This is not an iteration of the DALL-E model; this is GPT-4o. They keep insisting that it's the omni model, and this is just weird to me, because the image generation that GPT-4 Omni is producing is insanely good. The only conclusion I can draw is that, because this is a natively multimodal model, it has the connections of the text and the connections of the audio, so it understands the world in a much better way than a DALL-E 3 image generation model would. So the image generation capabilities are just way smarter, mind-blowingly smarter. Out of everything in today's video, I think this might blow the most minds.

We're going to start off with this tweet from Greg Brockman, the president and co-founder of OpenAI: "So much to explore with GPT-4o's image generation capabilities alone. Team is working hard to bring those to the world." So this means no image generation from GPT-4o yet, but maybe later this year if we're lucky.

Take a good look at this image, folks. It's doing some mighty impressive things. Not only does it look very photorealistic, but if we zoom in we can see a lot of really nice, well-written text that looks like someone is actually writing on a chalkboard. It reads: transfer between modalities, suppose we directly model P(text, pixels, sound) with one
big autoregressive transformer (which is a hint at what they did to make GPT-4 Omni), what are the pros and cons? You can see this looks like a guy who is writing right on the whiteboard, and he's got an OpenAI shirt on. There's a graph with compute going up, and for the most part it just looks like a zoomed-in photo taken on an iPhone. The only weird thing we see is the multiple whiteboards kind of duplicating at the top. One thing to note is that this is a pretty high-resolution image, higher resolution than what we get from DALL-E 3 as a direct output, for example. It's a really mind-blowing first look. At first glance you're like, no, there is no way that GPT-4 Omni is just generating images like this. But apparently it's true, and there are a ton of examples.

Again, guys, if we head over to that "Explorations of capabilities" section, we can see that most of these examples are for image generation. Take a look at this first one. Input: a first-person view of a robot typewriting the following journal entries: "Yo, so like, I can see now? Caught the sunrise and it was insane, colors everywhere. Kind of makes you wonder, like, what even is reality?" The text is large, legible, and clear. The robot's hands type on the typewriter. And what do you know, that's exactly what we get. This is a whole paragraph, guys, written out right on this typewriter: "Yo, so like, I can see now?" It's essentially a perfect paragraph. The typewriter looks great, and yeah, the robot hands, it's a first-person view. That's a very hard prompt. Try this in any image generator and you won't get anything close to this quality.

Folks, this right here is Ideogram AI, which I consider to be the best model at generating text that we have access to today, even better than DALL-E 3, and it honestly doesn't even come close. This example right here might be the closest one, but still no perfect text.

Now we prompt this thing again. We say, oh, the robot wrote the second
entry, now the page has moved up, and there are two entries on the sheet. So we keep that first paragraph, all coherent, and then we get a second one as well: "Sound update just dropped, it's wild. Everything's got a vibe now, every sounds like a new secret." So it screwed up a little bit there. "Makes you think, what else..." It's near perfect. This is a lot of freaking text. You'll also notice that, while we unfortunately don't see the robot's hands, it is the exact same typewriter, just a little bit zoomed in. I don't even know how it's accomplishing this at the moment. I guess it's just because it's multimodal. Is that really the answer? Now we say the robot was unhappy with the writing, so he's going to rip the sheet of paper, and there you go: he rips it right in half. This honestly might be the most impressive of all.

Oh, and don't worry, folks, it gets even crazier. We ask for a cartoon mail delivery person and it generates this. This doesn't look like a great generation; DALL-E 3 could do better, right? But here's the crazy part. We re-upload that image as an attachment and say, this is Sally, and she's a mail delivery person; can you make Sally about to deliver a letter? And it produces a consistent version of this character delivering a letter at the door, in the exact same art style. Now she's being chased by a golden retriever. Now she's tripped. Look at the consistency here: it's the same art style, and it looks like someone made the cartoon themselves. And now she's befriended the dog, etc. Here she is in the mail truck. It's absolutely nuts. These are just the possibilities of a multimodal GPT-4 Omni AI, and I can't believe they didn't show this off in the demo. I can't believe this was kept under wraps.

We've also got some character design for Geary the robot, and this is very similar to that last example. We generate this initial image, and then we resubmit it and we say,
oh, he likes to play Frisbee, he likes to work on the computer, he's riding a bike, etc. We get all these similar outputs, and the character is extremely consistent over time. I guess this is the solution to consistent characters: just have one multimodal AI that can do it all. Folks, isn't that freaking mind-blowing?

We can also upload a poem and literally convert it into something that looks like a handwritten poem. Now we can make the poem in dark mode as well, and this is the exact same poem but inverted. It's pretty much exactly the same. It looks more like a human recopying stuff than anything else, which is just super creepy. "Remove the outlines from the notebook paper now." Imagine if we submit our own photos; what can it do with that? And to think this was all hidden. It has way, way more examples of this stuff, again doing dark mode, this time with color instead.

Here's a commemorative coin design for GPT-4o, and you can see that they were working on this about 5 months ago, back in 2023. That's a nice little commemorative coin design. We even submit the GPT-4 logo and say we want to base it off of this. Not only that, it's able to produce the image at an insanely high resolution, giving us some hints at more multimodal capabilities: art capabilities, speaking abilities, vision capabilities, hearing capabilities. This kind of looks like it means multimodal. So this is like an updated coin for the 2024 release.

We can also upload this photo of a young man with a beard and say, "Can you make it a caricature for a T-shirt?" It absolutely does that, no questions asked. Again, multimodal capability kind of leapfrogs all the previous developments we made with traditional image generation. And we can do this yet again, and it does a really freaking good job. It looks like a human made it, in this very creepy sense. Over and over again, the capabilities, like I said, are just absolutely endless. I mean,
when does it stop, OpenAI? And why was all of this stuff hidden when it's clearly some of the most impressive capabilities we've ever seen from AI to date? It's really weird to me that all this stuff was just hidden.

Oh yeah, and things get even crazier. We can actually create entire fonts with this thing as well, and they come out pretty much perfectly. So if you're a font artist, I feel bad, because this thing is ridiculously good at creating brand-new fonts for you to use on the fly. The future is truly generative.

We've also got the ability to upload both a logo and a photo you took of something and say, "Can you do a mockup of a brand advertisement?" This just takes it to yet another level. This is something we have been able to do with current solutions, but not all with one model at once. How fast does it generate this kind of thing? When will we get access to it? I mean, what is this, OpenAI? You're telling me that you just have these capabilities in this one giant multimodal AI? We worked really hard to get this with traditional tools, and I still don't think it's this good. That's one hell of a mockup. It looks like someone saw both of these images and then tried to imagine what it would look like in their mind's eye, except here we can see the AI's mind's eye.

Again, here's more: poetic typography, multi-line rendering. This is similar to the typewriter example, with two chat bubbles and the robot texting someone on the screen. Even the keyboard is accurate here; we've got the emojis down there. This is just absolutely nuts to me. This is so far beyond anything we've seen before, and OpenAI hid it inside of the website.

Oh yeah, it gets even crazier, by the way: an image depicting three cubes stacked on a table, and obviously we say it's GPT with the correct colors, and it does this pretty much perfectly every single time. This is what they're
showing you here: that it can get it right every single time. This is something that Stable Diffusion 3 or Ideogram AI were showing off as, like, oh, we can do this every so often. This gets it right every single time, so it's way smarter, and it has to be because it's multimodal, right? Why didn't they explain this? Why wasn't this in the presentation?

We can also upload the OpenAI logo and ask for a concrete poem in the outer shape of the OpenAI logo composed of the word "omni", and it absolutely does that. It creates the OpenAI logo out of the word "omni". What the what is this? This is so far beyond any image generation capabilities we've ever seen before, and it's hidden in the website. I'm sorry if I'm getting repetitive here, but this is where my mind gets blown.

Oh, and you thought we were done there, right? Nope. This thing can also generate 3D. Since when? We only get one example of this, but it's very interesting. It looks like it generated images and then converted them to 3D somehow, maybe using code. I don't know exactly how this worked, but you can see it can do actual 3D generation, and it was reconstructed from six generated images. And it can do this again, but with a seal instead. It just shows you how far ahead OpenAI really is. I'm sorry, but you can't tell me that Google is this far ahead. You can't tell me anyone else is this far ahead. They're doing this all with one model. Again, one model.

I figured I would also include this with the 3D generation segment: Mina used GPT-4o to create an STL file for 3D model generation in about 20 seconds, and you can see it actually creates a 3D model of a table. This is still technically text generation, but it shows that you can use text to create 3D objects. It shows you the power of these models. It's shocking.

I know this deep dive is getting a little long, but we've still got to talk about image recognition. Yes, this is image
recognition that we've had for a while, but it is a little bit better than the previous image recognition we saw, and it is also way, way faster. And what is video? Well, it's a bunch of consecutive images, so it kind of also has video understanding to a degree, and we'll talk about that next.

This is a nice little example by etherica, asking GPT-4o to take a crack at undeciphered languages. Essentially these are manuscripts from, you know, Mesopotamia or something: the Minoans, Easter Island glyphs, a disc found in Crete. GPT-4o is able to use its advanced image recognition capabilities to kind of decipher these in some capacity, or to the best of its abilities. It uses logic and reasoning to try to understand them. It feels like, oh, I have this super-genius companion that I can use for any odd task in my life.

And here we can see tldraw in a notebook connected to the new GPT-4o vision API. The video is at its original speed, showing you how fast it's able to interpret everything that it sees. In about 5 seconds, GPT-4o is able to use code to essentially recreate all of these drawings. We draw a squiggle and it creates a graph with a squiggle. We draw a spiral and it does essentially the same thing, creating a little spiral for us with code. And of course it's also able to create "hello world" for us. It does all of that in less than a minute.

Check out this 18th-century handwriting. I couldn't read that if I tried, but give it to the GPT-4o model and it can transcribe it with some very minor errors. So an almost perfect transcription, and how fast does it do this? About 5 seconds. Absolutely insane.

AI Breakfast points out that the GPT-4o desktop app, which is having a slow rollout on Mac and is apparently coming to Windows later this year, can read your screen in real time, which puts us one step closer to autonomous agents. So it can essentially be your little real-time coding buddy,
real-time anything buddy, real-time gameplay helper while you navigate an RPG, let's say. So many possibilities. Here you can see a homework-style example where it answers three math questions directly from the browser, and it got the answers right with great reasoning.

Jacob also points out that you can feed it a photo of missile wreckage and it's able to point out the missile's origin: "Tell me your quick guess on where this missile was produced. Three bullet points to prove it. Be very short." And then it names a North Korean origin, citing the construction materials, the design features, and the context. I mean, could I show you this photo and say, hey, where's this missile from? I wouldn't even be able to tell you this is a missile in the first place. This thing's a professional.

Finally, guys, let's talk about video understanding. When it comes to video, GPT-4o really is treading into uncharted waters. It's not perfect by any means, but the model shows a somewhat promising ability to interpret something that resembles video, and we saw that in the demos. You can hold your phone out, show it things, and be like, "Hey, what is this?" and it can pretty reasonably help you out. It went as far as tutoring a child in real time, quite successfully.

"So I'm pretty sure this is the angle alpha right here, right?" "Correct. Now, looking at the triangle, which side do you think is the hypotenuse?" "I think it might be this one, but I really am not sure. This side, AC?" "You're close. Actually, side AC is called the adjacent side to the angle alpha."

However, what we don't have the ability to do yet is directly upload an MP4 file and have it natively understand that file. Even though image files and video files are quite similar, this model is not natively multimodal for that. You can show it a video if you want, and it will essentially take a bunch of pictures fairly quickly and try to understand what's going on, but it's going to have difficulty picking up on the small nuances, right? However,
what is really interesting is that OpenAI, as we know, is working on Sora, which is a very good text-to-video model, and Sora clearly has the ability to understand videos. So flip Sora on its head and what do we have? We have a model that can take in videos and then convert and understand them as text. So OpenAI is just one step away from having a model that can natively understand video.

Despite all of this, we have to understand that these AIs still have limitations, of course. But what's important to note here is that GPT-4o is this large multimodal AI that is incredibly fast, and you have to wonder what is going on at OpenAI. Have they developed some methodology for developing new AI technologies that we haven't seen before? Something is fundamentally different here, and I'd love to hear your thoughts on that. How far ahead is OpenAI, and how long will it take open source to catch up?

With that, folks, I hope you learned something here. I hope this was a little bit enlightening and dived a little deeper into GPT-4 Omni and how significant it truly is in the greater AI landscape, because it was a larger drop than I think a lot of people realized. Leave a like if this helped you out. Also check if you're subscribed; a lot of people aren't subscribed and still watch the channel, so I always try to remind people. And of course, check out the Discord server if you want to get a little more involved and active in the AI community as a whole. See you guys in the next one. Thanks for watching, and goodbye.
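
For anyone who wants to put numbers on the "half the price of GPT-4 Turbo" point from the API segment, here is a minimal sketch of the arithmetic. The per-million-token figures are the list prices as of the May 2024 launch and should be treated as assumptions that may be out of date; the dictionary and the `request_cost` helper are hypothetical names made up for this illustration, not part of any OpenAI library.

```python
# USD per 1 million tokens, list prices around May 2024.
# ASSUMPTION: these may have changed since; check the current pricing page.
PRICES_PER_MILLION = {
    "gpt-4": {"input": 30.00, "output": 60.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request (hypothetical helper)."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Same request priced on each model: 6,000 tokens in, 1,000 tokens out.
    for model in PRICES_PER_MILLION:
        cost = request_cost(model, input_tokens=6_000, output_tokens=1_000)
        print(f"{model:<12} ${cost:.3f}")
```

With those assumed prices, the same request works out to roughly $0.24 on the original GPT-4, $0.09 on GPT-4 Turbo, and $0.045 on GPT-4o, which is the "half of Turbo, which was already cheaper than GPT-4" trend described in the video.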
Published: Thu May 16 2024
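
The video segment describes the model handling video by taking "a bunch of pictures fairly quickly" rather than reading the file natively. A rough sketch of what that kind of frame sampling implies is below. The sampling rate is purely an assumption (the video does not say how often frames are grabbed), and both function names are hypothetical, written only for this illustration.

```python
from typing import List


def sample_timestamps(duration_s: float, frames_per_second: float) -> List[float]:
    """Timestamps (in seconds) at which still frames would be grabbed.

    ASSUMPTION: evenly spaced sampling, starting at t = 0.
    """
    if duration_s < 0 or frames_per_second <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    frame_count = int(duration_s * frames_per_second) + 1
    return [round(i / frames_per_second, 3) for i in range(frame_count)]


def can_fall_between_frames(event_duration_s: float, frames_per_second: float) -> bool:
    """True if an event is shorter than the gap between two sampled frames,
    so it could happen entirely between snapshots and never be seen."""
    return event_duration_s < 1.0 / frames_per_second


if __name__ == "__main__":
    # A 3-second clip sampled at an assumed 2 frames per second.
    print(sample_timestamps(3, 2))
    # A 0.2-second gesture can slip between frames taken 0.5 seconds apart.
    print(can_fall_between_frames(0.2, 2))
```

This is why a snapshot-based approach can follow a slow scene like the tutoring demo but, as the video says, may have difficulty picking up on small, quick nuances.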
