Okay, so OpenAI had this spring update, and out of that came the GPT-4o model, with the "o" standing for Omni. The first and probably one of the biggest things about this is that it's much more of a fully multimodal model than what they previously had. Even with the GPT-4V models, the vision models, they were limited to basically just taking images in. With this new model, not only are you able to put text in and get text out, you can put images in and get text out. You can also do things like voice in, akin to what Google's Gemini models can do. And there are even precursors of being able to do image in, image out, and even things like 3D out.
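Just to make the text-plus-image part concrete, here's roughly what that looks like through the API with the OpenAI Python SDK. This is only a minimal sketch: the image URL is a placeholder, and the audio side isn't exposed through the API at the time of recording.

```python
# Minimal sketch: sending text plus an image to GPT-4o via the OpenAI Python SDK.
# The image URL is a placeholder; audio in/out is not part of the API here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this photo?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```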
So OpenAI showed a bunch of things that were interesting around productizing this model, with the key thing being that they're making the play to actually make this best model available to people for free. That's a huge change, in that now people will be able to experience some of the really high-quality models without having to pay the $20 a month and without having to use some kind of API to access this kind of thing. I think that alone is going to change the whole startup scene and the people who are trying to build startups on this kind of thing, when everyone can now use one of the best models for free. You've also got to wonder what that can actually mean for OpenAI's business model. Will people who are paying the $20 a month keep paying it when they realize that perhaps they didn't need as many calls as they thought, and they can actually get away with the free tier?
Another thing that they introduced alongside this model was a whole new user interface: the desktop app, and using the desktop app to be able to do things. Now, this is a great way for OpenAI to basically get some training data from people as to what they're actually doing on their desktop and what they want to use a language model for on their desktop. And you've got to think that this is a precursor to some of the more advanced agents, and to taking on some startups like MultiOn, where it can basically access your web browser and automate things in your web browser. It's not too much of a jump to imagine that we go from this to either having a plugin or having a browser in the actual OpenAI app, which can be fully automated to do various tasks for you in the near future.
Now, they point out that this new model can be used just like the previous models in ChatGPT Plus for things like analyzing data, creating charts, chatting about photos, uploading files, and a whole bunch of these things. And don't forget that over the past couple of months they've been adding memory and much more personalization to that ChatGPT Plus interface.
Another benefit that comes with this model is that now, because it can take audio in and create audio out, you've actually got a much more powerful voice interface. In the past, OpenAI has been using separate TTS and ASR models: things like their Whisper model for doing the transcription, and then their TTS model, which they've made available for certain uses in the API, et cetera, though they certainly haven't released those models publicly like they did with Whisper.
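For context, that cascaded setup looks roughly like the sketch below with the OpenAI Python SDK: Whisper for speech-to-text, the chat model for the reply, and the separate TTS endpoint for speech out. It's just an illustration of the old-style pipeline that a natively audio-in, audio-out model collapses into one step; the file names and voice choice are placeholders.

```python
# Sketch of the cascaded voice pipeline (separate ASR + LLM + TTS) that a natively
# audio-in/audio-out model replaces. File paths and the voice are placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text with the Whisper API
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Text -> text with the chat model
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text -> speech with the separate TTS model
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```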
This new way of doing it allows for a huge advance in the whole concept of prosody in TTS. This is where we can get it to be much more emotional, we can get it to basically play different emotions, and we can get it to be much more dynamic in its range and in the speed that it speaks at, et cetera, right through to some of the demos where they show that it can actually do singing and sometimes even try to harmonize with itself as it's singing. So that voice is a big part.
Now, unfortunately, it doesn't look like that has come to the API yet. I'd love to think that sometime in the future it's going to come to the API, but I guess we have to wait and see. It certainly makes this whole new model much more powerful by giving it this user interface, which is essentially like the movie "Her". I think you've probably seen lots of people talking about the idea that this is now replicating that movie, where the lead character is talking to an AI girlfriend all the time and is able to get lots of information, get advice, and get updated on a whole variety of different things. You've got to think that's basically here now; it's just going to come down to how you do the prompting and what the sources of information are going to be, and stuff like that. And OpenAI does still seem to be pushing for the MyGPTs, so basically making a custom GPT on top of this, which is going to be pretty interesting to see where that goes now.
So the model capabilities are definitely outstanding from everything that we see here. We can see, going through these videos, examples of getting it to talk to itself, getting it to monitor things, and getting it to describe what it's seeing in a video. Now, we don't know the technical details, like how many times a second it is sampling from that video. In the live demos that they showed on the live stream, it definitely seemed to get certain things wrong: when the person accidentally had the back camera on, and there was a split second of video of the wooden table before it switched to their face, the model was still talking about the wooden table. That kind of thing makes me think that they're probably only sampling a reasonably low number of frames per second, just like Gemini and other models out there are doing for that kind of thing.
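That low-frame-rate sampling is also how you'd have to approach video with the current API yourself, since there's no native video input: grab a frame every so often and pass the frames in as images. Here's a rough sketch, assuming OpenCV is installed; the file name, the one-frame-per-second rate, and the cap on frames sent are all arbitrary choices.

```python
# Sketch: sample a video at roughly 1 frame per second and send the frames
# to GPT-4o as images. "clip.mp4" and the sampling rate are placeholders.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

video = cv2.VideoCapture("clip.mp4")
fps = video.get(cv2.CAP_PROP_FPS) or 30
frames = []
index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if index % int(fps) == 0:  # keep roughly one frame per second
        ok_jpg, jpg = cv2.imencode(".jpg", frame)
        if ok_jpg:
            frames.append(base64.b64encode(jpg.tobytes()).decode("utf-8"))
    index += 1
video.release()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Describe what happens in this video."}]
            + [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames[:20]  # cap how many frames get sent
            ],
        }
    ],
)
print(response.choices[0].message.content)
```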
The whole idea of going from text to image here is pretty amazing too. They said this is not available currently, but it is something that will be available in the near future, and I think it's looking like it's going to be stronger than the current DALL-E models. Really, in some ways, why would you have a DALL-E 4 if you can fold the whole thing into one model? And eventually, somewhere down the track, you could imagine that this kind of model could do something like Sora as well by combining these things.
So they've published a number of evaluations here. These are kind of interesting to look at to see, okay, how does it compare to the others? One thing that I find interesting is that they constantly compare this to GPT-4 Turbo, even though the other models shown are way below both GPT-4 Turbo and the new GPT-4o here. Again, this really is an example of: be careful with model evaluations. You really want to make your own set of evaluations that you can try things out on, and try to keep those private. Don't make the mistake that I made last year: when I was showing off all the evaluations I did, pretty soon people were actually training and fine-tuning on them so that their models would show up as quite good at those sorts of things.
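A private eval set doesn't need to be fancy, either. Here's a minimal sketch of the kind of loop I mean, reading your own question/expected-answer pairs from a local JSONL file; the file name, prompts, and the naive exact-match scoring are all just illustrative choices.

```python
# Minimal sketch of a private evaluation loop over your own question/answer pairs.
# "my_private_evals.jsonl" is a placeholder; each line looks like:
# {"question": "...", "expected": "..."}
import json
from openai import OpenAI

client = OpenAI()

def run_evals(path: str, model: str = "gpt-4o") -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": example["question"]}],
                temperature=0,
            )
            answer = response.choices[0].message.content or ""
            # Naive exact-substring scoring; swap in whatever grading you trust.
            if example["expected"].lower() in answer.lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0

print(f"accuracy: {run_evals('my_private_evals.jsonl'):.2%}")
```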
The next thing that I want to talk about is, for me, perhaps one of the most interesting things, and something that people haven't really been talking about at all: the whole language tokenization issue. They mention that the model has a new tokenizer, and that tokenizer is a lot better at multilingual things. If you think back, I made a video quite a while ago, I think sometime last year, about the different tokenizers, showing that something like GPT-4 with the old tokenizer really couldn't do a lot of the multilingual stuff because it just needed so many more tokens. And you can see here that they're showing that, for a lot of these languages, the number of tokens needed for a multilingual response has not just halved, but dropped to a third, a quarter, or even sometimes a fifth of what it has been in the past. That's really interesting going forward, and of course it means that the outputs are actually much faster for multilingual things.
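You can get a feel for this yourself with the tiktoken library, assuming you have a version recent enough to include the new o200k_base encoding used by GPT-4o alongside GPT-4's cl100k_base; the sample sentence here is just an arbitrary example.

```python
# Sketch: compare token counts for a non-English sentence under GPT-4's tokenizer
# (cl100k_base) and GPT-4o's new tokenizer (o200k_base) using tiktoken.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

sample = "नमस्ते, आप कैसे हैं?"  # arbitrary Hindi example sentence

print(f"cl100k_base: {len(old_enc.encode(sample))} tokens")
print(f"o200k_base:  {len(new_enc.encode(sample))} tokens")
```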
But the one thing that I don't see anyone talking about, which I think is the most interesting thing about this, is that normally, if you're going to go for a new tokenizer, you're going to train a new model from scratch. And if this is a model that's been trained from scratch, that means they've decided to make a whole new GPT-4-class model with this new tokenizer, and perhaps we're seeing the benefits of all the learning that they've had from making other models and making smaller models for testing things out. But the thing I'm not hearing anyone say is that this could not only be a fully new, trained-from-scratch model; it could actually be a very early checkpoint of GPT-5.
So one of the things that we're seeing now with a lot of the companies that are training models is that they're releasing version 1.5 of things. We've got Gemini Pro 1.5, where Demis Hassabis has actually said publicly that they didn't intend to release it, but they realized it was so good that they decided to release it. And you've got to see that this has also happened with some of the Chinese models, where we've had the Qwen 1.5 models and, just yesterday, the Yi 1.5 model. The reason is that as people are training a much bigger model, or a version of the model trained on a lot more tokens, they're starting to realize that, oh, okay, somewhere in the middle of there is a model much better than what we released previously, but hopefully not as good as the 2.0 version will be. So as this rolls out, we've got a number of model makers now that have these 1.5 models, which really are a stepping stone to the 2.0 models. Now, in OpenAI's case, you've got to think that this is a stepping stone to the GPT-5 model. I'm not saying that this is GPT-5 for sure; this is probably a model that they've used to try out some of the ideas that they're planning to do at a much bigger scale for GPT-5, at a GPT-4 size, or probably, I would say, smaller than the original GPT-4 size, which is how we get the faster responses, the costs going down, and stuff like that. Basically, any time you see cost go down at the moment, it's because compute costs are going down for these companies. People are not going out there making huge profits and then deciding to cut into those profits; they're looking at how to distill the models and, with that, be able to run a version of those models that can get similar results for a far smaller amount of compute spend.
So for me, this is one of the most interesting parts of this whole release: looking at it and stepping back and seeing that, okay, they've updated this tokenizer. And it is quite funny that they've updated the tokenizer just as Llama 3 has adopted the first hundred thousand tokens of the GPT-4 tokenizer. So while Meta have made their choice to basically catch up, OpenAI has realized that it's probably better to go to a new tokenizer to be able to deliver a lot of the multilingual stuff. And you could imagine that this is a test for the GPT-5 model, which I think is more likely than it actually being an early checkpoint of GPT-5, but certainly we're seeing some really interesting new things coming out with this model.
So later in the week, I'll have a play with this with code, and we'll talk about doing some of the things with code. Hopefully they will also release some of the interesting things around how to access the voice via code and doing some of that kind of stuff. I think it's going to be really interesting to see what third-party people actually do with that, rather than having to conform to the OpenAI way of using it.
Another quick thing on the multilingual side: I'm actually in the Bay Area at the moment, and a number of us were testing this model for multilingual things at dinner tonight. It's amazing that it's able to do live translation between a variety of different languages.
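In the app that happens with voice, but you can approximate the translation piece through the API today with nothing more than a prompt. Here's a minimal sketch; the language pair and the system prompt wording are my own illustrative choices.

```python
# Sketch: a simple two-way translation prompt, approximating the "live interpreter"
# pattern with text. The language pair and wording are illustrative choices.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a live interpreter between English and Japanese. "
                "When you receive English, reply only with the Japanese translation; "
                "when you receive Japanese, reply only with the English translation."
            ),
        },
        {"role": "user", "content": "Where is the nearest train station?"},
    ],
)
print(response.choices[0].message.content)
```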
That alone, you've got to think, totally changes the need for a lot of startups that are trying to do these very specific use cases. And this is representative, again, of OpenAI kind of steamrolling a lot of startups as they roll out these smaller features inside their newer models and the bigger direction that they're going in.
Anyway, as I'm recording this, we've got Google I/O on Tuesday, and we've got new models coming out there. I will be making some more videos about that later in the week, and also some videos coming back to this and playing with code with this new GPT-4o model, seeing what you can do with it and what you can actually make out of it. So anyway, as always, if you've got any comments or questions, please put them in the comments below. Otherwise, I will talk to you in the next video. Bye for now.