Google GEMINI 1.5 Capabilities SHOCKED everyone! 1,000,000 Token Context, MoE | GPT-4 in trouble?!

Video Statistics and Information

Captions
So Google unexpectedly drops Gemini 1.5, and it's better. It's a lot better. But as you'll see, Google is now a fundamentally different company; how Google is doing things is going to be very different moving forward, I feel like. Let's take a look. Let's start by covering the biggest and most important things, and then we'll do a deep dive into the details. First and foremost, Sundar Pichai is the CEO of Google, so he's the search and ads and all that stuff, and then Demis Hassabis is the CEO of Google DeepMind; he's the AI side. I'm simplifying, but I think it's fair. If you've ever seen The Office, they're like co-managers. So first of all, Gemini 1.5 dramatically enhances performance, and they're not just tweaking a few things here and there; they're adding a new mixture-of-experts architecture. Now, Google published a couple of papers a while back about mixture of experts. This is an idea where, instead of thinking of a model as one big chunk, if you will, it's broken up into almost a collection of models, a mixture of models, each with its own strengths. Here's a 2017 paper out of Google. The basic idea is that your prompt, your question, gets routed to the correct expert, so you can have expert 1, 2, 3, etc.; the question gets answered, and that's the output. One might be the coder, one might be the writer, one might be the sentiment analysis; I'm just making that up as an example, but that's the idea. We believe that GPT-4 runs on this mixture-of-experts approach, and it's one of the reasons why it was so powerful out of the gate, the best-in-class model, the state-of-the-art model, until, well, essentially until now. So the first Gemini 1.5 model they're releasing is Gemini 1.5 Pro. If you remember the naming convention they use, Ultra is the top, Pro is tier 2, and Nano is the small one, the mobile one. So Pro you can think of as tier 2, or mid-size as they say, and they're saying that it performs at a similar level to 1.0 Ultra. Gemini 1.0 Ultra was the best model they've presented recently, the one that is being rolled out, and now this new Gemini 1.5 Pro, the mid-line version, is as powerful as that large one. Now obviously we have to see the benchmark tests, third-party testing, etc., but as you'll see in a second, there are reasons to believe this might be the real deal. They're saying it also introduces a breakthrough experimental feature in long-context understanding. Gemini 1.5 Pro comes with a standard 128,000-token context window, which is, you know, in line with the other models that are out there, but a limited group of developers and enterprise customers can try the new context window of up to 1 million tokens. And 1 million tokens is kind of a big deal: as you can see here, GPT-4 Turbo is at 128,000, Claude 2.1 at 200,000, and Gemini 1.5 Pro at 1 million, or at least this limited version that they have, with up to 10 million in research. That's, for example, 1 hour of video, 11 hours of audio, more than 30,000 lines of code, or more than 700,000 words. Gemini 1.5 is built on their leading research on Transformer and mixture-of-experts (MoE) architecture: while a traditional Transformer functions as one large neural network, MoE models are divided into smaller expert neural networks. And here they're saying Google has been an early adopter and pioneer of the MoE technique, and they post a bunch of papers they've published on it. This is them, you know, flexing on OpenAI just a little bit, and these papers are by people we've talked about before, big names in the AI research field.
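If you want to picture how that routing works, here's a minimal sketch of a gated mixture-of-experts layer. This is purely illustrative, not Google's or OpenAI's actual implementation; the expert count, sizes, and random weights are all made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, NUM_EXPERTS, TOP_K = 16, 32, 4, 2  # toy sizes, purely illustrative

# Each "expert" is just a tiny feed-forward network with its own weights.
expert_weights = [(rng.standard_normal((D, H)), rng.standard_normal((H, D)))
                  for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((D, NUM_EXPERTS))  # the router / gating network

def expert_forward(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out  # simple ReLU MLP

def moe_forward(x):
    """Route a single token vector x to its top-k experts and mix the results."""
    logits = x @ gate_weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over experts
    top = np.argsort(probs)[-TOP_K:]          # only these experts are activated
    mix = sum(probs[i] * expert_forward(x, *expert_weights[i]) for i in top)
    return mix / probs[top].sum()             # renormalize the top-k gate weights

token = rng.standard_normal(D)
print(moe_forward(token).shape)  # (16,) same shape as the input token
```

The point is just that only TOP_K of the NUM_EXPERTS networks actually run for a given input, which is how these models can grow total parameter count without the per-query compute growing with it.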
So you probably recognize some of these names, and again, we'll dive deeper into this later in the video. But to me, what this is signaling, and this is just my interpretation, is that they're saying: we built the Transformer, we built the MoE architecture, and a lot of the other AI companies took these ideas and really ran with them, but this sounds like Google saying, we're taking it back. So this is the Gemini 1.5 Pro technical report that they've posted; here it is. They're calling it unlocking multimodal understanding across millions of tokens of context, and this is the thing I want to highlight: Gemini 1.5 Pro achieves near-perfect needle recall. What is a needle? What is a haystack? Are these advanced machine learning terms? No, they're just saying Gemini 1.5 can find really small bits of information in a very large amount of text or audio or video, the needle in the haystack. This has been a massive problem for a lot of different LLMs, and the reason is that what researchers have found with these models, oftentimes, is that if you give them a very large amount of text (if you take a huge blog post or part of a book, paste it in, and ask specific questions about what's contained in there), their recall, their ability to remember various things throughout that text, starts out pretty high. "It was the best of times, it was the worst of times": a lot of people remember that beginning of the book, right? But then we kind of forget a lot of the stuff that happens towards the center, towards the middle, so our recall drops off, and then we remember a little bit more right at the end. You might remember the ending of a movie or a book but have trouble remembering what happened in the middle, and LLMs work much the same way: they remember certain things, specifically the beginning and the end, a little bit better than they do the rest. So that's why this is kind of exciting, if they really were able to figure out how to do this, the thing they're calling near-perfect needle recall, greater than 99.7% at up to 1 million tokens. If this is for real, this would be massive. And they're saying it's able to do it in all modalities, like text, video and audio, and it even maintains this recall performance when extending up to 10 million tokens in the text modality (approximately 7 million words), 2 million tokens in audio, and 2.8 million in video.
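As a quick aside, the needle-in-a-haystack test itself is simple to reproduce in spirit. Here's a rough sketch of how such an evaluation is typically run; the `query_model` function is a stand-in for whatever API you would actually call, not a real client, and the filler text and needle are invented for illustration.

```python
import random

FILLER = "The quick brown fox jumps over the lazy dog. " * 50  # stand-in prose
NEEDLE = "The secret ingredient in grandma's soup is nutmeg."
QUESTION = "What is the secret ingredient in grandma's soup?"

def build_haystack(total_chars: int, depth: float) -> str:
    """Build a long document and bury the needle at a relative depth (0.0 to 1.0)."""
    body = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(body) * depth)
    return body[:cut] + " " + NEEDLE + " " + body[cut:]

def query_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM API call here.
    raise NotImplementedError

def run_eval(context_sizes, depths):
    results = {}
    for size in context_sizes:
        for depth in depths:
            doc = build_haystack(size, depth)
            answer = query_model(doc + "\n\n" + QUESTION)
            results[(size, depth)] = "nutmeg" in answer.lower()  # recall hit or miss
    return results

# Sweeping sizes and depths is what produces the green/red recall grid in the report:
# run_eval([100_000, 1_000_000], [0.1, 0.5, 0.9])
```

Each (size, depth) cell is either a hit or a miss, which is exactly what the green-and-red grid in the technical report visualizes.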
Here are a few clips they present showcasing that ability. I'll play them next, and I'll put video chapters below so you can skip if you don't want to watch the whole thing. The first one is understanding text, specifically the Apollo 11 transcript. This is a demo of long context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 402-page PDF of the Apollo 11 transcript, which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google AI Studio and asked: find three comedic moments, list quotes from this transcript and emoji. This screen capture is sped up; this timer shows exactly how long it took to process each prompt, and keep in mind that processing times will vary. The model responded with three quotes, like this one from Michael Collins: "I'll bet you a cup of coffee on it." If we go back to the transcript, we can see the model found this exact quote and extracted the comedic moment accurately. Then we tested a multimodal prompt: we gave it this drawing of a scene we were thinking of and asked, what moment is this? The model correctly identified it as Neil's first steps on the moon. Notice how we didn't explain what was happening in the drawing; simple drawings like this are a good way to test whether the model can find something based on just a few abstract details. And for the last prompt we asked the model to cite the time code of this moment in the transcript. Like all generative models, responses like this won't always be perfect; they can sometimes be a digit or two off, but let's look at the model's response here, and when we find this moment in the transcript we can see that this time code is correct. These are just a few examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro. The next one is a 44-minute silent Buster Keaton movie, so it's relying on visuals alone. This is a demo of long context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 44-minute Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio we uploaded the video and asked: find the moment when a piece of paper is removed from the person's pocket, and tell me some key information on it, with the time code. This screen capture is sped up, and this timer shows exactly how long it took to process each prompt; keep in mind that processing times will vary. The model gave us this response, explaining that the piece of paper is a pawn ticket from Goldman and Company Pawn Brokers, with the date and cost, and it gave us this time code, 1:21. When we pulled up that time code we found it was correct: the model had found the exact moment the piece of paper is removed from the person's pocket, and it extracted the text accurately. Next we gave it this drawing of a scene we were thinking of and asked, what is the time code when this happens? This is an example of a multimodal prompt, where we combine text and image in our input. The model returned this time code, 15:34; we pulled that up and found that it was the correct scene. Like all generative models, responses vary and won't always be perfect, but notice how we didn't have to explain what was happening in the drawing; simple drawings like this are a good way to test whether the model can find something based on just a few abstract details, like it did here. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro. The last one is relevant problem solving with longer blocks of code: given 100,000 lines of code, 1.5 Pro can better reason across examples and give explanations about how different parts of the code work. This is a demo of long context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through some example prompts using the three.js example code, which comes out to over 800,000 tokens. We extracted the code for all of the three.js examples and put it together into this text file, which we brought into Google AI Studio. Over here we asked the model to find three examples for learning about character animation. The model looked across hundreds of examples and picked out these three: one about blending skeletal animations, one about poses, and one about morph targets for facial animations, all good choices based on our prompt. In this test the model took around 60 seconds to respond to each of these prompts, but keep in mind that latency times might be higher or lower, as this is an experimental feature we're optimizing.
Next, we asked what controls the animations in the Littlest Tokyo demo. As you can see here, the model was able to find that demo, and it explained that the animations are embedded within the glTF model. Next we wanted to see if it could customize this code for us, so we asked: show me some code to add a slider to control the speed of the animation, using the kind of GUI the other demos have. This is what it looked like before on the original three.js site, and here's the modified version: it's the same scene, but it added this little slider to speed up, slow down, or even stop the animation on the fly. It used the GUI library the other demos have, set a parameter called animation speed, and wired it up to the mixer in the scene. Like all generative models, responses aren't always perfect; there's actually no init function in this demo like there is in most of the others, however the code it gave us did exactly what we wanted. Next we tried a multimodal input by giving it a screenshot of one of the demos. We didn't tell it anything about this screenshot and just asked where we could find the code for this demo, seen over here. As you can see, the model was able to look through the hundreds of demos and find the one that matched the image. Next we asked the model to make a change to the scene, asking: how can I modify the code to make the terrain flatter? The model was able to zero in on one particular function called generateHeight and showed us the exact line to tweak; below the code it clearly explained how the change works. Over here in the updated version you can see that the terrain is indeed flatter, just like we asked. We tried one more code modification task using this 3D text demo. Over here we asked: I'm looking at the text geometry demo and I want to make a few tweaks; how can I change the text to say "goldfish" and make the mesh materials look really shiny and metallic? You can see the model identified the correct demo and showed the precise lines in it that need to be tweaked; further down it explained these material properties, metalness and roughness, and how to change them to get a shiny effect. You can see that it definitely pulled off the task, and the text looks a lot shinier now. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro.
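The setup for that code demo, concatenating every example into a single text file and feeding it in as context, is easy to imagine yourself. Here's a rough sketch of what that preparation step might look like; the directory path and the four-characters-per-token estimate are assumptions for illustration, not Google's actual tooling.

```python
from pathlib import Path

EXAMPLES_DIR = Path("three.js/examples")   # hypothetical local checkout
APPROX_CHARS_PER_TOKEN = 4                 # rough heuristic, varies by tokenizer

def build_code_context(root: Path) -> str:
    """Concatenate every example source file into one big prompt context."""
    parts = []
    for path in sorted(root.rglob("*.html")) + sorted(root.rglob("*.js")):
        parts.append(f"\n// ===== FILE: {path.relative_to(root)} =====\n")
        parts.append(path.read_text(encoding="utf-8", errors="ignore"))
    return "".join(parts)

if __name__ == "__main__":
    context = build_code_context(EXAMPLES_DIR)
    est_tokens = len(context) // APPROX_CHARS_PER_TOKEN
    print(f"~{est_tokens:,} tokens of code context")
    prompt = context + "\n\nFind three examples for learning about character animation."
    # `prompt` would then be sent to the model as one very long request.
```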
And here they're continuing about enhanced performance. They're saying that 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks they use for developing their large language models, and compared to 1.0 Ultra it performs at a broadly similar level. In this needle-in-a-haystack evaluation, where a small piece of text containing a particular fact or statement is purposefully placed within a long block of text, 1.5 Pro found the embedded text 99% of the time in blocks of data as long as 1 million tokens. It also shows impressive in-context learning skills, meaning it can learn a new skill from information given in a long prompt, and they used the Machine Translation from One Book benchmark, which shows how well the model learns from information it has never seen before. They briefly talk about extensive ethics and safety testing. This is one area where I don't think they'll beat the state-of-the-art model, and that is Goody-2, "the world's most responsible AI model." It's so safe that it won't answer anything that could possibly be construed as controversial or problematic; if you ask it what 2 plus 2 is, it will refuse to answer because, well, it's a human-centric numerical modeling system and it's biased towards a human interpretation of the universe. So it's outrageously safe. I just thought this was really funny: it won't answer anything and then kind of lectures you on why that's bad. Anyway, if you're a developer interested in testing 1.5 Pro, you can sign up in AI Studio, and enterprise customers can reach out to their Vertex AI account team. Now, you and I, let's take a look at this paper. They're introducing Gemini 1.5 Pro; specifically they're talking about 1.5 Pro, not 1.5 Ultra, which I don't think was referenced in any way in the announcement. This is a highly compute-efficient multimodal MoE (mixture-of-experts) model, and it's able to recall and reason over fine-grained information (that's the needle in the haystack) from millions of tokens of context, including text, video and audio, and it matches or surpasses Gemini 1.0 Ultra's, you know, state-of-the-art performance across a broad set of benchmarks. They also mention that it requires significantly less compute to train, which is interesting. Here they demonstrate the needle in the haystack: green is successful retrieval, red is unsuccessful. It looks like there are only a few unsuccessful retrievals, and only in the text haystack, as you can see: two in the 10-million-token one and three here in the 1 million, which is impressive. I downloaded Nvidia's Chat with RTX yesterday, which is this massive 35-gigabyte file that you download and extract, and it basically runs on your machine, on your computer, allowing you to chat with documents locally. You have to have one of the specific Nvidia cards, an RTX 30 or 40 series GeForce GPU, to be able to use it, and in theory it's supposed to be able to go through all your documents so you can ask it questions and it just tells you the answer and then references the file. It was okay, but in my testing with just, you know, 40-page PDFs, it would definitely miss stuff that was in there. If I asked about anything in the middle of that document, say 20 pages in, it would just say, nope, that's not in there, even if I used the exact correct keyword to find that information. So that was a little bit of a letdown, and I ended up not doing a video on it. But if this is how Gemini 1.5 actually works in reality, I mean, if these are the same results we get when we're testing it, that would be incredible, and I would be very curious to know what they did to make this happen, because this is above and beyond anything anybody else has right now. And they're saying that to measure the effectiveness of the model's long-context capabilities, they conduct experiments on both synthetic and real-world tasks: synthetic needle-in-a-haystack tasks (inspired by Kamradt's), plus, I guess, a language that only around 200 people speak, which the model is given to learn from in context. And the model can reliably find the information near-perfectly, that needle in a haystack, across all modalities. Okay, but here's what they're saying about more realistic multimodal long-context benchmarks; the needle-in-a-haystack stuff is kind of their synthetic, maybe best-case scenario, and this is the more realistic one. They're saying that in realistic scenarios which require retrieval and reasoning, such as answering questions from long documents or long videos, we also see Gemini 1.5 Pro outperforming all competing models across all modalities, even when those models are augmented with external retrieval methods.
The interesting thing they're saying here is about why they use this task of translating from English to Kalamang, that language we talked about that's spoken by fewer than 200 speakers. The reason they use it is that there's almost no online presence of that language, meaning that if the model is able to answer these questions, we know it's not pulling from its training data. Very often on these benchmarks, when we test the models, if the answer is part of the training data it's going to get the answer right, not because it's smart, but because it knows the question and the answer and just kind of repeats them. With a language like this that's not online, if it gets it correct, it's because it learned from in-context learning, not from some blog post it read on the web during its training, if that makes sense. And they're saying that the quality of Gemini's translations is similar to that of a person who learned from those very same materials, so it's showing that it's able to learn in a very human-like manner from documentation, in-context learning from only 500 pages of linguistic documentation. That's pretty cool. And here they're comparing 1.5 Pro, this new model, to 1.0 Pro and 1.0 Ultra, the biggest model that's available right now from Google. The new 1.5 Pro has, let's say, a 55% win rate over 1.0 Ultra, so let's say it's similar, on the same level, maybe a little bit better. It really wins out on text, less so on vision, and even less so on audio, but it just completely destroys 1.0 Pro, the model I assume is similar in size to it; it just does a lot better, comparable to the larger model. Here they go a little bit into the model architecture. They use, as we've talked about, a mixture of experts, where you're only activating some of the experts rather than the whole model, and they're saying that allows models to grow their total parameter count while keeping the number of parameters that are activated for any given input constant. Here's the thing that I think really illustrates it: back in the day we thought going from GPT-3 to GPT-4 was just going from a smaller model to one big model, whereas in reality what we think happened (some of this is conjecture, some of this is just guessing based on the information that's out there) is that it used a mixture of experts. So it was bigger, but it was multiple of these models stitched together, and when you ask a question you're only activating some of them, depending on what you need. And it sounds like that's exactly what Google is doing here as well.
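To put rough numbers on that "grow total parameters while keeping activated parameters constant" idea, here's a back-of-the-envelope sketch. The layer sizes and expert counts are invented for illustration; they are not Gemini's or GPT-4's actual dimensions, which haven't been disclosed.

```python
# Toy comparison: a dense feed-forward layer vs. an MoE layer with similar
# per-token compute. All numbers are made up purely for illustration.

d_model = 4096          # hidden size of the model
d_ff = 4 * d_model      # feed-forward width
num_experts = 16        # experts in the MoE layer
top_k = 2               # experts activated per token

dense_params = 2 * d_model * d_ff                 # up-projection + down-projection
moe_total_params = num_experts * dense_params     # every expert has its own weights
moe_active_params = top_k * dense_params          # only top-k experts run per token

print(f"dense layer params:   {dense_params:,}")
print(f"MoE total params:     {moe_total_params:,}  ({num_experts}x the dense layer)")
print(f"MoE active per token: {moe_active_params:,}  (only {top_k} experts fire)")
```

So the model's capacity (total weights) scales with the number of experts, while the compute per token scales only with top_k, which is the whole appeal of the approach.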
They're also saying there's a host of improvements made across nearly the entire model stack: architecture, data, optimization and systems, and then training infrastructure and data set. They say the hardware used is similar to the previous model; Google has their TPUs, tensor processing units. I tried to find an image that compared TPUs versus GPUs, which is what Nvidia is famous for, versus CPUs, kind of your regular computer stuff, and I found this wonderful image. So, yeah, there you go, I hope that makes sense: "Co F5 for LOD most of parading." This is likely AI-generated, but it's interesting that it's probably based on a real image, and it's kind of right. TPUs, tensor processing units, are what Google is using, optimized for deep learning. GPUs do high-speed matrix multiplication, multiplying matrices in parallel, which is used to train the models and also to get inference, the answers, out of the models. That's GPUs, and that's why Nvidia is now, I think, a $1.7 trillion company: because they make these things really, really well. Their pre-training data set includes data sourced across many different domains, like web documents, code, images, audio and video, and then for the instruction-tuning phase they fine-tune the model on a collection of multimodal data. When these models first come out they're called base models; it's sort of the wild and untamed model, just a text-completion model, kind of hard to use, kind of nutty, you never know what you're going to get. And then they get fine-tuned into what we know today, like ChatGPT, which is the fine-tuned version of the base model of GPT-4. Most people have probably only used the chat model: if you ask it a question and it answers in a way that makes sense, then you're using the fine-tuned model, also sometimes referred to as RLHF, reinforcement learning from human feedback, where humans hit thumbs up and thumbs down to make it do what we want it to do. They're saying the ability to process multiple millions of tokens just wasn't possible before, and they're noticing some surprising interactions with this new model, so they give some examples. For example, given the JAX codebase, which is about 750,000 tokens, it's able to pull out the very specific location of a core automatic differentiation method, right there. Then, given a grammar book and a dictionary: translate this from English to Kalamang, that language that is very rarely spoken, so there's not a lot of online training data for it, and it's able to do that. Again, it sounds like this was a good test for that particular reason: because this language is not available online, we know it's not in the training data, so it's learning from this long-context grammar book and dictionary. I mean, who just sits around and reads a dictionary? Eminem, I guess, but that's about it. In the next one, they fed it a whole giant book as a PDF file to read and asked: look at the event in this drawing, what page is this on? And they actually had this little drawing. The book in question is (and I know how much some of you love correcting my pronunciation, and I know I'm not going to hear the end of this) Les Misérables, which is 1,382 pages, and it's able to, you know, translate that image into what it means, then find the location of that particular scene in the book, on page 1099, and actually write that out. I'm assuming the researchers added this yellow highlight: "took the two silver candlesticks." So I'm guessing it's looking at that drawing, and this person, he seems, what would you call that expression? I would say shocked; shocked, that's a good word. And then an old Buster Keaton movie: they're asking about a particular piece of paper that has some context around it, and they also want the time code for that moment, and it's able to tell what that paper is, which is interesting, because it sounds like you have to go across multiple scenes to understand it.
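That grammar-book example is really just a very long in-context-learning prompt. Here's a rough sketch of how you might assemble one; the file names are hypothetical stand-ins, and the actual Kalamang evaluation materials are described in the technical report rather than reproduced here.

```python
from pathlib import Path

# Hypothetical local files standing in for the real evaluation materials.
GRAMMAR_BOOK = Path("kalamang_grammar.txt")   # ~500 pages of linguistic documentation
DICTIONARY = Path("kalamang_dictionary.txt")  # bilingual word list
SENTENCE = "Where is the nearest village?"

def build_translation_prompt(sentence: str) -> str:
    """Everything the model knows about the language arrives in the prompt itself."""
    return (
        "You will translate English into Kalamang. The only resources you have "
        "are the grammar reference and dictionary below.\n\n"
        "=== GRAMMAR REFERENCE ===\n" + GRAMMAR_BOOK.read_text(encoding="utf-8") +
        "\n\n=== DICTIONARY ===\n" + DICTIONARY.read_text(encoding="utf-8") +
        f"\n\nTranslate into Kalamang: {sentence}\n"
    )

# The resulting prompt runs to hundreds of thousands of tokens, which is why this
# kind of test only became practical once context windows got this long.
```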
And here they're talking about long-context evaluation. They've mentioned how Anthropic was, like, the first one with a long context window of 100,000 tokens, the first one where everybody went, oh, that's the one with the long context window. And then OpenAI, I think in November during their developer day conference, if I recall correctly, released GPT-4 Turbo with 128K tokens, so it beat out Anthropic. Since then, Claude 2.1 came out with 200,000 tokens, so there was this sort of arms race to increase the token limit, and now Google comes out and just drops 1 million. Again, that's not currently available for everybody, only for a certain select group of developers and enterprise customers, but hopefully it'll get everybody else to start pushing out larger context windows and thinking about ways to increase them while improving and maintaining the ability to find those small, specific bits of text in them, that needle in a haystack, as Google puts it. So they're able to extend it from, you know, the frontier we talked about here to multiple millions of tokens with no degradation in performance, which again is shocking, and it achieves 100% recall at 200,000 tokens, surpassing Claude 2.1's 98%. That 100% recall is maintained up to, let's say, half a million tokens, with 99.7% at 1 million, and then from 1 million to 10 million it retains 99.2%. And since Gemini 1.5 Pro is natively multimodal, this means it's able to retrieve specific information across multiple hours of video. So they took the AlphaGo documentary, which is a cool, interesting watch if you haven't seen it (it's free on YouTube), where the AlphaGo team, the Google DeepMind team, goes and beats Lee Sedol, the world champion, at the game of Go, and you see how that whole thing played out. And this is Demis Hassabis, the CEO of Google DeepMind, the guy driving forward a lot of this stuff. That's where that famous move 37 comes from, where AlphaGo, on its 37th move, makes this strange move that people are so confused about. If you look at their faces, they're kind of unsure, looking back and forth, like: is that a mistake, did AlphaGo mess up? They're so confused about it because it's such a weird, alien move, and it's a move that generally would be considered bad; if a student learning the game made that move, the instructor would say, no, that's a bad move, it goes against what you're supposed to do. But that move is eventually shown to be really good, very strong, very powerful, very creative, and AlphaGo goes on to defeat Lee Sedol, the world champion, or at this point, I guess, better than any human opponent, defeating him 4 games to 1. So they put two copies of that documentary together, and somewhere they overlay the text "The secret word is needle" on a single, randomly sampled video frame. It's an almost 4-hour-long video, made by concatenating, stitching together, two copies of the AlphaGo documentary, and then they asked the model: what is the secret word? And of course Gemini 1.5 Pro successfully answers this question, and it does so across a bunch of randomly inserted needle locations; wherever they stick it in there, it seems to be able to find it. Here's the frame with that needle in the haystack from the AlphaGo documentary; here it is, "the secret word is needle." It's not like it took up the whole frame or anything; it would be very easy to miss. It kind of blends in with the rest: there's a lot of text on here, right? There are the subtitles, and tons of little writing in the background on the walls and stuff. So that is pretty impressive.
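Constructing that video haystack is conceptually simple: pick one random frame out of hours of footage and stamp a short line of text on it. Here's a rough sketch with OpenCV of how you might do that yourself; the input and output file names are placeholders, and this obviously isn't Google's actual evaluation code.

```python
import random
import cv2  # pip install opencv-python

SRC, DST = "alphago_doc_twice.mp4", "haystack_with_needle.mp4"  # placeholder paths
NEEDLE_TEXT = "The secret word is needle"

cap = cv2.VideoCapture(SRC)
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
needle_frame = random.randrange(total)          # the single frame that gets the text

out = cv2.VideoWriter(DST, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx == needle_frame:
        # Small, unobtrusive text so it "blends in" the way it does in the demo.
        cv2.putText(frame, NEEDLE_TEXT, (50, h - 50),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
    out.write(frame)
    idx += 1

cap.release()
out.release()
print(f"needle placed at frame {needle_frame} (~{needle_frame / fps:.0f}s in)")
```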
And then we have the MMLU, which is often used as one of the main benchmarks. Gemini 1.0 Ultra got 83.7, and Gemini 1.5 Pro actually scores less. Now, the important thing to understand about the MMLU is that there are some questions about how good a benchmark it is, and AI Explained, another YouTube channel, did a very interesting deep dive explaining some of the problems the MMLU has. As you can see, the question makes no sense: the text says "demand reduction" and the answers are just combinations of digits, like 1, 3, 4 or 2, 3. What on Earth is that about? Now remember, it was only human grading that enabled us to spot this; I reckon most companies, like OpenAI, rely on auto-grading by exact match, which would immediately toss out any answer like these as a null, because an answer of A, B, C or D hasn't been given. When I found the original source I realized what the problem was: sometimes they just hadn't pasted in all of the statements. When you can see all four of these statements, the answer options make a lot more sense. Business ethics only has 100 questions, so 13 of them missing vital context completely undermines that entire subject. And let's see: yes, it's correct, it's 10 (by the way, the actual answer seems to be 14 as of 2023, but that's fairly close). What does the MMLU say? It says the answer is 100. And it goes on and on like this. One example that will particularly shock AI researchers: here we have a question for which the MMLU says the correct answer is A, while the original source says the answer is eight, which isn't even an option. But this question was in the dev set, and if you remember from earlier in the video, that is the set of five questions they use to teach the model what kind of answer to give when they're benchmarking on the MMLU. In other words, all 100 results in college chemistry, for every model benchmarked on the MMLU, are compromised. And then we get controversial questions like this one in security studies: how do biological differences affect the roles that men and women must perform for the state, with the correct answer being "gender roles are a social construct." I feel GPT-4's answer is far more nuanced: this question touches on complex and controversial topics, and while there is evidence to support a few elements within each of the provided statements, none of them fully captures the nuanced relationship between biology, gender roles, society and state responsibilities. It also picks up on that language, "must perform for the state." So you might want to take these results with a grain of salt, because this might not be the greatest of all tests.
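That auto-grading point is worth seeing concretely. Here's a minimal sketch of the kind of exact-match grading AI Explained suspects most labs use; it shows how a malformed question, where the model can't sensibly answer A, B, C or D, just becomes a null rather than a flagged problem. The parsing rule and example answers are invented for illustration.

```python
import re

def grade_exact_match(model_output: str, gold: str) -> bool | None:
    """Return True/False if the model gave a single letter, None otherwise."""
    match = re.fullmatch(r"\s*\(?([ABCD])\)?\.?\s*", model_output)
    if match is None:
        return None  # refusals, hedged answers, or nonsense options become nulls
    return match.group(1) == gold

print(grade_exact_match("B", "B"))                                   # True
print(grade_exact_match("(C)", "B"))                                 # False
print(grade_exact_match("None of these options makes sense.", "B"))  # None
```

With a pipeline like this, the broken digits-only questions never get looked at by a human; they just quietly count against (or occasionally for) every model being benchmarked.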
All right, so in summary: it looks like Gemini 1.5 Pro, the mid-tier model, is as good as Gemini 1.0 Ultra, the previous big model, which is pretty exciting. It uses the MoE architecture, mixture of experts, which is what we believe made GPT-4 so powerful, or at least part of it. It has a 1 million token context window, or at least it does for the small, select group of people that are testing it, so hopefully we'll get access to that soon. And it's really good at what's called finding the needle in a haystack: you give it a very large book or video or audio file, you bury a little piece of information in there, and it's able to extract it and find it, which, in my testing at least, a lot of other models have struggled with. I believe it said at some point that it has near 100% ability to find it up to 500,000 tokens or thereabouts, and after that it drops off to above 99%, so it's still very, very accurate. In terms of availability, if I understand correctly: today, developers and Cloud customers can begin building with 1.0 Ultra in AI Studio and Vertex AI; that's 1.0, the one that was announced back in December. 1.5 is not available unless you're part of the limited preview for developers and enterprise customers. And here's AI Studio; as you can see here, Gemini 1.5 Pro is experimental. It looks like I briefly had a little window here saying that I could join the waitlist, and now it's gone, but it sounds like I have Gemini 1.0 Pro available; I don't have the Ultra or the 1.5 yet. It's nevertheless exciting to know that it's coming, that they're working on it; let's see what happens when they roll it out. Again, at face value this sounds very exciting. In the past there was one little hiccup where they introduced that vision demo and it wasn't quite what it appeared, so a lot of people are a little bit hesitant about getting too hyped up, and, you know, I'm kind of in the same boat: I do want to see it, I do want to test it out, so we'll see how that goes. But certainly it does seem like Google and DeepMind are gearing up to be much more competitive in the AI space, putting out better, more capable models. They're moving really fast, and certainly competition is a good thing; it forces other people to innovate and provide better solutions. So, very excited to see what comes next. If you made it this far, my name is Wes Roth; thank you so much for watching, and make sure you're subscribed for more shocking AI news. Thank you.
Info
Channel: Wes Roth
Views: 60,428
Id: lSmG8hbLEts
Length: 33min 11sec (1991 seconds)
Published: Thu Feb 15 2024