NVIDIA's AGI "SuperTeam" SHOCKS The ENTIRE Industry | Karpathy Leaves OpenAI, Gemini Infinite Tokens

Video Statistics and Information

Captions
Imagine this: you wake up one morning and you're the CEO of the third-largest company in the world. Just last week you posted another record quarter, which produced a one-day gain roughly the size of an entire Netflix. You have the most cutting-edge chips, the best AI researchers, and a cash stockpile bigger than many countries'. Jensen Huang, the CEO of Nvidia, is in exactly this position, and what he decided to do is go all in on AGI. Not just AGI, though: he's building a superstar team to achieve AGI in any modality and any reality, and the entire team is built around one man, Dr. Jim Fan, who is starting a group dedicated to extending his research on the foundation agent. Everything has been leading to this, and that's what we're going to talk about today. We'll briefly go through the incredible contributions that led him to start a team that is well positioned to achieve AGI, with the best minds, effectively unlimited cash, and the largest collection of GPUs in the world. Let me show you why this is going to change everything. Also in this video: Andrej Karpathy left OpenAI, Gemini 1.5 has an essentially unlimited context size, Chat with RTX was released by Nvidia, and Groq stunned everybody with its inference speeds. Let's take a look at all of it.

First, Dr. Jim Fan. If you're not familiar with him, Jim Fan is a lead researcher at Nvidia, a very important person in the world of AI, and the work he's been doing on something called the foundation agent is truly mind-blowing. Now he's going all in on it. He writes: "Career update: I am co-founding a new research group called GEAR at Nvidia" (so he's not leaving Nvidia, he's starting a new organization within the company) "with my longtime friend and collaborator, Professor Yuke Zhu. GEAR stands for Generalist Embodied Agent Research." This is exactly what he has been promising with the foundation agent. I made an entire video about the foundation agent because it was truly futuristic, something that could change the face of artificial intelligence completely; if you want to check that out, I'll drop a link in the description below. "We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the foundation agent, a generally capable AI that learns to act skillfully in many worlds, virtual and real."

As a quick reminder, the first time we heard rumblings of the foundation agent was when they built a Minecraft agent that could essentially play Minecraft indefinitely and really reach the endgame of Minecraft for the first time. Then, in my previous video, I showed how they were applying that same architecture to operate in the real world fully autonomously. I'm extremely excited about what Jim Fan is doing with his team. I've made a couple of videos about simulation, and this is an extension of that, because the foundation agent promises to let AI operate in any reality. What does that actually mean? First, any virtual reality: any video game, no matter what the rules or the physics definitions are, it can operate in that kind of virtual environment. But it can also operate in any real-world environment, through the embodiment of a robot, and the way they train it is by simulating a real-world environment inside a virtual one, so all of a sudden they have an enormous amount of synthetic data they can use to train a robot for the real world.
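To make that sim-to-real idea slightly more concrete, here is a minimal sketch, under my own assumptions rather than NVIDIA's actual pipeline, of how synthetic training episodes can be generated by randomizing a simulated world's parameters so a policy doesn't overfit to one exact physics configuration. The `SimParams` fields are hypothetical placeholders.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    """Physics settings randomized per episode (illustrative names, not NVIDIA's actual knobs)."""
    friction: float
    object_mass_kg: float
    motor_strength: float
    camera_jitter_px: float

def sample_params() -> SimParams:
    # Domain randomization: vary the simulated world so a policy trained in
    # simulation has a better chance of transferring to the messy real world.
    return SimParams(
        friction=random.uniform(0.4, 1.2),
        object_mass_kg=random.uniform(0.05, 0.5),
        motor_strength=random.uniform(0.8, 1.2),
        camera_jitter_px=random.uniform(0.0, 3.0),
    )

def generate_synthetic_configs(num_episodes: int) -> list[SimParams]:
    # In a real pipeline each parameter set would seed a GPU-accelerated
    # simulator rollout; here we only collect the randomized configurations.
    return [sample_params() for _ in range(num_episodes)]

if __name__ == "__main__":
    for episode, params in enumerate(generate_synthetic_configs(5)):
        print(f"episode {episode}: {params}")
```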
Here he says 2024 is the year of robotics, and if you remember, that was one of my predictions for 2024, so I'm glad somebody who knows a million times more than me about the subject agrees. It's also the year of gaming AI (I'm super excited about that: when we can interact with a game and all the NPCs are essentially alive, changing the game depending on what I do in that world, that is a game I want to play) and the year of simulation. "We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs. Join us on the journey." If you want to apply to their team, you can do so right here.

Below that he lists a few highlights from the research they've already done. First, Eureka: GPT-4 writes reward functions to teach a five-fingered robot hand to do pen-spinning tricks better than I can, trained with GPU-accelerated physics simulation at 1,000 times faster than real time. That's what I mentioned will carry over to real-world robots: they train in GPU-accelerated virtual worlds and then apply the result to the real world. (I'll sketch what one of those generated reward functions might look like in a moment.) Then Voyager, the Minecraft AI I mentioned earlier: the first LLM-powered agent that plays Minecraft proficiently, bootstrapping its own capabilities as it continuously explores the open-ended world. I made a video about this as well; I'll drop it in the description below. Then MineDojo, an open framework that turns Minecraft into an AGI research playground; they collected hundreds of thousands of YouTube videos, wiki pages, and Reddit posts for training Minecraft agents.

And here's VIMA, the first multimodal LLM with a robot arm attached, which introduced multimodal prompting for robot learning. I haven't covered this one before, but it looks really cool. To dive a little deeper: "Robotics will be the last moat we conquer in AI. What would a robot GPT's API look like? Introducing VIMA, an LLM with a robot arm attached. It takes multimodal prompts: text, images, videos, or a mixture of them." You can say "rearrange the table to look like" and pass in an image, or "follow the actions" and provide video frames. You can even teach it new visual concepts in context: "this is a wug image and this is a blicket image; now put the red wug on the green blicket." Those are nonsense words, but they show what's possible when you can provide different modalities of input to a large language model. It's open source, which is really nice; you can find the GitHub repo right here, and they have a paper as well.

Next he describes MimicPlay: solving complex robotic tasks by combining unstructured human play data with robot teleoperation data. I've made a few videos about teleoperation; as a reminder, it just means somebody controls the robot remotely, whether they're physically attached to the robot or driving it through some kind of controller, and that teleoperation gives the robot data to train on, which it then generalizes to other actions that aren't exactly the same. Mobile ALOHA, which I covered in another video, is exactly that: they trained a robot that can move around the real world and do amazing tasks without explicitly defining how to accomplish them, using teleoperation to provide the training data. MimicPlay is a library out of Nvidia from Jim Fan's group, and extending it he has MimicGen, because robotics has a data problem, and actually all of AI has a data problem.
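As promised, here is a rough idea of what "GPT-4 writes reward functions" means in practice. This is not Eureka's actual output or the real simulator API; the state fields below are hypothetical. It just shows the kind of dense, shaped reward an LLM might propose for pen spinning: reward rotation about the desired axis, penalize dropping the pen.

```python
import numpy as np

def pen_spinning_reward(pen_angular_vel: np.ndarray,
                        pen_height: float,
                        target_spin_axis: np.ndarray,
                        drop_height: float = 0.05) -> float:
    """Illustrative reward in the spirit of Eureka-style LLM generation.

    pen_angular_vel: angular velocity vector of the pen (rad/s)
    pen_height: height of the pen's center above the table (m)
    target_spin_axis: unit vector of the desired spin axis
    """
    # Reward spin speed about the desired axis, squashed so it stays bounded.
    spin_rate = float(np.dot(pen_angular_vel, target_spin_axis))
    spin_reward = float(np.tanh(spin_rate / 10.0))

    # Heavily penalize dropping the pen below a threshold height.
    drop_penalty = -5.0 if pen_height < drop_height else 0.0

    return spin_reward + drop_penalty

# Example call with made-up simulator readings:
print(pen_spinning_reward(np.array([0.0, 0.0, 12.0]), 0.12, np.array([0.0, 0.0, 1.0])))
```

The appeal of generating candidates like this is that they can be tested cheaply, because the physics simulation runs at 1,000 times real time.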
And that data problem is exactly where synthetic data comes into play; synthetic data seems to be the only way we're going to reach AGI. "We multiply high-quality human data by a thousand times using digital twins in simulation," and, as he says right here, "synthetic data will provide the next trillion tokens to fuel our hungry models." Now, there's a counterpoint to that argument: the human brain learns incredible things from a tiny fraction of the information needed to train large language models. If we can do that, why do large language models require so much data? It's possible that the Transformer architecture is simply not the optimal architecture for reaching AGI. This is the edge of my knowledge, so I'm not sure; I'm just posing the question to you. Do you think that, because our brains learn so much from so little data, we might not be taking the right approach with large language models? Let me know in the comments.

And here's Prismer, an open-source multimodal LLM released before GPT-4 was announced; it combines many pre-trained visual experts to be very sample- and parameter-efficient. There are a bunch of projects he has released from Nvidia, and now he's forming an entire team to focus on this, so I'm super excited for what Jim Fan is going to bring to the world. Really appreciate all the work you're doing.

Next, a bit of AI news: Andrej Karpathy has left OpenAI after about a year there. Let's see what he says: "Hi everyone, yes, I left OpenAI yesterday. First of all, nothing happened, and it's not a result of any particular event, issue, or drama, but please keep the conspiracy theories coming, as they are highly entertaining." So he left; people sometimes leave jobs, and that's fine. I'm really interested in what he does next. Karpathy is probably one of the leading AI minds in the world, and on top of that he's able to explain complex AI topics really simply. I'm very impressed with his videos that break down very difficult topics into digestible bits for people like me who aren't experts at this stuff, and I really hope he goes down the educational route, because he is one of the best educators out there. He continues: "Actually, being at OpenAI over the last year has been really great. The team is really strong, the people are wonderful, and the roadmap is very exciting; I think we all have a lot to look forward to. My immediate plan is to work on my personal projects and see what happens. Those of you who have followed me for a while may have a sense for what that might look like." So it's probably going to be education, but I don't know him personally, and maybe he has been working on projects he isn't talking about, so I'm definitely going to keep a close eye on what he does next.

Speaking of Karpathy, he made a tweet that led me to something I found extremely cool: a blog post by Nicholas Carlini, in which Carlini built an incredibly sophisticated benchmark to run against large language models. You're all familiar with my benchmark, which is very unscientific to say the least, but Carlini put together a GitHub repo with around a hundred tests to run against these models, in different languages, and not only are they well-documented tests, they each have a concrete output to check against, which is why Karpathy called it a nice post.
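To show what "a concrete output to check against" means in practice, here is a minimal sketch of that kind of test harness. It is not Carlini's actual framework; the `query_llm` function is a placeholder that returns a canned answer so the harness runs as-is, and in real use it would call whatever model you're evaluating.

```python
import subprocess
import sys
import tempfile
import textwrap

def query_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a canned answer so this runs as-is.
    return textwrap.dedent("""
        def fib(n):
            a, b = 0, 1
            for _ in range(n):
                a, b = b, a + b
            return a
        print(fib(10))
    """)

def run_python(code: str) -> str:
    # Execute the model's answer in a subprocess and capture its stdout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return result.stdout.strip()

def test_fibonacci() -> bool:
    # A test is just a prompt, a way to run the answer, and an exact expected output.
    answer = query_llm("Write a Python program that prints the 10th Fibonacci number.")
    return run_python(answer) == "55"

if __name__ == "__main__":
    print("PASS" if test_fibonacci() else "FAIL")
```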
He added that, even more than the 100 tests themselves, the GitHub code looks excellent: a full-featured test-evaluation framework that is easy to extend with further tests and to run against many LLMs. So what do you think, should I adopt these tests? I like running my own tests, free-forming it and keeping it very unscientific, but maybe I should also adopt a more rigid framework for testing these large language models. Let me know in the comments.

If we look at the results of the current 100 tests on seven models, GPT-4 is still the best at 50%, GPT-3.5 is after that, then Claude 2.1, all the way down to Mistral Medium, Mistral Small, and Gemini Pro last. I'll check these tests out; let me know if you want a full video reviewing the test suite itself. In Carlini's blog post, some of the tests in his benchmark include: convert a Python function to an equivalent but faster C function, explain the functionality of minified JavaScript, identify the encoding format of some data, write a parser, and write some bash one-liners. All of his tests are coding-related, so maybe I'll use this benchmark to test coding-specific models, and I'm going to get to a coding-specific model that was just released in a minute. Here are some examples: "print hello" (the most basic), rewriting some code, a regex test, drawing a triangle. There are a bunch, all with tested outputs right here; I'll drop a link in the description below.

All right, next let's talk about Gemini. I was on vacation last week, so I couldn't make a full video dedicated to it, but Gemini 1.5 was released and a lot of people are saying great things about it. I just made a video about Gemma and it didn't perform well, so I'm super excited to test Gemini. In fact, I just got access to Gemini 1.5 and I'm going to be testing it all weekend; let me know in the comments if there's anything you want me to test, and I'll try to include it in the video. I was also thinking about going live on YouTube and testing it with you all, so if you'd be interested in that, let me know in the comments too. I'm really excited to try out different content formats and see what you're interested in.

Here's a blog post by Dan Shipper: "I spent a week with Gemini Pro 1.5. It's fantastic." The context window is the main thing with Gemini 1.5: it has a 1-million-token context window, and I believe I read that they're actually testing a 10-million-token context window internally. Why is a large context window so awesome? Because you can throw in entire books or entire movies and ask questions about them, whereas previously we'd have to chunk them up and ask about specific sections. But that's only useful if the model can use all that context accurately. There's a test for this called needle in a haystack, created by somebody I've gotten to know recently, fellow YouTuber Greg Kamradt. It essentially fills the context window with information and asks the large language model to find one specific piece of information buried inside the huge amount it was passed. A lot of previous models couldn't do this well; in fact, a paper called "Lost in the Middle" showed that many large language models have good recall from the beginning and the end of a prompt but not from the middle, which is actually pretty similar to how the human brain works.
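Here is a rough sketch of how a needle-in-a-haystack test is constructed: pad a prompt with filler up to the size you want, bury one fact at a chosen depth, and check whether the model repeats it back. The `query_llm` argument is a placeholder for the model being evaluated (a dummy is used below so the sketch runs), and real versions measure size in tokens rather than words.

```python
def build_haystack(needle: str, filler_sentence: str, total_words: int, depth: float) -> str:
    """Bury `needle` at fractional `depth` (0.0 = start, 1.0 = end) inside filler text."""
    repeats = total_words // max(len(filler_sentence.split()), 1) + 1
    words = (filler_sentence + " ").split() * repeats
    words = words[:total_words]
    words.insert(int(len(words) * depth), needle)
    return " ".join(words)

def needle_test(query_llm, depth: float) -> bool:
    needle = "The secret passphrase is BLUE-HORIZON-42."
    context = build_haystack(
        needle,
        filler_sentence="The quick brown fox jumps over the lazy dog.",
        total_words=5000,   # scale this up toward the model's context limit
        depth=depth,
    )
    prompt = context + "\n\nWhat is the secret passphrase? Answer with the passphrase only."
    return "BLUE-HORIZON-42" in query_llm(prompt)

if __name__ == "__main__":
    # Dummy model that "reads" the whole prompt, so the harness runs end to end.
    fake_llm = lambda p: "BLUE-HORIZON-42" if "BLUE-HORIZON-42" in p else "I don't know"
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"depth {depth:.2f}: {'recalled' if needle_test(fake_llm, depth) else 'missed'}")
```

Plotting recall over many (context length, depth) pairs is what produces the kind of green-and-red grid we'll see for Gemini 1.5 below.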
But let's read on: "Gemini 1.5 Pro read an entire novel and told me in detail about a scene hidden in the middle of it. It read a whole codebase and suggested a place to insert a new feature, with sample code." To me, a massive context size is most exciting when applied to codebases, because if you can provide an entire codebase to a large language model and have it help you code against it, AI coding suddenly reaches another level. "It even read through all of my highlights on the reading app Readwise and selected one for an essay I'm writing." For context, GPT-4 Turbo's context window is 128,000 tokens, so much, much smaller. I'm super excited to test Gemini 1.5; I just got access and I'm going to be playing with it all weekend.

Next, Google Research has released something called ScreenAI, and this is quite unique: a vision-language model for UI and infographics understanding, which points toward a future where LLMs are the operating system. I've made videos about the Rabbit device, and this is the same kind of idea. Reading and understanding a user interface is actually very difficult for a large language model. Open Interpreter is one example; I've covered a couple of open-source projects that can control your computer, and they've only worked some of the time, because taking a picture of the screen, getting the coordinates, and actually controlling the mouse to click on specific parts of the screen is a genuinely hard problem for LLMs. This paper proposes a way to really optimize that process.

Let's take a quick look at the abstract: "Screen user interfaces and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of Pix2Struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task, in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to large language models and automatically generate question-answering, UI navigation, and summarization training datasets at scale." Here's the overall architecture: you take a screenshot of a screen, pass it through the Pix2Struct-style patching, which breaks it down (it looks like it also uses a grid format), an embedding is created, and from there you can ask questions and actually control the interface if necessary. And here's what the task-generation pipeline looks like: the screens are first annotated using various models, then LLMs are used to generate screen-related tasks at scale, and optionally the data is validated by another LLM or by human raters. They're using a massive amount of data: roughly 353 million generated screen annotations, 38 million screen question-answer pairs, 15 million screen-navigation examples, and 13 million summarization examples, all used to train the model.
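The shape of that pipeline (annotate the screen's elements, turn the annotation into text, then have an LLM generate training tasks from it) can be sketched roughly like this. The annotation format and the `ask_llm` callback are my own placeholders, not the paper's actual code or data schema.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str                          # e.g. "BUTTON", "TEXT", "IMAGE"
    text: str
    box: tuple[int, int, int, int]     # x0, y0, x1, y1 in screen pixels

def annotate_screen(screenshot_path: str) -> list[UIElement]:
    # Stage 1: a vision model labels the type and location of each UI element.
    # Hard-coded here so the sketch runs without any model.
    return [
        UIElement("TEXT", "Welcome back", (20, 40, 300, 80)),
        UIElement("BUTTON", "Sign in", (20, 600, 200, 650)),
    ]

def describe_screen(elements: list[UIElement]) -> str:
    # Turn the structured annotation into plain text an LLM can read.
    return "\n".join(f"{e.kind} '{e.text}' at {e.box}" for e in elements)

def generate_training_examples(ask_llm, screenshot_path: str) -> str:
    # Stage 2: an LLM converts the screen description into question-answering,
    # navigation, or summarization examples; this is the part that scales to millions.
    description = describe_screen(annotate_screen(screenshot_path))
    prompt = (
        "Here is a description of a screen:\n"
        f"{description}\n\n"
        "Generate three question-answer pairs about this screen."
    )
    return ask_llm(prompt)

if __name__ == "__main__":
    fake_llm = lambda prompt: "Q: What does the button say? A: Sign in"
    print(generate_training_examples(fake_llm, "screenshot.png"))
```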
All of this points toward changing the way we interact with computers. I'm very bullish on the whole mouse-and-keyboard thing eventually going away; obviously we're still going to need a mouse and keyboard for a lot of things for a while, but I'm very excited to just talk to an AI and get the answers I want immediately.

OK, next: Phind, a company that has been releasing open-source models fine-tuned specifically for coding tasks, has now released a model they say is comparable to GPT-4. "Introducing Phind-70B: closing the code quality gap with GPT-4 Turbo while running 4x faster." A lot of people don't appreciate how important speed is. The speed at which we get inference matters enormously, especially for tasks where humans don't actually have to read the text, and coding is a great example; I'll talk more about speed in a bit when we get to Groq. "We are excited to announce Phind-70B, our largest and most performant model to date, running at up to 80 tokens per second. Phind-70B gives high-quality answers for technical topics without making users make a cup of coffee while they wait." Phind-70B scores 82.3% on HumanEval, beating the latest GPT-4 Turbo. Here's the HumanEval chart comparing Phind to GPT-4, and it's a huge improvement over Phind-34B. On CRUXEval it's below GPT-4 Turbo but very, very close, and it's funny that they mark the GPT-4 score as "reported versus evaluated," a little nod to the fact that the reported number may not reflect how it actually performs. The best part: Phind-70B is available today to try for free, without a login. Unfortunately it isn't open source yet, but read right here: "We love the open-source community and will be releasing the weights for the latest Phind-34B model in the coming weeks. We intend to release the weights for Phind-70B in time as well." I really hope they do, because this model seems awesome.

Stable Diffusion 3 was also announced this week: "Announcing Stable Diffusion 3, our most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities." I'm very excited to test this; I'm hoping it's better than DALL-E, because right now I'm using DALL-E pretty frequently. With the Sora release, Stability AI probably has its eye on video too. When Sora was announced, I saw somewhere that the founder of Stability AI tweeted that they are also working on text-to-video and that it will be open source. I tried to dig up that tweet but couldn't find it; if you find it, drop it in the comments so I can share it. For Stable Diffusion 3, they're opening the waitlist and you can sign up right here.

All right, another release that seems really cool is Nvidia's Chat with RTX. I haven't had a chance to play with it yet, but essentially it promises that you can chat with the documents on your computer; it uses a RAG framework, and apparently it's very fast. I have a PC I could use, I just haven't had a chance to try it, but it looks really cool. Here it says: "Chat with RTX is a demo app that lets you personalize a GPT large language model connected to your own content," whether that's docs, notes, videos, or other data, using RAG and RTX acceleration, so it should be really, really fast. What's cool about it is that it's completely local and it uses open-source models, and both of those factors are great. I'm thinking about testing it out, but there are so many other chat-with-your-docs projects out there that I don't know if you all want to see another video about it.
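I haven't looked inside Chat with RTX, so this is not its implementation, but the "chat with your docs" RAG pattern it describes generally works like this: chunk local files, index them, retrieve the chunks most similar to the question, and put those into the prompt. The toy word-count "embedding" and the `ask_llm` callback below are deliberate stand-ins for a real embedding model and a real LLM.

```python
import math
from collections import Counter
from pathlib import Path

def chunk(text: str, size: int = 200) -> list[str]:
    # Split a document into non-overlapping chunks of roughly `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words vector. A real pipeline would use a
    # neural embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer(question: str, doc_dir: str, ask_llm) -> str:
    # Index: every chunk of every local .txt file.
    chunks = [c for p in Path(doc_dir).glob("*.txt") for c in chunk(p.read_text(errors="ignore"))]
    # Retrieve: rank chunks by similarity to the question and keep the top few.
    ranked = sorted(chunks, key=lambda c: cosine(embed(question), embed(c)), reverse=True)
    context = "\n---\n".join(ranked[:3])
    # Generate: the model answers using only the retrieved context.
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

if __name__ == "__main__":
    fake_llm = lambda prompt: f"(model answer based on {len(prompt)} characters of context)"
    print(answer("What is the refund policy?", ".", fake_llm))
```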
I've already talked a little about Gemini 1.5, but let me show you a bit more, because it is so exciting. This is the blog post; I'm not going to read it all, but right here it says a context window of up to a million tokens is available via AI Studio and Vertex AI in private preview, and as I mentioned, I got access to that, so I'm thinking about all the ways to use that massive context window. Going a little further: Jeff Dean, an incredibly prominent mind in the world of AI and the chief scientist at Google DeepMind and Google Research, shared needle-in-a-haystack results out to 10 million tokens, so, as I mentioned, they are testing 10 million tokens internally. And again, congrats to Greg Kamradt, because he basically came up with this whole needle-in-a-haystack concept, and a lot of people, including Jeff Dean, are now talking about it.

First, a quick glance at the needle-in-a-haystack tests across many different modalities, which exercise Gemini 1.5 Pro's ability to retrieve information from its long context. In these charts, green is good and red is not, and they are almost entirely green: 99.7% recall, even out to 10 million tokens, which is crazy to think about. Gemini 1.5 Pro achieves near-perfect needle recall in a million-token context window, and even at 10 million there are only a couple of red dots, so almost perfect.

Here are a couple of examples of what you can do with a context window that large. First, analyzing and understanding complex codebases: they pass in three.js, a 3D JavaScript library with around 100,000 lines of code, plus examples, documentation, and so on, all in a single prompt, and ask for help with the code. With this codebase in context, the system can help the user understand the code and can make modifications to complex demos based on high-level human specifications. I'm going to play this video from Google because it is so cool: it basically asks Gemini 1.5 to work with this 3D world in real time, and it can do that because it has the entire codebase, documentation, and examples in context. You can start imagining the incredible contributions AI is going to make to codebases very soon.
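Before the demo, it's worth seeing what "the whole codebase in a single prompt" looks like mechanically. Here is a small sketch that concatenates a repository's source files into one long prompt string and estimates its token count; the 4-characters-per-token figure is a rough rule of thumb rather than a real tokenizer, and the file extensions are just examples.

```python
from pathlib import Path

def pack_repo(repo_dir: str, extensions=(".js", ".html", ".md")) -> str:
    # Concatenate every matching source file, tagged with its path, into one
    # long string that can be sent to a long-context model as a single prompt.
    parts = []
    for path in sorted(Path(repo_dir).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"// FILE: {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

def rough_token_count(text: str) -> int:
    # Very rough heuristic: about 4 characters per token for English text and code.
    return len(text) // 4

if __name__ == "__main__":
    packed = pack_repo(".")
    print(f"{len(packed):,} characters, roughly {rough_token_count(packed):,} tokens")
    # By that heuristic, a 1M-token window holds on the order of 4 million characters
    # of code and docs; GPT-4 Turbo's 128K window is about an eighth of that.
```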
In the demo, Google walks through long-context understanding, an experimental feature in their newest model, Gemini 1.5 Pro, using the three.js example code, which comes out to over 800,000 tokens. They extracted the code for all of the three.js examples, put it into a single text file, and brought it into Google AI Studio. First they asked the model to find three examples for learning about character animation; the model looked across hundreds of examples and picked out three, one about blending skeletal animations, one about poses, and one about morph targets for facial animations, all good choices for the prompt. In this test the model took around 60 seconds to respond to each prompt, though latency may be higher or lower, since this is an experimental feature they're still optimizing. Next they asked what controls the animations in the Littlest Tokyo demo; the model found that demo and explained that the animations are embedded within the glTF model. Then they wanted to see if it could customize the code: show some code to add a slider to control the speed of the animation, using the kind of GUI the other demos have. On the original three.js site there's no slider; in the modified version it's the same scene, but with a little slider added that can speed up, slow down, or even stop the animation on the fly. It used the GUI library the other demos use, set a parameter called animation speed, and wired it up to the mixer in the scene. Like all generative models, responses aren't always perfect; there isn't actually an init function in this demo like there is in most of the others, but the code it produced did exactly what they wanted.

Next they tried a multimodal input by giving it a screenshot of one of the demos without telling it anything about the screenshot, and simply asked where to find the code for that demo; the model looked through the hundreds of demos and found the one that matched the image. Then they asked the model to change the scene: how can I modify the code to make the terrain flatter? The model zeroed in on one particular function, generateHeight, showed the exact line to tweak, and below the code clearly explained how the change works; in the updated version the terrain is indeed flatter, just as asked. They tried one more code-modification task with the 3D text demo: "I'm looking at the text geometry demo and I want to make a few tweaks. How can I change the text to say 'goldfish' and make the mesh materials look really shiny and metallic?" The model identified the correct demo, showed the precise lines that needed tweaking, and further down explained the material properties, metalness and roughness, and how to change them to get a shiny effect. It definitely pulled off the task; the text looks a lot shinier now. Those are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro.

The most mind-blowing use case to me is being able to analyze and reason about entire videos, not just a few seconds of clips, but movies; you can fit entire movies into context windows of this size. Consider Gemini 1.5 Pro's ability to analyze a film like Buster Keaton's silent, 45-minute Sherlock Jr.: using one frame per second, the movie becomes an input context of 684,000 tokens (45 minutes at one frame per second is about 2,700 frames, so roughly 250 tokens per frame). The model can then answer fairly complex questions about the video content, such as: tell me some key information from the piece of paper that is removed from the person's pocket, and the timecode of that moment. Here's the frame where Sherlock Jr. takes a piece of paper out of his pocket; the video is analyzed, we can see the piece of paper, and here's the model output: the piece of paper removed from the person's pocket is a pawn ticket from Goldman and Co., pawnbrokers. So it can read anything, from any frame, across the entire movie. Really awesome stuff.

And here Matt Shumer says: "On a benchmark measuring precision recall from a full day of audio, Gemini Pro 1.5, when just seeing the audio file directly, far outperforms GPT-4 with a Whisper transcription. This is mind-bending." At the top you can see near-perfect recall against the raw audio, and at the bottom is GPT-4 using Whisper to transcribe the audio into text and then answering questions, and it fails a lot. I know a lot of people have been giving Google grief because many of their models haven't performed all that well, going all the way back to Bard, but Gemini 1.5 seems incredible, and the fact that they released 1.5 just a few weeks after Gemini Pro is very telling about the progress they're making internally.

All right, last, and the thing I'm most excited to test, is Groq. I got access not only to Groq chat, which everybody gets, but also to their API, and where my mind immediately goes is plugging Groq into an agent framework, because the best use for very high tokens-per-second is when AIs talk to each other: I obviously can't read at 500 tokens per second, but two agents working together on something can.
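To make that concrete, here is the shape of the agent-to-agent loop I have in mind. This is a sketch under my own assumptions, not Groq's client code; the `ask_llm(system, prompt)` callback is a placeholder for whatever fast inference endpoint you plug in, and a dummy is used so the loop runs as-is.

```python
def two_agent_loop(ask_llm, task: str, rounds: int = 3) -> list[str]:
    """Two agents pass work back and forth on a task.

    ask_llm(system, prompt) is a placeholder for a fast inference API call
    (for example, an open model served by a low-latency provider); not a real client.
    """
    roles = [
        ("Worker", "You draft or revise a solution to the task you are given."),
        ("Critic", "You point out flaws in the draft and suggest concrete fixes."),
    ]
    transcript = []
    message = task
    for turn in range(rounds * 2):
        name, system = roles[turn % 2]
        reply = ask_llm(system, message)
        transcript.append(f"{name}: {reply}")
        message = reply  # each agent responds to the other's last message
    return transcript

if __name__ == "__main__":
    # Dummy model so the loop runs without any API key or network call.
    fake_llm = lambda system, prompt: f"[{system.split()[1]}] response to: {prompt[:40]}..."
    for line in two_agent_loop(fake_llm, "Write a story about AI gaining consciousness."):
        print(line)
```

At hundreds of tokens per second, each round-trip in a loop like this takes seconds instead of minutes, which is exactly why inference speed matters even when no human is reading the output.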
So check this out: this is Groq chat, at groq.com; if you haven't tried it, please do. Right here we have Mixtral 8x7B, my favorite open-source model. If we enter a prompt, let me just show you how fast this is, because it's kind of absurd, and I'm going to do a full video all about it; in fact, I'm likely going to be talking to their hardware engineers to learn how they achieved this speed with their hardware. OK: "Write a story about AI gaining consciousness. Make it 10,000 words." Oddly enough, it told me that it can't write 10,000 words and wrote a shorter story instead. I don't know why it did that, since the context window is 32,000 tokens, so it should be able to, but right here we can see the tokens per second is enormous. So: "No, make sure to write 10,000 words. Don't tell me you can't." All right, that is definitely not a 10,000-word story, so I'm struggling to get it to write 10,000 words, but again, around 500 tokens per second, so very, very fast. I'm definitely going to test it extensively. In fact, let's try one more thing: Llama 2 70B. Here we only have a 4,000-token limit, so let's try to max that out: "Write a story about AI in 2,500 words." Very, very nice: slower than Mixtral, but still very fast, and this speed is going to unlock a lot of really cool use cases. That's it for today. If you liked this video, please consider giving it a like and subscribing, and I'll see you in the next one.
Info
Channel: Matthew Berman
Views: 116,389
Keywords: nvidia, openai, karpathy, gemini, gemini 1.5, phind, groq, stable diffusion
Id: KzxR2Vcr9CM
Length: 30min 39sec (1839 seconds)
Published: Sun Feb 25 2024