Transforming AI | NVIDIA GTC 2024 Panel Hosted by Jensen Huang

Captions
Please welcome to the stage NVIDIA founder and CEO Jensen Huang.

Apparently, there's a huge line out there, and they're coming into this room in an orderly way, but once they come into this room, it becomes total mayhem. What is that about? Come on, you guys, hurry up, hurry up! I'm going to start! So this room is going to be packed, and there are two other breakout rooms that are going to be packed. It's just that they're taking too long getting in here because they're timid. And the part I just don't understand is that once they get within about 10 feet, all of a sudden chaos ensues. It's very, very nice to see all of you.

The computer has largely remained unchanged for 60 years. The year after my birth, and it had nothing to do with my birth, the modern computer was described by the IBM System/360: the central processing unit, the I/O subsystem, multitasking, the separation of hardware from software, software compatibility across a family of machines, backwards compatibility to protect the investment of software engineers. I just described modern computing. It hasn't changed since 1964. In the late '80s and early '90s, the PC revolution kicked it into turbocharge and democratized computing as we know it, driving the marginal cost of performance down every single year. Every five years, we reduced the cost of computing by about 10 times; every 10 years, by 100 times; every 15 years, a thousand times; every 20 years, 10,000 times. In literally the 20-year era of the PC revolution, computing costs fell by 10,000 times, more than any other valuable commodity in history.

Could you imagine if everything in life, everything that matters, everything that's valuable to you, fell in cost by 10,000 times over basically half a lifetime, the 20 years by the time you became an adult? Something you used to use, like a car, used to be $20,000, and now it costs a dollar. Well, maybe that's Uber. But computing costs dropped tremendously, and then one day it stopped. Not precisely one day, but it stopped. The rate of change stopped. It still improved a little bit every single year, but the rate of change stopped. Well, we worked on another form of computing called accelerated computing. It is not as easy to use; there's nothing easy about accelerated computing, because you have to reformulate the problem from what is originally kind of like a recipe, step by step by step, that you do faster and faster every year, into parallel algorithms. Parallel algorithms are a whole field of science, insanely hard to do. Well, we believed in it anyway, and we believed that if we could accelerate the 1% of code that represents 99% of the runtime, there are some applications where we could make a tremendous difference. We could make something impossible possible, or something that costs a lot of money cost-effective, or something that costs a lot of energy energy-efficient. And so we called it accelerated computing. We worked on it for the entire duration of our company's history, and one application domain after another, we were able to accelerate. The first was computer graphics and video games. We accelerated computer graphics and video games so well that people thought we were a computer games company.
But nonetheless, we kept pursuing it. We realized the value of computer graphics and games: it was simultaneously a large market and a driver of technology innovation. That rarely happens. The confluence of a large market and a technology that is never good enough has the ability to drive incredible technology revolutions. We found it initially with computer graphics and games. Well, to make a long story short, in 2012 we had a first sighting, and that first sighting was AlexNet, the first contact of artificial intelligence with NVIDIA GPUs. And it got our attention, our attention to this field. Several years later, something amazing happened, and it led to today. I'll tell you about that something in a second. That something led to generative AI. Now, you've heard me say generative AI is, of course, incredible. The fact that software can not only recognize an image of a cat and say "cat," it can take the word "cat" and generate an image of a cat. It can take the word "cat" with a few more conditional prompts, like on a surfboard on a sunny day off the coast of Maui, drinking a mai tai, wearing a ski hat, you know, whatever. You just keep adding your prompts, and somehow the generative AI is able to generate that. We have now taught a software program how to understand the meaning of these pixels: to recognize the pixel, understand the meaning of the pixel, and, in fact, generate pixels from the meaning of something. And so this ability to learn from just about any data is incredibly transformative, and it led to today.

You've heard me say that this is the beginning of a new revolution, a new Industrial Revolution, and there's a reason for that. In this new Industrial Revolution, we're producing something that never existed before. As in previous industrial revolutions: in the last one, water comes into a facility, energy is applied to it, this thing called a dynamo goes to work, and it creates an invisible thing of incredible value that we depend on today. Water comes into a building, you basically light it on fire, you boil it, and what comes out the other side is electricity. Water in, electricity out. Magic. Electricity used everywhere, and it created the Industrial Revolution as we know it. A new facility creating a new product of great value. Well, generative AI is a brand-new type of software, and that software has to be produced. Of course, it has to be created first; amazing computer scientists have to go create it. But after that, it's produced. It's produced in volume, in a building with machinery we call GPUs, essentially a dynamo, a large building with machinery inside.

You give it raw material, and this raw material is data, numbers. You give it energy, and this amazing thing comes out. Numbers go in, numbers come out, and the numbers that come out do amazing, unbelievable things. They could be used, of course, in all of the applications that you know, but also in healthcare and drug design, in transportation to make cars drive, in manufacturing and industrials. Every single industry we know will benefit from this new product, this new thing that is being produced. So a brand-new thing that the industries have never seen is going to get produced, and it's going to be produced in facilities and factories the world has never seen before.
AI factories using AI and producing AI, and the AI being used by every industry. So what do you call that? A new Industrial Revolution. None of this existed before, and now we're seeing it play out right in front of us. You don't want to miss this next 10 years. Unbelievable new capabilities will be invented, and it all started at a point in time with some researchers. And so today, I thought we would invite the inventors, the creators of a machine learning model called the Transformer. The way we set it up is kind of like our living room, and there will be very little moderation. And, you know, we were in the back, and I wish you had been there. There were a lot of deep learning jokes, a lot of deep learning jokes, and we're going to see if any of them land, and a lot of arguments. So what I thought I would do is tee up the joking and the arguments, and then we can see where it takes us. Let me now welcome the inventors of the Transformer, the authors of the paper titled "Attention Is All You Need."

Okay, so let's have Ashish Vaswani. Ashish is now the CEO of a brand-new startup company called Essential AI. Noam Shazeer, welcome, Noam. He's also the CEO of a new startup, called Character.AI. Somehow all of their startups have "AI" in the name. NVIDIA also has "AI," just not in the right order. I knew it all along, I knew it all along that I needed the letter "a" and the letter "i" in there; I just didn't know what order they had to be in. Jakob Uszkoreit, Jakob, nice to see you. Jakob is also the CEO of a startup; this is really, really interesting: Inceptive. Okay, Llion Jones, founder and CEO of Sakana AI. Welcome, Aidan Gomez, founder and CEO of Cohere. My goodness. You know Lukasz, come on, Lukasz Kaiser, ladies and gentlemen, the only person who is still an engineer. Lukasz, yeah, you're my hero. Illia Polosukhin, come on up to the stage. NEAR Protocol; he's a co-founder of NEAR Protocol. Okay, and we have a colleague and friend who couldn't make it because of a family emergency, and so our hearts and thoughts are with Niki.

So, first of all, they've actually never been in the same room at the same time. That's true. I guess this work-from-home thing has gotten out of control, but apparently it doesn't stop innovation and invention. So it's great to have you guys here. We're going to talk about the importance and the meaning of the work, the incredible transformative capability of the Transformer, and what it has done to industries. And obviously, as I was saying earlier, everything that we're enjoying today can be traced back to that moment: the fact that we can learn from gigantic amounts of data, sequential data as well as spatial data, to find relationships and patterns, and to create these gigantic models, was really quite transformative.

My first question, and you guys can all dive into it, and we agreed that it is not impolite to cut each other off, talk over each other, disagree with each other, even get out of your chair. We need a whiteboard then. For that, we need a whiteboard, that's right. Nothing is off-limits today. But go back to the beginning.
What were the problems? You know, engineers, we need problems to inspire us. What were the problems that you were struggling with or challenged by that led to this?

I think everybody had a different problem, probably, but for me and the team, we were working on question answering. Very simple: you go to Google, you ask a question, it should give you an answer. And Google has very low latency requirements. So if you want to ship actual models that read, you know, search results, tons of documents, you need something that can process them really quickly. And the models at the time, recurrent networks, just could not do that. Yeah, because there were RNNs, and RNNs had some attention by then, but you know the difference: they need to read one word at a time. That's right. We were generating training data much faster than we could actually train the most advanced architectures on it. And so you actually had simpler architectures, just feed-forward nets with n-grams or so as input features, that, because they trained so much faster on the massive amounts of training data Google had, at least on some problems, basically always outran the much more advanced, much more powerful RNNs of the time. So it seemed like a valuable thing to fix.

Yeah, we were already seeing these scaling laws back in like 2015, and you could see that if you make the model bigger, it just gets smarter. And here is the best problem in the history of the world. It's so simple: you're just predicting the next token, and it's going to get so smart and be able to do a million different things. You just want to scale that up and make it better. And one big frustration was that RNNs were just a pain in the, you know, to deal with, right? And then I overheard these guys talking about, "Hey, let's replace it with convolution or with attention." I was like, "Heck yeah, let's do this." I like to compare it to this: RNNs were like the steam engine, and the Transformer is like internal combustion. We could have done the Industrial Revolution on the steam engine, but it would have just been a pain in the butt, and things went way, way better with internal combustion. So now we have electric vehicles, and now we're just waiting for fusion, right? That's the next wave.

I mean, two lessons I was reminded of constantly, especially during the time of the Transformer: I started tasting some of the bitter lesson when I was in grad school, working on machine translation, where I felt like, "Hey, I think that gradient descent, the way we train these models, is a much better teacher than me. So I'm not going to learn these linguistic rules; I'm just going to ask gradient descent to do everything for me." And the second piece, just quoting from the bitter lesson, was that general architectures that can be scaled are ultimately going to win in the long run. Today it's tokens; tomorrow it's actually the actions we take in a computer, and models will sort of start mimicking our activities and be able to automate a lot of the work we do. So the Transformer, self-attention in particular, as we were discussing, had this quality of being extremely general, and it also made gradient descent happy.
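The contrast the panelists keep drawing, an RNN bound to one word at a time versus self-attention handling the whole sequence through matrix multiplications, can be seen in a minimal sketch. This is an illustrative toy in NumPy with made-up sizes and weights, not the paper's actual implementation:

```python
import numpy as np

T, d = 8, 16                       # sequence length, model width (toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))        # pretend token embeddings

# RNN: an inherently sequential loop; step t must wait for step t-1.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Self-attention: every position attends to every other, all at once,
# using nothing but matrix multiplications and a softmax.
W_q = rng.normal(size=(d, d)) * 0.1
W_k = rng.normal(size=(d, d)) * 0.1
W_v = rng.normal(size=(d, d)) * 0.1
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                  # (T, T) pairwise scores
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
out = weights @ V                              # (T, d), computed in one shot
```

The matmul-heavy second half is exactly the workload GPUs accelerate well, which is the "make accelerators happy" point that comes up next.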
And the second thing it made happy was physics, because something I did learn over time was that matrix multiplications are a good idea. So let's try to make the accelerators happy. And both those things together, this motif, has kept repeating: every single time we add a bunch of rules, gradient descent will one day learn those rules better than you. And this was it. Like, all of deep learning has been building an AI model that's the shape of a GPU, and now we're building an AI model that's the shape of a supercomputer. Yeah, the supercomputer is the model now. Yeah, that's true. Just so you guys know, we're building the supercomputers to be the shape of the models. Yeah, yeah.

Now, what were the problems you guys were solving? Oh, machine translation, yeah, definitely. And it seemed so hard five years ago. Like, you had to gather data; maybe it would translate, maybe it would be slightly wrong. It was at a very basic level. Now, these models, you don't give them any data at all; they just learn to translate. You have this language, that language, and the ability to translate just emerges from the model.

What was the intuition that led to "attention is all you need"? So, I came up with the title, and basically what happened was, at the point where we were looking for a title, we were just doing ablations, and we had very recently started throwing bits of the model away just to see how much worse it would get. And to our surprise, it started getting better, including when we threw all the convolutions away. I'm like, "This is working much better." And that's what was in my mind at the time, so that's where the title comes from. What's intriguing about that is we actually kind of started with that barebones thing, right? And then we added stuff. We added convolutions, and I guess later on we knocked them out. And a lot of the other things, like multi-head attention, were also really super important pieces. But I was watching this movie, I don't know if you watched it, where this guy is in a parallel universe where they don't have the Beatles, they didn't exist anymore. I was wondering what would happen, what would the title be in that universe. I don't know if you watched it, I forget what it's called, "Yesterday" or something like that. Yeah, no idea, sorry. He's got no time, he's trying to build a company.

So, you guys, well, hey, I think this is important. How did "Transformer" come up? What were some of the other choices? Who came up with "Transformer"? Why is it called Transformer? It's an excellent name, by the way. Um, it was you, I think, right? No, no, I liked the name that Jakob had for it. I was like, "That's a name, let's use it." Yeah, I mean, it fits what the model does, right? Every step actually transforms the entire signal it operates on, as opposed to having to iterate over it. By that logic, though, almost all machine learning models are transformers. Yeah, look at that, that's what all machine learning models are becoming: transformers. Before, nobody thought to use the name. And so, I guess, yeah, I thought it was too simple. I was like, "Oh, every..." That's exactly what I thought about it. But then, you know, I was overruled. Everybody thought it was a great name, and they were right. And what was the name you came up with?
There were a lot of names. I mean, there was something called CargoNet. I wrote something where one layer was convolution, one was attention, and one I called recognition or something; that was the feed-forward net. So: convolution, attention, recognition, CargoNet. But I'm happy that now... Carg... That's horrible. I'm glad you were outvoted. Yes, by wise people.

I think the reason it became such a general name is that, you know, in the paper we were concentrating on translation, but we were definitely aware of the fact that we were actually trying to create something very, very general, something that really could transform anything into anything else. And I don't think we predicted how well that would actually work.

Yeah, you know, when Transformers were being used for images, that was kind of surprising. I mean, it's probably logical to you guys, but the fact that you could chunk up the image and tokenize each little part, that was architecturally there very early on. And so when we were building the Tensor2Tensor library, we were really focused on scaling up autoregressive training generally. It wasn't just for language; there were components in there for images, audio, and text, on both the input and output sides. Lukasz said what he was working on was translation, but I think he's underselling himself. All of these ideas that we're starting to see now, of these modalities coming together into one joint model, were there day zero, day ten, in the Transformer repository, because that's what Lukasz was going after. It didn't work then; we were, like, five years ahead. Now it works. But I mean, there was this other paper, "One Model To Learn Them All," and it did use self-attention. Yeah, eventually it started working, but it was really all there very early on, and those ideas were percolating. It took some time. Lukasz's goal was: we have all of these academic datasets, and they go from image to text, text to image, audio to text, text to text. We should train on everything. And that idea is really what drove the scaling effort to model the web, which, you know, has succeeded, and now many of us are doing similar things. So I think that North Star was there on day zero, and it's been really exciting and gratifying to watch it come to fruition. We're actually seeing it happen now.

Yeah, and it's so interesting that so much of knowledge is about translation: image to text, text to image, text to text. You know, tensor to tensor. Yeah, this Transformer idea, this translation idea, is quite universal. And in fact, you're using it for biology. That's right, or maybe something that we like to call biological software, which is maybe an analogy to computer software: it starts its life as a program that you then compile into something that can run on a GPU. In our case, the life of a piece of biological software starts as a specification of some behaviors you want, say, print this much of a protein, that specific protein, in a cell, and then you learn how to translate that, using deep learning, into RNA molecules. That's right, this idea really goes all the way from translating, say, English into computer code, to translating specifications of medicines, hopefully transformational medicines, any day now, into the actual molecules that we then...
And do you guys create a big wet lab that produces all this? You have to run experimentation against nature, right? You really have to verify it. The data does not yet exist. There are tons of extremely valuable genomic data that you can download, still largely openly and publicly available because it's generally publicly funded. But you still need data that speaks clearly and specifically to the phenomenon you're trying to model for a given product, say, for example, something like protein expression for mRNA vaccines. Yeah, it's really quite true. Over in Palo Alto, we have a whole bunch of robots and people in lab coats, both learning researchers and folks who were previously biologists and now think of themselves as pioneers of something new, working on actually creating that data and validating the models that design those molecules.

So, the idea you're saying is that some of the early ideas of translation, a fairly universal learner, universal translation, were there in the beginning. What are some of the major architectural fixes, enhancements, breakthroughs that all of you have seen along the way, that you think are really great additional contributions on top of the base Transformer design?

I think on the inference side, there's been tons of work to speed these models up and make them more efficient. But it still kind of disturbs me how similar to the original form we are. I think the world needs something better than the Transformer. I think all of us here hope it gets succeeded by something that will carry us to a new plateau of performance. And I wanted to ask a question to everyone here: what do you see coming next? That's the exciting step, because I think it is too similar to the thing that was there years ago, right?

Yeah, I think people are surprised by how similar it is, like you said. And people do like to ask me, you know, what is coming next, as if I'll just magically know because I'm on the paper. But the way I answer the question is to point out an important fact about how these things progress: you don't just have to be better, you have to be clearly, obviously better. Because if you're only slightly better, that's not enough to move the entire AI industry to the new thing. So we're stuck on the original model, despite the fact that it's probably not technically the most powerful thing we have right now. It's everyone's toolset, right?

But what are the properties that you guys want to make better? The context window? The token generation ability? You want to make it faster? Well, I'm not sure if you'll like this answer, but they're using too much computation right now. I think they're doing a lot of wasted computation, and we are trying to make that more efficient. Thank you. But actually, it's about allocation, not so much the total amount: it's really about spending the right amount of effort, and ultimately energy, on a given problem. You don't want to spend too much on a problem that's easy, or too little on a problem that's hard and then fail to provide a proper solution. A real example is 2 plus 2: right now, if you enter it into one of these models, it goes through, you know, a trillion parameters, even though computers are perfectly capable of doing that cheaply.
So I think adaptive computation is one of the things that has to come next, so that we know how much computation to spend on a particular problem. Yeah, and that was an immediate follow-up paper, Universal Transformers, which I know a subset of the authors here did, and which targeted exactly that. So these ideas were there, and are still there. Yeah, and the paper a year earlier on mixture of experts, which everybody does now, it's everywhere; it's kind of folded into the Transformer now, but it came before the Transformer.

I actually don't know if folks here know this, but we kind of failed at our original ambition. We started this because we wanted to model the evolution of tokens. It isn't just linear generation: text or code evolves, we iterate, we edit. And that allows us to potentially mimic how humans develop text, and also have them as part of the process, because if you generate it the way humans generate it, they can actually give feedback: "Oh, you didn't tell me." Yeah, so, I mean, all of us read Shannon's papers, so we were like, "No, no, no, let's just do language modeling and perplexity." But that has not happened yet. And I think that's also where we could intelligently organize our computation, right? And that goes for images as well: diffusion models have the interesting property that they iteratively refine and improve. We don't even have that.

And yeah, I mean, there's this fundamental question of what knowledge should exist within the model and what knowledge should exist outside it: retrieval models, and RAG, I guess, is one instance of this. And the same goes for reasoning: what reasoning should be done outside, with symbolic systems, and what reasoning should be done inside the model? It's largely an efficiency argument. I do believe that large models will ultimately learn the circuits to do 2 plus 2, but if you're engaging a trillion parameters to add 2 plus 2, that's inefficient.

Well, in the case of Illia's example, if asked 2 plus 2, the AI should just pick up a calculator, the least amount of energy we know of that can do 2 plus 2. However, if asked, "How did you decide on 2 plus 2?" or "Is 2 plus 2 the right answer?" then it can go into math theory and explain from there. That's right, that's right. I'm pretty sure all of you are creative and smart enough to go pick up a calculator. GPT does this right now, exactly. No, that's right.

I think the model is just too cheap right now, and too small. Yes, it's too small, it's too cheap, because, like Jensen said, you are producing computation that costs something like 10 to the negative 18th dollars per operation. Thank you for creating so much of it. But if you look at a model with, like, half a trillion parameters that's doing, like, a trillion computations per token, that's still, like, a million tokens to the dollar. That's like a hundred times cheaper than going out and buying a paperback book and reading it. It's so cheap. And we have applications that are a million times or more valuable than efficient computation on a giant neural network. I mean, certainly curing cancer and that sort of thing is, but even just talking to anyone: talking to a person, talking to your doctor, lawyer, programmer, you pay like a dollar a token or more. We've got this factor of a million to play with to make it way smarter.
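Noam's arithmetic holds up as a back-of-the-envelope check. A quick sketch using the rough figures quoted on stage; the paperback price and token count are assumed values for the comparison, not anything stated in the conversation:

```python
# Constants from the conversation (rough stage estimates, not measurements).
cost_per_op = 1e-18            # ~dollars per operation
ops_per_token = 1e12           # "a trillion computations per token"
cost_per_token = ops_per_token * cost_per_op   # ~$1e-06 per token
tokens_per_dollar = 1 / cost_per_token         # ~1,000,000 tokens/dollar

# Assumed comparison point: a ~$10 paperback holding ~100,000 tokens.
paperback_tokens_per_dollar = 100_000 / 10     # ~10,000 tokens/dollar

# The model comes out roughly 100x cheaper, matching the claim on stage.
print(tokens_per_dollar / paperback_tokens_per_dollar)  # -> 100.0
```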
That would be so amazing, because sometimes just the right word is going to change the world. Yeah, that's exactly it. I also think that to make it smarter, the right interfaces are essential. How do we actually get the right feedback? How do we decompose the task we're doing in a way that humans can intervene at the right time? If we ultimately want to build models that can mimic us and learn how to solve tasks by watching us, the interface is going to be absolutely crucial.

This might be a great way to do this: you all left Google after you invented the Transformer, you worked on the Transformer and then you started your companies. Could you each just quickly say something about your company and why you decided to start it? Because a lot of the things that you're describing are what your companies are working on.

So, yes, Essential AI. We're really excited about building models that can ultimately learn to solve new tasks at the same level of efficiency as humans: they watch what we do, they're able to understand our intents and goals, and they start mimicking what we're doing. And that's ultimately going to change how we interact with computers and how we work.

Basically, in 2021, one of the big reasons I left was that you couldn't make these models smarter in the vacuum of a lab. You actually have to go out and put them in people's hands, because you need the world to interact with these models, get feedback from them, and make them smarter. So the way to do that is to go outside and build something useful. Learning does require an experiential flywheel, absolutely right. And it was hard to do in the vacuum of a lab; putting something out in the world was easier at the time. Yeah, that's cool.

And, Noam? Yeah, so I co-founded Character.AI in 2021, and you know, the biggest frustration I had at the time was: here's this incredible technology, and it's not getting out to everyone. It's so easy to use, it has so many uses. Can you guys imagine Noam being impatient? The value of this is, like, get it to a billion people and let them do a billion things with it. This is what Zen looks like. He's calm, deep-learning Zen. This is him when he's calm, and you're sitting next to him, like, "Yeah, thank God for giving us this incredible technology." And thank Jensen, and thank everyone. The ultimate goal is to help everybody in the world. You guys all have to go to Character. You've got to go check this out. But I'm serious, you have got to go. Yeah, let's start by doing this for real: let's build something as fast as we can, get it out there, and get billions of people able to use it. And you know, to start with, a lot of people are using it just for fun, or for emotional support or companionship, or to replace entertainment, and it's really working. The growth in the number of people using it is insane. It's really, really working. Congratulations. Thank you. Yeah, that's awesome.

I already said a little bit about biological software, but maybe more about the why, for me personally. In 2021, I co-founded Inceptive out of the realization that this technology can have a much more direct impact on improving people's lives than what we'd had before. The impact was broad, but not very direct.
My first child was born during the pandemic, which certainly gave me a newfound appreciation for the fragility of life. Then, a few weeks after, the AlphaFold 2 results for protein structure prediction came out, winning CASP14. One of the big differences between AlphaFold 2 and AlphaFold 1 was that they had started using the Transformer, replacing the rest of their architecture with it. So it became pretty clear that this stuff was ready for prime time in molecular biology. And then a few weeks after that, the mRNA COVID vaccine efficacy results came out, and it was very clear that mRNA, and RNA in general, can do almost anything in life; with the RNA world hypothesis, there is no limit to what can be achieved with RNA. But for the longest time, it was the neglected stepchild of molecular biology. So it just seemed like almost a moral obligation: this has to happen, and somebody has to go do it.

I've always thought of it as drug design, but I love that you think of it as programming proteins, programming biology. It makes so much more sense, actually. Yeah, I love the concept of it. And of course, this compiler would have to be learned; we're obviously not going to write this compiler by hand. That's right. And if you're going to learn this compiler, obviously you need a laboratory to test it and generate the data; without that, the flywheel can't work. I'm pretty excited, and I can see it happening.

Llion, yeah. So I was the last one to leave. It's still very early days, but I can tell you what's going on so far. Yeah, I co-founded a company called Sakana AI. What does "Sakana" mean, anyway? It's a bit weird: it means fish. It sounds very weird in English, right? Call your company "fish" and you're off to a great start. Thankfully, the Japanese seem to like it. The reason we named it "fish" is that it's supposed to be evocative of a school of fish: we want to do nature-inspired artificial intelligence. The analogy is that a small fish can be quite simple, but when you bring a lot of simple things together, they become quite complicated. But people are not entirely sure what we mean when we say "nature-inspired," so I want to dive into that a little bit. The central philosophy, what I try to push on researchers when they join, is that learning always wins. Any time you go from humans trying to engineer something by hand to just using computers to search through the space, you always win. The deep learning revolution itself was an example of that: we went from hand-engineering features to just learning them, and it worked so much better. So, you know, with the resources in this room, the mad amount of computation that NVIDIA has given us, I want to remind you that there are other things we can do apart from gradient descent: we can use it to search through the space of what we're currently hand-engineering.

And actually, I would like to tell you that, I think today or tomorrow, there's a bit of a time-difference problem, we're making an announcement. I'm sort of surprised that we have something to announce so early, but we have some research that we're going to be open-sourcing. And it's very on-brand, because what's in vogue right now is model merging. But it's being done by hand: we're choosing the merging algorithms by hand.
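A toy sketch of what searching the merge space can look like, before Llion describes the real announcement next. Everything here is hypothetical, including the stand-in fitness function; this is evolved per-layer interpolation between two pretend checkpoints, not Sakana AI's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers = 4
# Two pretend "checkpoints": one weight matrix per layer.
model_a = [rng.normal(size=(8, 8)) for _ in range(n_layers)]
model_b = [rng.normal(size=(8, 8)) for _ in range(n_layers)]

def merge(alphas):
    # Per-layer linear interpolation between the two parent models.
    return [a * wa + (1 - a) * wb
            for a, wa, wb in zip(alphas, model_a, model_b)]

def fitness(genome):
    # Stand-in objective; a real system would score a task benchmark.
    target = 0.3 * sum(w.sum() for w in model_a) + \
             0.7 * sum(w.sum() for w in model_b)
    return -abs(sum(w.sum() for w in merge(genome)) - target)

population = [rng.uniform(0, 1, n_layers) for _ in range(16)]
for _ in range(30):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:4]                         # keep the fittest genomes
    children = [np.clip(p + rng.normal(scale=0.1, size=n_layers), 0, 1)
                for p in parents for _ in range(3)]
    population = parents + children              # 4 + 12 = 16 again

best = max(population, key=fitness)
print("evolved per-layer merge weights:", np.round(best, 2))
```

No gradients anywhere: the search is pure selection and mutation over merge coefficients, which is the "other things we can do apart from gradient descent" point.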
So what we did is we took all the models available on Hugging Face and then used a very large amount of computation to run evolutionary search through the space of how to merge and stack the layers. And let's just say it worked very, very well. So keep a lookout for that. Wow, okay, all right, that's fantastic. It makes a lot of sense, actually. I'm also under strict orders to say, "We're hiring." That's fantastic, good job.

Aidan? Yeah, I think my reason for starting Cohere was very similar to Noam's. I saw a technology that I thought could change the world: computers started speaking back to us, they gained a new modality. So I thought that should change everything: every single product, the way we work, the way we interact with all the stuff we've built on top of computers. And it wasn't changing. There was stasis, and there was this dissonance between the technology we were faced with, for those of us in the know, and what it was doing out in the world. So I wanted to close that gap. The way I've gone about it is a bit different than Noam's, in the sense that Cohere builds for enterprises. We create a platform for every enterprise to adopt and integrate into their products, as opposed to going direct to consumers. But that's the way we want to push the technology out there, make it more accessible, make it cheaper, and help companies adopt it. You know, the thing I really love is, this is what Aidan looks like when he's super excited, and that's what Noam looks like when he's super calm. I just love that. Very true. Cohere, okay.

Lukasz. Well, you did not found a company, I know, but you went on to change the world. Go ahead. Yeah, I finally joined OpenAI after some time. You know, when Al Capone was asked why he robbed banks, he said, "That's where the money is." Well, at the time I joined, that's where the best Transformers were: at OpenAI. It's a lot of fun at the company. We know you can take a ton of data and a ton of compute and make something nice. And I still hope we can remove the "ton of data" part; we'll just need even more compute. Yeah, sorry. So, you do yours, and then I want to ask these guys the next question.

Yeah, so I was actually the first one to leave, midway through. And that's because, kind of similar to what others have said, I strongly believed that the way we're going to make progress, you know, software is eating the world and machine learning is eating software, is to teach machines to code, so that you're able to actually generate software and transform everyone's access to it. Now, this was 2017, and it was a little bit too early; we did not have good enough compute yet at the time. So what we tried to do instead was coordinate people to generate more data. As a startup, you actually have this ability to put something in front of users and also incentivize them. And we ended up realizing we needed a new basic primitive, which is programmable money, because programmable money is what allows us to coordinate people at scale. So we ended up building the protocol, which is a blockchain, which has been running since 2020. It has the most users in the blockchain space, with multiple millions of daily users who don't even know they're using a blockchain, but they are actually interfacing with programmable money, programmable value.
And now we're starting to use that to bring some of those tools back to generating more data. And I think, fundamentally, I mean, it's non-controversial in this group, but it's probably controversial elsewhere: copyright, as a technology from the 1700s, will need to change. We have a new generative age in front of us, and the way we are rewarding creators right now is broken. The only way to fix that is to leverage programmable money, programmable value, and blockchain. And so one of the things we're working on is creating a novel way for people to contribute data that models can then learn from. That's super cool. And then you'll be able to build a new positive feedback loop into everything that we're doing, and there's a great new economy on top of it. We've got programmable humans, we've got programmable proteins, we've got programmable money. I love this.

And so one of the questions that people have is: the current generation of GPT models have training datasets that are, you know, 10 trillion tokens large, which represents approximately the size of the internet, everything you can scrape off the internet freely. So what's next? What kinds of new model technologies have to be explored, like reasoning, you know, and so on? And I'll let you guys talk about that. And where would the data come from?

From interactions. It needs to come from interaction with users. And for that, you need massive platforms to actually attract people; you need economic value that people get from this. And then, on the back end, you can funnel it to all of the models so they actually become smarter. You can do that to make a model even better. But how do you get to that incredible pre-trained model, the starting point that everybody would want to interact with? Is there a way for models to interact with each other through reinforcement learning? Are there synthetic data generation techniques? Like, you know, there's all of this, right? I think, between all of us, we're working on every one of those techniques, probably.

Yeah, I mean, I think the next big thing that's coming is reasoning. A lot of people have realized this, and a lot of people are working on it. But again, a lot of it is being hand-engineered right now: we're writing prompts by hand and then trying to get the models to interact in the way we think they should. And I, of course, think that we should actually be searching through that space and learning how to wire these things together, to get the really powerful reasoning that we want.

Another way of thinking about it is that models that are supposed to generate things we want to consume as humans, media we would like to consume, should be trained on all the stimuli we can consume. So basically any type of video, audio, any way of observing the world, 3D information, spatial information, spatiotemporal information, should all just be dumped in there.

I'm also not sure everyone understands that reasoning and learning from little data are very related, because if you have a model that can do reasoning, then with a little bit of data it does all of this processing, works out why one thing follows from another.
It can put a lot of computation into that, and then, you know, the answer comes out, and it generalizes from way less data because of all the computation it put into reasoning. It's like System 2 thinking, in human terms. And from there, ideally, you want it to design its own experiments, so it collects the most impactful data for its reasoning and can keep searching. I do think that reasoning, when we figure it out, will dramatically reduce the amount of data you need. But then the quality of the data will matter much more, which is where all the interactions with the real world and with people come in. So I do think there will be a new age where we still pre-train on some trillion tokens, but the things that matter, the high-quality things, will make it easier to give people money back for contributing that data, for pretty much teaching machines to become better and better. Yeah, a person has seen only like a billion tokens, and people learn pretty well, so there's an existence proof here. Yes, yeah, that's right.

I would also argue that a lot of the progress in the field has been made because of benchmarks and evals. So, like, what is the grade-school-mathematics analog for, say, automation? Breaking down real-world, large-scale tasks into simpler gradations is important, because our models can probably already accomplish some of them, and they can be deployed and get more data. Once that loop is closed, they can take on more complex tasks, partly because they're also watching, observing what people are doing, which gives them more data, and then you can give them higher-order primitives, more abstract tasks. So I do feel that measuring progress, and making progress, is also going to require breaking things down, creating the kind of science we've built with some of the evals: the science of automation, or the science of interaction, or the science of code generation. And you can't do good engineering without measurement systems, exactly, really important. Yeah, yeah.

And so here, I've got a question for you guys. What are the three questions you want to ask each other? Okay, we're just going to fire off the first question. So, state-space models: what do you think? Awesome? Too complicated? Not elegant enough yet?

Oh, okay, wow. Well, the funny thing about those is that, you know, with state-based models, we remember the pre-Transformer age, right? But a lot of young researchers don't. When I looked at the paper for the first time, it was very obvious to me that it was a very poor man's LSTM. So all the problems we were having back when we were trying to get those things working are surely also in these models. But it seems that because people have sort of forgotten the pre-Transformer stuff, they have to rediscover all the problems. So my guess is that these things will be important, and we'll probably end up with a hybrid model.

Well, Transformers have their recurrent step too. The fun fact I find is that nobody's actually playing with the fact that you can run Transformers for a variable number of steps and train that differently.
So exploring what we can do with recurrence is interesting, because what this model does is, with every step, it augments more information for every token, resolves things, and does reasoning. So obviously, if you only have six steps, you can actually only do five steps of reasoning, because the first step is just gathering context. And sometimes you don't need six steps, and sometimes you need more. So what are the different recurrences you can do on that? And then the other one: how do we get away from tokens? That, exactly, is the pain of our existence.

I mean, with recurrence, I have this personal belief that we have never truly learned how to train recurrent models with gradient descent. Yeah, that's right. And maybe it's just impossible. I mean, LSTMs did poorly; it worked a little bit. Then SSMs worked even better because they're more structured for it. But maybe, fundamentally, you need to train it not with gradient descent, but in a different way. We humans are, in some sense, recurrent: we live in time, our brains update in time, but it's not at all clear that we're trained with backpropagation, probably not. So maybe there is a way; it's just not gradient descent, and that's why it's been so hard to figure out.

Well, guys, it's been so great spending time with you all. I really hope that you get together every now and then, and we'll see what amazing magic can come out of your interactions next time. We have a whole industry that is grateful for the work you did. Thank you. I appreciate that. Thank you. Thank you.

And I'm just going to do one. Could you guys just give me one? I'm going to do this one, and I'll give everybody else theirs as we leave. This one is for Ashish Vaswani. You transformed the world. Okay, thank you. And this one is for Jensen Huang. Here we go. Beautiful. Thank you very much. Thank you, Noam. Thank you very much. Good job. Thank you, Lukasz. Thank you, Illia. Thank you all. All right, guys, thank you. Thanks for coming.
Info
Channel: NVIDIA Developer
Views: 31,622
Id: hC_qASRcBhU
Length: 53min 47sec (3227 seconds)
Published: Mon Apr 08 2024