The Future of AI and the Path to AGI - David Luan & Bryan Catanzaro | NVIDIA GTC 2024

Captions
So welcome to the talk, The Future of AI, Part 2: Artificial General Intelligence, by two of my favorite people, Bryan Catanzaro and David Luan. I welcome you onto the stage. I'll introduce you. So David is CEO and co-founder at Adept, and their company is building AI agents for knowledge workers. Previously, he was VP of Engineering at OpenAI, where he oversaw research on language, supercomputing, reinforcement learning, safety, and policy, and where his teams shipped GPT, CLIP, and DALL-E. And before that, he co-led a team at Google Brain. Welcome, David. Thanks. And Bryan is the VP of the Applied Deep Learning Research team at NVIDIA. They work on multimodal language modeling, chip design, audio, speech, graphics, and vision, and continue to find practical new ways to use AI for NVIDIA's products and workflows. While at NVIDIA, Bryan has helped create Pix2PixHD, DLSS, Megatron, cuDNN, Pascaline, WaveGlow, and DeepSpeech. Welcome. I'll let you guys take it from there. Thank you. Well, it's so great to see everyone here. Welcome to GTC. It's been a while since we had one in person, and it feels incredible to see you all here. So thank you for coming, and I hope we're going to have an interesting discussion today. David and I are friends, and we're just going to be chatting about the work that we're each doing and where we think it's going. Hopefully, we'll have some time for questions at the end. Yeah, super great to be here with you, Bryan. I feel like you've made so many anchor contributions to the field, so I think this is going to be a lot of fun to get to grill you a little bit about some of the things you believe over the next 45 minutes or so. Don't grill me too hard. All right, that's good. So I guess just to start off, I think NVIDIA, you have driven NVIDIA's AI efforts for quite some time. I'm curious, how would you describe the goals of NVIDIA's AI training and research programs that you've been overseeing? Yeah, it turns out that NVIDIA is working on our own AI efforts, and it's something that I'm very excited about, and something that I'm hoping is going to continue to develop. And I think there's two really strategic reasons why NVIDIA is building its own AI. The first has to do with the nature of accelerated computing. So the value that NVIDIA provides when we sell systems for AI is in speed. And having that delivered to the engineers and researchers around the world that are creating AI requires us to understand the process of creating AI pretty deeply. There's a lot of things about the structure of networks. How do we use low-precision arithmetic, sparsity? How do we deal with networking? And all the various software stacks, compilers, libraries, frameworks, communication in the network, systems like Grace Hopper, where we have CPUs that are coupled to GPUs in new ways. And all of these things, there's so many choices. And the soul of NVIDIA's work as an accelerated computing company is to make those choices, but that requires us to actually understand what is being accelerated really deeply. I always like to joke that accelerated computing actually means decelerated computing for almost everything. And the reason for that is that if you just say, hey, I'm going to make a computer and it's going to be fast, that's not really saying much. All computers try to be fast, right? So the thing that makes accelerated computing different is that it's specialized. But then that question of what do we specialize for becomes essential.
And the only way that we can build the systems of the future is to be building AI ourselves so that we understand what to build. So that's the first reason. The second reason, I think, has to do with the opportunities as AI develops around the world. I believe that AI is going to impact the world economy in every sector, in every company. But how is it going to do that? Because there's a lot of specialized skills that need to be developed and also an enormous amount of compute and data and resources that get put into building AI. Not every company is going to be able to invest in that. And when I think about NVIDIA and our business of supporting the world, you know, NVIDIA is able to partner with every company, old and young, large and small, in finance or in consumer product goods or in technology. You know, we're able to help every company incorporate technology into the beating heart of their business in a way that preserves their identity and allows them to take advantage of their own unique ideas and market position to change the world. And I think in an era where AI is changing everything, it makes sense for AI to be part of NVIDIA's platform. And so that's the second reason why we're developing AI. So you have had a front row seat, basically, to the enormous gains just due to things like model scale over the last couple of years. I'm curious, you know, actually zooming way out for the audience, I think a lot of people may be familiar. But, you know, how do you think about scaling laws? Like, what do they mean for AI? And do you think that's going to continue to hold? Yeah, well, I've been betting on scaling AI now for 20 years, and it's been a good bet. You're one of the original GPU programmers way back in the day, right? Yeah, it's a long time ago. Back when I was a grad student at Berkeley, I was doing machine learning on GPUs and published a paper at ICML back in 2008 on using GPUs to train models so that we could scale them. And I actually got this response from a bunch of people at ICML, this machine learning conference. They were like, what are you doing here? Like, everything that we're doing here is new mathematical formulations of machine learning that allow experiments to be run, you know, new kinds of experiments to be run by a grad student on a laptop. And the data sets that at the time people were using might have, like, a few hundred data points in it, and they might be, like, a few dimensions big. So it was an era of small-scale machine learning, and there was still lots of interesting things happening. But, you know, I believed that scale in data and compute was going to change the world. And now looking back, I think that's been clear. But it's also kind of a disappointing feeling for many people working in AI, this idea that data and compute is all you need. Because, like, we want to believe actually what we need is more PhDs in probability theory, because that's really fun. It's really fun to work on probability theory, but the idea that actually we just need enormous computers and enormous data sets, that doesn't feel as great. But I think we need both. Obviously, I love PhDs in probability theory, but I feel like the foundation that has been moving AI forward for the last several decades has been scale. And I don't think that we're seeing the end of that yet. Yeah, I feel totally the same way, by the way. It's like, this is talk I oftentimes give that basically outlines the different eras of deep learning. 
Like, everything before 2012, I loosely just lump into a prehistory. And then, like, 2012 to 2017 was, like, you and your three best friends write a research paper that changes the world. And after 2017, though, like, after the Transformer, after learning how we map these architectures really efficiently to hardware, then it's just really become a data and scale game. And sometimes people ask me, you know, should I go leave and get a PhD? Should I go think about, like, how I can make new algorithmic advancements? I definitely think some people should continue to consider that kind of stuff. But on net, like, even if you go back to look at the initial AlexNet paper, right, like, people thought that that was really an idea shift. But it was really, like, Alex Krizhevsky sitting in a corner figuring out how to map, like, convolutional neural networks efficiently to the GTX 580. Right. The two of them, actually. The two of them. He was very pioneering in the sense that he, like, built a neural network specifically for the system that he had, which had two GPUs. So he had to, like, partition it in this very strange way. And the systems work underlay the result that he got. And I think it's been true for so many big, big results. Like, back when I was at OpenAI, when we did GPT-2, when Alec Radford and I were writing the paper for that, we had long sections about, you know, all the evaluations. And, like, a short section about how we unified all these tasks into just predict the next token. But the modeling section was, like, a paragraph long. It's like, we used a vanilla decoder-only transformer with, like, these couple of configurations. And we're just sitting there being, like, the academic community is going to roast us. Because they'll say there's no novelty. And for a long time, they did. And they did. They did. And I just, like, keep on seeing this happen over and over again. That, like, the new metagame that people need to play to actually get advancements in AI is, like, poo-pooed by the current incumbents. And I think we're seeing that again in this era. I mean, it's a little different from some of the more broad platforms work that you all do. But, like, for us, at least, one of the things that we really believe in at Adept is that this next era of AI is actually going to be about getting product right. And doing the correct co-design of product and the research objective. And having a lot of new ideas and research actually flow from what doesn't work for customers. And I feel like that's actually another change from the way that, like, for example, like, my old team at OpenAI or my old team at Google Research would think about the problem. Yeah. Hey, can we back up a second and you explain, like, what is Adept and what is Adept doing? Yeah. Oh, so Adept is kind of an interesting company. We're configured a little differently from the other sort of startup labs that you all know about, like, OpenAI and Anthropic and Mistral and stuff like that. What we do at Adept is we have a broader north star that's both a product and research north star. And it's: can we train an AI agent that can do anything a human can do on a computer? So how can we build models that don't just read and write text or understand images, but that let you give them a natural language instruction and have them do whatever steps are needed on your machine, with the software that you already use at work, to go achieve that goal?
So, like, simple things like, you know, take this invoice that showed up in my email and put it into QuickBooks, or find six different ways that I might be able to plan out this particular trip that my team needs to go on, and have the model actually just actuate your machine like a human to go get that stuff done. I think what we see a lot is that, like, you know, as enterprises have adopted these LLMs in particular, right? People always use them for summarization, text generation. Those things seem to work. And the moment those things start working in a company, they're like, OK, great. How can I actually hand off whole workflows from my team to a model such that they could be augmented? And that's basically the problem that we've been trying to solve. In order to solve workflows, you need to be able to solve things that look much more like agents, which I'm really excited about, because if you start working on agents, all of a sudden you get to bring all the richness of the, like, broader RL literature, all the work that happened at DeepMind in the mid 2010s around, like, beating humans at Go and all that stuff. You get to bring those to bear in the LLM era, which is really cool. Wow. I feel like what you're articulating is a future where humans are being augmented by these models, where, like, the goal of these models is to help people get things done. I don't know. How do you think about the way that humans and agents are going to coexist? Yeah, this is like a really strong part of what I really believe in from a mission perspective, which is that, like, you could frame work in this space as being, hey, how do I just outperform people and then replace them at tasks? But I believe really strongly that the much more interesting and correct path for us to focus on is exactly as Bryan said. It's like, how do you build AI systems that are actually here to augment people? And I think the way that we make that happen is we work on tasks that are like 80 percent doable by these models on purpose, because that way you get this 20 percent of human supervision where they oversee what the models do. Like, for example, like one of our customers is a logistics company that uses Adept basically to handle the lifecycle of containers. People on their team log on to their platform and there's like dozens or hundreds of shipping containers that need to be tracked, and they need to figure out, like, have they cleared customs, all that stuff that's like entirely done by hand now. But the way that they now use Adept is Adept in the background basically goes and visits all the different software tools required to understand where those containers are, and then gives the human team a really easy way to supervise whether or not Adept did a great job on this whole mass of shipping containers. And so by doing this, we've basically changed the role where humans are able to solve the harder problems. Like, OK, this thing didn't clear customs. We have to go fix it. And then also gives our model feedback about how it can do better next time. And so I think like building these like data flywheels, these data loops by combining the product side with the actual AI R&D side just helps your models get way better, because if you're just working in a pure replacement world, by definition, it's a lights off automation process. You're not getting feedback and your models never get smarter.
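To make the computer-use agent loop described above a bit more concrete, here is a minimal sketch in Python. It is not Adept's actual system; the helper names (observe_screen, ask_model, execute) and the Action type are hypothetical placeholders for whatever perception, planning, and actuation components a real agent would use.

```python
# A highly simplified sketch of a computer-use agent loop, with placeholder
# helpers. All names here are hypothetical, not any real product's API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "done"
    argument: str

def observe_screen() -> str:
    """Placeholder: a real agent would capture the UI state (DOM, pixels, etc.)."""
    return "invoice_inbox_open"

def ask_model(goal: str, observation: str, history: list) -> Action:
    """Placeholder: a real agent would call a multimodal model to pick the next step."""
    return Action("done", "") if history else Action("click", "Export to QuickBooks")

def execute(action: Action) -> None:
    """Placeholder: a real agent would actuate the mouse/keyboard or call an API."""
    print(f"executing {action.kind}: {action.argument}")

def run_agent(goal: str, max_steps: int = 10) -> None:
    history = []
    for _ in range(max_steps):
        action = ask_model(goal, observe_screen(), history)
        if action.kind == "done":      # model decides the goal is complete
            break
        execute(action)
        history.append(action)

run_agent("Take the invoice in my email and put it into QuickBooks")
```

The structural point is just the loop: observe the machine, ask the model for the next step, act, and stop when the model reports the goal is done, with a human able to review the result at the end.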
So I think there's also a better way to get stronger capabilities as well. Yeah. Absolutely. Yeah. But I think, you know, on that note, one of the big challenges always is, like, I think everyone aspires to build data loops, right? Because over time these days, as you continue to scale up these big models, people pour huge amounts of compute into these base LLMs and they get smarter. There's still this giant missing piece of what's specific to you and your company and your customers. And so, like, we've been talking a lot about, like, what is the role of privacy and private data and all that stuff? And I know you've been thinking a lot about that. I'm curious how you think that's going to play out over the next couple of years. Yeah, well, I think we are reaching the end of an era right now with large language models, which is the end of sort of easily accessible tokens. You know, to train one of these large language models takes on the order of tens of trillions of tokens of text. And it turns out that that's about the number of tokens that humanity has written, at least that's available that we can get to on the internet in every language, you know, all put together, including programming languages. And so this is an absolutely astonishing amount of data, right? So we're training these models basically to read the entire recorded output of humanity's intellectual work. And then we're hoping that the model, after reading all of that, is going to remember some of it and is going to be able to use it to reason to solve problems. And the fact that that actually works in some ways is kind of surprising. It's really exciting. You know, and it's one of the things that, you know, sometimes I wake up in the morning, I'm just like pinching myself, like, wow, I can't believe that this like crazy thing that we as computer scientists tried to do of like find all text that humans ever wrote and then train a model on it, that that actually leads to a thing that can help people solve problems if it's fine-tuned and supervised in the right way. So we've been pushing this, you know, for the past few years, really since, I guess, GPT-2. I think GPT-2 really kicked off that search for larger and larger data sets and models. And, you know, the progress has been really incredible, as you know, but there just isn't more text to read, right? Like there's just not more. And yet we know that our models are not actually done, right? There are many problems that our models right now just have no hope of solving. I love that you brought up the customs example, right? So like getting something through customs, it doesn't seem intellectually very difficult, but actually there's a lot of sophistication; you have to understand how these different systems work. And the rules are written down in kind of vague ways. And like there's negotiation that's happening. And different companies have different rules. So it's like there's no one-size-fits-all procedure for getting that done. That's right. And so it's clear to me that the future of these models has a lot more to do with the kinds of data that you're using to train the model or to supervise it, to fine-tune it, right? So we're going to need to teach the models very specific things rather than the general thing of read the entire internet and then we're going to do RLHF or supervised fine-tuning with a little bit of human feedback about like some pretty basic kinds of problem solving.
That is going to need to shift towards something more specialized, something more in-depth. And I think it's pretty clear that data quality and, sort of, the purpose of the data are going to matter much more in the future than they have. And, you know, today I think it's true that the world's most valuable data is also the world's most protected data. For example, like if you think about my own personal valuable data, you know, like we all have things that we protect. Like maybe my text messages. You know, I wouldn't like those to be public. Or my medical records. Or my calls, you know, to my family members and friends. And yet if there was a model that saw my life in that detail, it probably would be a really great assistant for me, right? And so the most valuable tokens to me are also the most heavily guarded tokens. And I think that's true for businesses as well. So I think, you know, my personal belief is that every business is founded on a secret. It's usually the kind of secret that, you know, Jensen Huang can shout from the rooftops for 30 years. Like, hey, accelerated computing. It's a thing. But the world doesn't understand it, right? Like that's the thing about a good secret. Is that even when you explain it, there's something about it that, you know, is unique. Like you have a unique way of thinking about the world. You can explain it, but it's still yours. And other people think you're crazy, usually. They often think you're crazy. They don't understand, like, how this would matter. But, you know, I believe every company, not just tech companies, but every company has something unique about it. Some secret, you know, that is kind of the core of the purpose of the company, its mission, or its market position, or the way that it goes about problem solving, or maybe culturally, you know, how is the company held together? These are enormously valuable, and yet they could never be public, right? So the act of, like, exporting all of your most secret data, basically exposing the beating heart of your business to an agent, actually requires a lot of sort of data provenance and also safeguarding, because, you know, these models, as they learn from this very valuable data, they're going to become very powerful. But then, you know, the question is how are we going to use them to augment the work that we're doing? And so to me, what that says is that we're going to be entering an era of increased specialization where entities, companies, people are going to be able to use their own data that's very, very valuable, but very protected, and combine that with these models to make agents that are actually super useful. Yeah, I think that's got to be the way it's going to play out. Just to layer on my own perception of how the last couple years have played out in that particular space, like, I remember, I think back in maybe 2018-ish, I was at a bar in Noe Valley with this guy, Durk Kingma, who invented the variational autoencoder. Really cool dude, and we're just catching up on the state of research. He had just left OpenAI to go to Google Research, and he was just like, you know, David, like, I feel like this whole behavioral cloning thing has a long way to run, and it might just end up working really well. And I was like, oh, well, what do you mean?
He's like, well, maybe the critical path to general intelligence isn't actually that you need to go solve this whole crazy RL problem and learn every possible behavior from scratch, including language, from running simulated agents around in virtual environments. Maybe the right answer is you just clone everything that people have ever done in their lives, and throw all the weights of that into one model. And that's exactly what we're doing now, right? With LLMs, we're just doing that for text. With multimodal models, we're doing that for images and text, or audio and text, or all of YouTube and text. We're just training these models that, like, simply just predict, given the context so far, like, what is actually going to be the most plausible thing a human in a similar situation would have done. And so it's really cool that that works at all, but I think there's a couple of corollaries to that, one of which is that these models are, one, only as smart as the smartest data in their training set, really. Like, they have some generalization capabilities, but, like, anything that is true new knowledge discovery under the training objectives we have right now is actually going to be penalized by the model, right? Because it doesn't match anything in the training distribution that you put in. And also, these models end up basically learning how to compress all the text or images or whatever that you put in to go train them on. So if you have a bunch of crappy data, the model is just wasting so many parameters on that kind of thing, right? So I think the combination of those two things really points you in the right direction. I have a joke about that, which is, like, we were training a model, this was probably five years ago, and, like, our model was diverging. And we couldn't figure out why. And it turned out that we had downloaded some web pages where people were drawing pictures with, like, ASCII art and emojis. And so we were, like, feeding those tokens into our model as if they were English tokens. And our model wasn't big enough to kind of understand that this was a different language, a language of, like, ASCII art. And so it just exploded. So just, like, P of, like, pound sign was, like, extremely high. The model just couldn't, at the time, you know, five years ago, it could not learn how to draw ASCII art and learn the English language at the same time. It's so funny. I think, like, we've all sort of built up these, like, battle scars of, like, stupid, like, quantities of data thrown at these things. I remember for one of the GPTs, I forget which one, it turned out part of the corpus was just, like, pages and pages and pages of Canon printer serial numbers that we hadn't done a good job of filtering out. It just, it really does not end. And so that's actually why, like, I feel like, you know, going back to the private data thing, right, like, part of the goal for Adept is, like, you know, we're training these agents to do work on your computer. We need to learn from the smartest humans possible doing the hardest tasks. Because if you don't have that kind of data, which is not public data, it's not sitting around on the internet, then it's really hard to push the increased capabilities of your base models.
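As an aside, the "clone everything people have ever done" objective described above is, mechanically, just next-token prediction trained with a cross-entropy loss. Here is a minimal sketch of that objective, using a toy GRU as a stand-in for a real transformer and random integers as a stand-in corpus; it is only meant to show the shape of the loss, not anyone's production training code.

```python
# Minimal sketch of the next-token prediction objective (behavioral cloning on text).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNextTokenModel(nn.Module):
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # toy stand-in for a transformer
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                # logits: (batch, seq_len, vocab_size)

model = TinyNextTokenModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, 256, (8, 33))          # stand-in "corpus" batch
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # target is just the input shifted by one

logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"next-token cross-entropy: {loss.item():.3f}")
```

Note that the loss pushes probability toward whatever token actually came next, regardless of how useful that token is, which is exactly why junk data (ASCII art, printer serial numbers) soaks up model capacity.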
And so I think, like, you know, there's lots of interesting work that you can now bring to bear on this particular domain in the agent space of controlling your computer that helps you sidestep that a little bit. Because in the usual text LLM domain, you don't really have a simulator. And because you don't really have a simulator, you can't do as much interesting work, like, for example, one of the things that we've been spending a lot of time thinking about is self-play, right? So how can you train a model that can use your computer, that can also scrutinize its own decisions, and let you spend compute at this sort of post-training time to collect new experiences about how you might do things on your machine, and take the ones that are successful and train on those, and build loops like that. And, like, in addition to, like, you know, solving these, like, specialized models with private data problems, the other set of problems I'm really excited about is that I think in the next year or two, we're just going to see tremendous gains to AI capability in the post-training step, not just in the pre-training step that people know and love today. Yeah. Yeah. Well, I mean, I do think that the post-training step has already been shown to be enormously important, right? Like, if you just take a raw language model without doing SFT, RLHF, trying to align it to human preferences and give it some problem-solving capability, it turns out that the language model isn't actually nearly as helpful, right? And so I think we're seeing a lot of progress happening when we figure out how to make something that's generally smart, but then we specialize it to try to do a thing that's helpful, right? And then the question is, you know, how is that going to continue going forward? So I have kind of a spicy question for you, and I don't know the answer to this question, but I have an opinion. So do you think that the way that these problems are going to be solved is mostly going to be through general intelligence, or do you think it's mostly going to be through specialization? Or is that a stupid dichotomy and we shouldn't ask that question? No, I think it's a good question. I feel like, so my experience at least, and I'd love to flip it on you after I give my answer, is that the quality of your raw pre-trained model sort of sets the ceiling for behaviors and intelligence that you might see, regardless of what you do after the fact, right? So you kind of want to make sure during the pre-training stage that you have support in the training distribution for most of the tasks that you're going to care about downstream. And then I think about everything that happens after that as really like teaching specialized rules, teaching specialized knowledge, teaching, like, hey, when collapsing the wave function at time step X, I could do one of N things, because in my training set people did it N different ways, to, like, hey, at my company, I do it this one particular way. So push up the likelihood of just, like, the next step according to that particular way. I feel like that's really the role of everything that happens during the post-training phase.
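One simple concrete version of the post-training, self-play-style loop being described here is rejection-sampling fine-tuning: sample several attempts per task, keep the ones that a reward signal accepts, and train on those. The sketch below assumes hypothetical helpers (sample_rollout, is_successful, finetune_on); it is an illustration of the general idea, not any particular lab's method.

```python
# Sketch of a generate -> filter -> fine-tune post-training round, with stubbed helpers.
import random

def sample_rollout(model, task):
    """Placeholder: sample one attempted trajectory/solution from the model."""
    return {"task": task, "steps": [random.random() for _ in range(3)]}

def is_successful(rollout) -> bool:
    """Placeholder: a reward signal, e.g. a checker, unit test, or human review."""
    return sum(rollout["steps"]) > 1.5

def finetune_on(model, rollouts):
    """Placeholder: one supervised fine-tuning pass on the accepted rollouts."""
    print(f"fine-tuning on {len(rollouts)} successful rollouts")
    return model

def post_training_round(model, tasks, samples_per_task=8):
    accepted = []
    for task in tasks:
        candidates = [sample_rollout(model, task) for _ in range(samples_per_task)]
        accepted.extend(r for r in candidates if is_successful(r))
    return finetune_on(model, accepted)

model = object()  # stand-in for a pre-trained base model
model = post_training_round(model, tasks=["track container 123", "file invoice"])
```

The point of the loop is that extra compute spent after pre-training (sampling and filtering experience) turns into new, higher-quality training data.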
So I think what's going to happen is like in the next couple of years to get state of the art capabilities, not like necessarily like fast local capabilities, right, but to get state of the art capabilities, every organization is going to need a combination of access to one of the few like true frontier models that has like just the highest level intelligence possible with the private data that's like teaching that model, the particular things that are special to you and your task. So probably a combo. I'm curious. Yeah. Yeah. I mean, I think I'm generally aligned. One of the things that I really like thinking about is how multidimensional intelligence is. For example, I don't know how many of you love, let's say Beyonce, right? Obviously iconic artist. I believe that she has a special kind of intelligence. It's a really rare kind of intelligence, the way that she's able to understand other people and sort of cultural trends and then her own life experience and then synthesize that into a thing that captivates the attention of hundreds of millions of people around the world and thereby makes large amounts of money. This kind of intelligence is pretty rare, pretty useful. At least us humans, we resonate with this. Like a lot of our culture is driven by super unique forms of intelligence. I would almost say like not to say that these are aliens among us, but they're certainly icons among us, right? Of people that just really have special skills. And you know, I don't know what Beyonce's SAT score was. I don't know. I'm not actually very interested in that question, right? It's not directly relevant to the reason why she's so interesting and her work is so valuable. Now I do think that, like you said, having a more general, a smarter, more general intelligence, it does place kind of a ceiling on the capabilities. If your model just isn't very general and just doesn't know very much, it's hard to get it to be really amazing at anything, right? And so I do believe that general intelligence is useful and we're going to continue pushing that frontier. But my belief is that because intelligence is so multidimensional, I think there's probably 8 billion different dimensions of intelligence because there's 8 billion humans on the planet. And I believe that there's something that I could learn from pretty much all of them. And I think that we're going to find out as we deploy AI to solve problems around the world that there are so many different forms of intelligence that we're going to build in order to solve these problems. And I think that's going to be pretty exciting. But one of the sort of the implications of that, which I think goes along with the work that you're doing at ADEPT, is that I think replacing people isn't actually very interesting, right? Because if you have this thing that's so multidimensional and so complicated, making something general that just all of a sudden does everything, I just don't see that that's where we're going to go. Because I think the problem is much more complex than that. I think it's way more complex. I mean, one of my coworkers actually has a really good analogy for this, which is that the best way to build AI that is really good at augmenting people is to think about it kind of more like a cognitive tool and cognitive technology than a robot. In the same way that our brains changed when we evolved writing and when we invented mathematics. 
And similarly, our brains changed when we were able to offload the majority of facts to our phones and to learn how to use calculators, right? I think the same thing is going to happen as we build these increasingly sophisticated AI agents, right? Because then you have another set of things that you don't really need to do, so you can spend your own limited representational capacity learning how to do something else. Then you sort of co-evolve this joint way of thinking with these new models as they get smarter and smarter. And I think that's probably how this is going to play out. And I think most people don't think that yet. In the same skeuomorphic way we saw with early touch devices, right, people are still in the mode of figuring out how the current analogies we have today could apply to this world when you have smarter and smarter AI agents. When what you really need to do is you actually need to go revisit those interaction principles from scratch. Yeah. I totally agree with that. This is one of the reasons, actually, why I continue to be excited about Omniverse, as we call it at NVIDIA, or virtual worlds in general, is that I think a lot of the most interesting ways that people are going to be solving problems with AI are going to happen in a virtual world, as opposed to the skeuomorphic, like, oh, how do I interact with my phone today or how do I work today? So how we build agents that are able to kind of bridge that gap, I think, is really interesting. 100 percent. Taking it down maybe a different direction, I'm really curious, Bryan, what you think about, you know, if you go look at this broader North Star of either generalized AI agents or if we want to call it AGI, what do you think are the remaining big open research problems? Like, the stuff that isn't just, you know, scale these things more, put more data in them. Yeah. Do you think there's anything that's left? I do. I think fundamentally the way that we're doing inference today doesn't really allow for the kind of problem solving that we need to be able to do, because it's fairly linear, so, you know, most of the time when these models are actually being deployed, you ask them a question, they provide an answer. But I don't know, maybe I'm anthropomorphizing it a little bit, but back when I was in school and I was taking a test, you know, the answers to some questions, sure, you just write it down; the answers to other questions, they could take a thousand times more thought. And right now it's really difficult at inference time for our models to be able to allocate compute. You mean like adaptive compute? Yeah. Yeah. It seems like we need the ability for these models to be much more introspective about the outputs they're generating, and that involves allocating compute in different ways. And like if you need to spend a thousand or a million times more compute to generate one token than the others, then we should figure out how to do that. Do you feel like things like all of the chain of thought prompting related tricks and stuff like that are a way to approximate that? I think it's a start. I think it's a start, but, you know, those aren't widely deployed right now. I think one reason is because they're so expensive. And so, you know, kind of going back to the bitter lesson that we spoke about at the beginning, I think we haven't really seen how the bitter lesson applies to inference as much as to training.
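A small illustration of spending more compute at inference time, roughly in the spirit of chain-of-thought with self-consistency: sample many answers and take a majority vote, so one query costs many model calls. The generate_answer function below is a hypothetical stand-in for a sampled call to any language model; this is a sketch of the idea, not how any deployed system allocates compute.

```python
# Sketch: trading extra inference compute for a (hopefully) better answer.
import random
from collections import Counter

def generate_answer(question: str) -> str:
    """Placeholder: one sampled (and possibly wrong) model answer."""
    return random.choice(["42", "42", "41"])

def answer_with_more_compute(question: str, num_samples: int = 16) -> str:
    # More samples = more inference compute per query, at num_samples times the cost
    # of a single reply; the majority vote is one crude way to use that extra compute.
    votes = Counter(generate_answer(question) for _ in range(num_samples))
    return votes.most_common(1)[0][0]

print(answer_with_more_compute("What is 6 * 7?"))
```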
You know, most of the time when we talk about the bitter lesson, we're talking about how do we build frontier models and, you know, just dump insane amounts of training into them. But I think actually there's going to be an analog of that to the deployment phase. And the research for that, I think, is pretty nascent. Interesting. Interesting. So when you say bitter lesson for deployment phase, it'll be like getting rid of handcrafted tricks during inference time to get the base model to have already learned sort of the right things to do during inference? Or do you mean something else? Yeah. I mean, I'm thinking something about there's going to be some connection to the amount of compute that we can spend on inference and the smartness of the models that we produce. And I think that's the underlying thing that I'm getting at. The way of, you know, how we actually implement that, I think that's where the research is going to have to go. I mean, there's a lot of papers about this kind of topic right now. But I don't think anybody's fully cracked it yet. And I think we're going to see some pretty amazing things that come out from much more compute intensive inference. I completely agree with that. I feel like when I think about post-training, you know, there's straight up, you know, how do you get the models to be smarter during inference time? But then there's also just after you're done pre-training the thing, how do you use the artifact you just created to go actually improve the model itself before you even deploy it? And I think that second bucket is going to be huge. I feel like this doesn't have to be RL, but this combination of these base models that have a lot of instincts that have been honed into them through the pre-training phase with, you know, trying to get these things to actually understand the reward signal of the task you're actually trying to solve and then be able to spend compute to push up numbers on that particular reward signal. And we see the early signs of this with things like RLHF, right? But like, yeah, that's like we're at the very, very surface of the full scope of research and what's possible down that particular path. And I think over the next year or two, we're going to see and we're already seeing in the papers, but like we're going to see like true, like discontinuous gains that that happen when you're when you're able to like to like hook up RL and or search in all sorts of different environments with these base models. I mean, just even thinking about like within the agents domain, your computer is one example, but also like all this excitement around around like universal foundation models for robotics. Like that's another great example of how, you know, right now we're doing the pre-training phase for that. But then there's a very obvious second step that happens after you've done the pre-training to make those models seem like amazing controllers and amazing planners for all sorts of robotics tasks. I think it'd be so cool. Yeah, I totally agree. And I feel like there is a there's a bootstrapping that's happening, which, you know, this is a classic technology development cycle that we're going through where, you know, you know, like Moore's law for many years was powered by semiconductors. Right. So like you needed to have better semiconductors in order to build the machines to build the next generation of semiconductor. And I think that we're seeing that with A.I. right. 
You know, one of the most interesting things that I'm excited about doing with our foundation models is using them to understand our data set, synthesize new data sets, and train much smarter next-gen foundation models, because I do think there's a loop that's happening. Yeah, yeah, for sure. And I think we maybe see the early stages of that by using the models as, like, data filters or augmenters and stuff like that. Yeah. Very cool. All right. Well, we're running out of time. So I think maybe just one more question. Like, what's something that you're excited about that you think everybody else is not excited about yet? Oh, that's a good one. That's a good question. So let's see. Well, I mentioned a little bit about, you know, how I think right now so much energy and so much effort is going into multimodal foundation modeling, as it should. Right. Because, you know, it's clear multimodal models have sort of taken over as the default model family. Right. Like I think in some time people will just always be bundling all this stuff in. And then soon we're going to add audio, and soon we're going to add video, like all the tokens, just all the tokens in one particular model. I think actions, like trajectories of behavior, are going to also be added into that. And now you just have this base thing that can itself decide how it wants to allocate its capacity to learn how to model all those things. So that's all great. I'm really, really excited about that. But I think that's actually going to be the majority of, like, new advancements over the next couple of years. But there's these, like, domain-specific things that I am also very excited about, even though I don't work on them. Like we just talked about robotics. But one of the projects that I helped fund at Google was done by a friend, Nal, and his team up in Amsterdam in Europe. What they did was they trained a model basically to outperform the best scientific simulators at weather prediction. And all they did was they just turned the whole planet into these, like, small cells, each of which is represented by a couple of numbers. That's, like, the current precipitation level and humidity and temperature and all these different things. And then they just treated it as, like, OK, I now have this tensor and I just need to predict the tensor at the next time step, and then the tensor at the next time step after that. And let's forget about any physical modeling. And it turns out if you just do this, you now have this sort of universal Earth model of these variables that actually outperforms the physics simulator up until some amount of time. Like, that's just so cool. There's so many other domains where you could just literally say, hey, I have this infinitely flexible input-output engine. Let me just model it and see what happens. But I'm curious what you think. Fantastic. Well, I wanted to put in a plug again for virtual worlds and Omniverse. I think that we're going to find that some of the most interesting experiences are going to come through people interacting with AI. I think one of the questions I'm most curious about is, like, how is AI going to change our culture? And I think that it's going to create a new form of media. You know, in the way that video games were different from movies, AI is going to be different from video games.
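For reference, the weather-prediction framing described a moment ago can be sketched very compactly: represent the planet as a grid of cells with a few channels each, and learn a purely data-driven map from the grid at time t to the grid at time t+1, with no explicit physics. The toy model below (a tiny convolutional network on random tensors) is only a sketch of that setup under assumed toy dimensions, not the actual system built at Google.

```python
# Toy sketch: next-time-step prediction over a gridded "Earth state" tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

channels, lat, lon = 3, 64, 128   # toy grid: e.g. precipitation, humidity, temperature

model = nn.Sequential(            # stand-in for a real forecasting architecture
    nn.Conv2d(channels, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, channels, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

state_t  = torch.randn(16, channels, lat, lon)   # observed grid at time t (random stand-in data)
state_t1 = torch.randn(16, channels, lat, lon)   # observed grid at time t+1

prediction = model(state_t)                      # purely data-driven next-step prediction
loss = F.mse_loss(prediction, state_t1)          # no physical model anywhere in the loop
optimizer.zero_grad()
loss.backward()
optimizer.step()
```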
As a new form of media, AI is going to be much more useful, much more profound, much more interesting and engaging and helpful. And I think that that's going to happen in virtual worlds. And I think that virtual worlds are going to make AI smarter. They're going to make AI more grounded in understanding the problems that we face. And then we're going to work together to solve problems with AI in virtual worlds. And, you know, for me, this is kind of the synthesis of a lot of research that's been happening that I've been watching at NVIDIA and elsewhere for the past 20 years. And I'm really excited about where that's going next. I just love this framing of, like, how will AI impact culture? So I think to me, that is also by far one of the most important things. Like, I feel like anything you're working on hasn't really hit true utility until it starts impacting culture. And we're already starting to see the early days of that. But it's one of the things I love about Bryan; like, you're a super well-rounded person. Like, we talk about AI on stage, but when we're not doing this, it's all manner of different things, and that sort of liberal-artsy bent you're taking on all of this is really cool. Oh, well, thank you. I mean, I feel like it's good to be human. Awesome. Well, we have a few minutes for questions. I think there's microphones in both aisles. Yeah, go ahead. Software partner for NVIDIA: let's think about some kind of Wintel duopoly. You know, I think two dominant players is still better than one. So I want to get Bryan's take on who will be the best software partner for NVIDIA. OK, so I think NVIDIA partners with every software entity around the world. You know, we work with all of them. And so the answer is all of them are going to be successful. We're going to support all of them. You're a great diplomat, Bryan. Well, I mean, I'm also self-interested. I mean, it's one of the great joys about the work NVIDIA does, that we do support many, many different companies with many different perspectives. And, you know, in that way, as AI prospers, we get to kind of ride along. And so I don't think that NVIDIA would want to choose anyone as, like, the most important of our software partners. But we do love working with them. Hi, my name is Vijay. I'm from Dilation Capital from New York. I have a question on scaling laws. You said that you have bet on scaling laws in a very successful manner for the last couple of decades. The human brain is estimated to have about hundreds of trillions of synapses. Given your conviction in scaling laws over the last two decades, how do you view these scaling laws working for the next decade? Are the hundreds of trillions of synapses sort of a high watermark, like where these models could eventually go? And what's the risk of overfitting these models? Thank you. I mean, on this one, I feel like the whole parameter count thing is a little bit of a megapixel war thing for the cameras. Right. It's like, hey, I have 15 megapixels, but I have a shitty lens. I still have a bad camera. I feel like, really, ultimately, I think a better proxy is actually just the number of flops you've pushed through the model. I think that's a better, like, in the near term, I think it's a better measure. But, like, I think that in the same way that, you know, every scaling law is ultimately an S-curve. It's like, where are we on the S-curve?
But I think that, like, not only is there more to run on the pre-training S-curve, we have not even really started what we were loosely talking about earlier, like the post-training S-curve, that much yet. And it's, like, waiting in the wings to take over for another huge amount of progress over the next while. So I personally am quite bullish that we're going to continue to see predictable improvements in progress due to compute and new ideas over the next 10 years. Yeah. And I would also want to say that we still don't understand a lot about how the human brain works. It's really complicated and reducing it to a number is probably oversimplifying it quite a bit. I think there's a lot of baked-in specialization in the structure of the human brain that means that we don't have to learn in the same way that our models do, where our models are literally started from random numbers. But humans, like each of us, we start with a lot more knowledge that's baked into the structure of our brains. And I think that that's hard to quantify and understand as well. What we're building with AI is quite different. So I don't like comparing these numbers because I don't think ultimately it tells us very much. Thank you. Hi, and wonderful chat. I'm glad I was able to attend. You guys mentioned, you know, your big bet before was, like, you know, speed and everything else, and updates in terms of not the system level, but, like, architectures. What is maybe your next big bet that you would make for the next 10 years? For example, I've been interested lately in neural symbolic architectures, world models. So things like that, that maybe give some different form to, you know, algorithms for the future. So what can I buy a lottery ticket for? Well, I love world models. I mean, I think we are seeing just amazing progress. And I was talking about virtual worlds earlier because of that. I also want to put in a big plug for sparsity of different kinds. So I think that we are about done playing out low-precision arithmetic, where we have crunched it down, you know, quite a bit. We're running out of bits, you know, so the way to have fewer bits than one is to go sparse. And so I think we're going to find that we do want to go sparse. We want more structure. Like I was saying earlier, with the human brain, I think there's a lot of knowledge that's baked into the structure. It's not an all-to-all network. Right. And so how do we learn how to build sparsity into our networks so they can be dramatically more compute efficient? So the intelligence per flop can be increased. I think that's going to be a big frontier for us. Plus one to world models. I think if you frame pre-training correctly, then, super hand-wavingly, world models kind of pop out for free. And then the other one is on the architecture side. It's just whatever maps better to hardware. I think so much of this is driven by hardware, by hardware cycles. Like, you could have such a clever architecture and just have it not run efficiently. And you'll never be able to scale that thing as well as someone doing something more vanilla that does map to the hardware. The bitter lesson again. Yeah. All right. So I have a question about the pre-training data set and quality. So you were talking about how quality for pre-training data sets is so important for knowledge.
And if you have a bunch of noise, it can really make the entire model diverge. But our current methods are already super noisy. We have stuff like: at the next stop, take a left, a right, or go straight. And if the model, with our current next token prediction, predicts left, but the answer is right, it's penalized just as much as if it predicts a banana, something totally random. So do you think with our current next token prediction that we'll be able to achieve the kind of next levels of AGI or whatever? Or do you think we'll have to go with a different kind of optimization there? And if so, what kind? Again, it's kind of a what-can-I-get-a-lottery-ticket-for question: what kind of beyond-next-token-prediction optimizations do you think? Well, first of all, like David was just talking about, the thing that wins is the thing that's easy and scalable and that you can dump compute into, and next word prediction has that property, which is why I think it's been so successful. So anything that comes along after next word prediction, I think, is going to share that property. But the second thing that I wanted to say is it's hard to know what we should do beyond that next word prediction because, you know, these more intelligent, let's say more symbolic approaches, like penalizing the model specifically in specific cases, tend to run into the same problems that other approaches to AI have over the past 70 years, where the number of cases is just too enormous and we can't enumerate them. And when we try, we end up messing it up. So the models don't actually learn the right thing. And so that's one of the strengths of next word prediction, is that we can't mess it up with our cleverness. But then the third thing I want to say about next word prediction is that it is tempting to reduce artificial intelligence to flops and loss functions and so forth. But we can do that biologically as well and just be like, does it make sense that intelligence would come from, like, amino acids and lipids? And, you know, the elements can be simple and the elaboration of those elements can be quite complex. And so I don't really feel like the simplicity of next word prediction disqualifies it any more than the simplicity of biochemistry disqualifies it. That's a pretty good analogy. I would also just say, like, sometimes I hear people thinking about how architectures work as, you know, will architecture X be able to do Y? Will training decision X be able to do Y? Those kinds of things. And the answer is always, well, as long as you haven't screwed it up, the answer is always yes. It's just how much compute. Right. And so, like, when we go evaluate whether or not any idea is good, we look at: does the new idea change either the slope or the intercept of the scaling laws? And usually the answer is never, like, OK, there's some straight-up discontinuity that only happens if you try some particular architectural idea. So I think as a result of that, it's like, I think many of these things have room for innovation. But I don't think that they're, like, strictly necessary, actually, even for us to get to that next level of intelligence. Thank you very much. Hi, I'm really enjoying this, and I wanted to bring up a concern I have that dovetails off of your talking about exhausting the available tokens and the multidimensionality of intelligence. I come from an education and psychometrics and neuroscience background.
Also did a little AGI back in 2005, long before it was cool. But anyway, as we've kind of run out of the text-based stuff, a lot of people are jumping on the hot buzzword, which is synthetic data. And I see that as both an opportunity and a potential real trap, where I know from my work in education and human testing and other things like that, that you can get yourself into this solipsistic loop where you're creating very artificial, ungrounded systems where you're getting great results, but they don't actually mean anything as far as intelligence and problem solving goes. And I wanted you guys to talk about opportunities and dangers there. I just feel like synthetic data is super useful as an augmentation, but it's a crutch, because at the end of the day, the underlying complexity of the generator for synthetic data is usually, at least in my experience, and maybe there's things I don't know about, but it's usually capped at some degree of model capacity. You kind of just end up modeling the generator and then you're kind of done. So the crutch is over. So I'm curious what you think. Oh, absolutely. It's interesting. You know, in graphics, we've been doing bottom-up. Like, let's model every single blade of grass and every light source and every photon and, like, bounce it all around. And we've kind of run into the limits there, right? Like, I think the future for graphics has to be a world model, because we've run out of ways of enumerating everything else. And I think synthetic data is kind of a similar thing: you know, just like you were saying, modeling the generator ends up becoming the same problem. And so I'm a big believer in synthetic data, and we use it and it is important. And yet, to the extent that it's a trap, I think we're all aware of it and we're trying to make sure we don't fall into it.
Info
Channel: NVIDIA Developer
Views: 10,178
Id: 1xDidxh2ZCA
Length: 53min 32sec (3212 seconds)
Published: Tue Apr 16 2024