- I think the last time we
talked was almost a year ago, and it was right before
Google launched Bard, and now we're here and Sundar
says it's the Gemini era. You just announced a bunch of
new products around Gemini, including a subscription version. So tell these people what
they should know about Gemini. What is this Gemini era that we're in now? - First of all, I can't
believe it's been a year. It feels like it's been longer. So Gemini is the most capable model that we've built at
Google and Google DeepMind in collaboration with
other parts of Google, like Google Research. And it's a multimodal model. And as you mentioned, it's
available through an app and through Enterprise and developer endpoints in Google Cloud. And then we're integrating
it into our products and services as well. So Search Generative Experience
and Workspace and so on. - So, I mean, I've noticed
you like really leaning into the multimodal part of this. Like what makes it so special, these multimodal capabilities? Like, what can you do now that you couldn't really do previously? - I mean, the world is multimodal, so I think that's... In the most limited sense,
you can do a lot with kind of text in and text out. But even just adding code,
which a lot of people think of as a modality, was huge. Coding is one of the top use cases. Image understanding and image generation have been very popular. We've added that to the app. But part of it's also just the... If you think about the kind of concepts, when we learn concepts, we don't always learn
them in a single modality. And so if you think about what
you can learn from a video or in a book or code,
it's, a lot of times, intelligence comes from
mixing those modalities. So it's not just being
able to generate something in a modality, but when
you train the model, having it learn across
modalities is quite powerful. - So like, that sounds all
like very cool and abstract, but like what's an example of like how a business would use that? - Yeah, well, I mean,
there's a lot of... So most businesses have a lot of assets across various modalities. So imagine like you're
building a marketing campaign and you're doing sentiment analysis, that's an area where
you might want to reason about both your concrete images or videos or media in conjunction with text. And so there's some use cases where it's almost kinda
natively multimodal. Or imagine you are a sports broadcaster and you've got a large video corpus and you say like, "Hey, show
me the most interesting part of this video, or show me
when the quarterback... I think there was a game when the quarterback throws a touchdown in the second inning." - [Miles] That was yesterday, right? - Yeah. And which games were that? And then it can just say, "Oh,
these are these three games." So I think yeah, exactly. Too soon for those of us in San Francisco. (audience laughing) But there are a lot of those things where like, I think years
from now, if you think of just like querying video
with natural language, that's something that's still
actually really hard today. Like queries of video
have gotten a lot better. There are a lot of people
who use YouTube as a search engine, for example. But there's so much
more we can do in terms of like natural language
interfaces to video that I think once we have
it, we're gonna wonder how we lived without it.
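To make the image-plus-text prompting described above a little more concrete, here is a minimal sketch using the google-generativeai Python SDK. The model name, API key placeholder, and file name are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch of a multimodal (image + text) prompt with the
# google-generativeai SDK. Model name, API key, and file path are
# illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")              # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")   # vision-capable model

banner = Image.open("campaign_banner.png")           # hypothetical asset
response = model.generate_content([
    banner,
    "Describe the sentiment this banner conveys and suggest one change "
    "to make the tone more upbeat.",
])
print(response.text)
```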
- Right. Like, can you give us even an example? Like what are some early customers saying? Like, I know you've been
testing this most powerful one with some business customers. Like what are some sort of frequent use cases that you're seeing? - Yeah, I mean, a lot of the
early uses now are trying to test like the frontiers
of the capability. So people who had use cases that didn't work before, like, for example, code generation, and kind of want the
next level of performance or are pushing on advanced reasoning. There are various
dimensions in the model that perform better as you scale it, and one of them is its
reasoning capabilities. And one of the other interesting areas where we see people apply the model is you can use these models
directly, prompt them, get some output, but one of
the most promising use cases for these models is actually using them as building blocks for other systems. So if you think of
something like AlphaCode, AlphaCode is built with Gemini,
but it's a whole system. - And briefly, AlphaCode is just a model for code generation? - Exactly, AlphaCode is a system we built for coding competitions. And so you can think of it as kind of like an AGI programmer. And so you take a
complex problem statement and then generate code to solve it, specifically that kind of
format of coding competition and AlphaCode was built using Gemini and they had kind of major
performance improvements. So I think there's gonna be a lot of systems rather than just
thinking about like, hey, what can you do with the model, kind of prompt in, prompt out, which is obviously
interesting and compelling, a lot of the interesting systems are gonna just be systems that presume models like
Gemini as a building block.
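To sketch what "using the model as a building block" can look like in practice, here is a simplified sample-and-filter loop in the spirit of systems like AlphaCode: generate many candidate programs, then keep one that passes the tests. The generate_candidate helper is a hypothetical stand-in for a call to a code-generation model; this illustrates the pattern only, not the actual AlphaCode implementation.

```python
# Simplified "model as a building block" pattern: sample many candidate
# programs from a model, then filter them by running them against tests.
# generate_candidate() is a hypothetical stand-in for a real model call.
import random
from typing import List, Optional, Tuple

def generate_candidate(problem: str, seed: int) -> str:
    """Hypothetical model call; returns candidate source code as a string."""
    random.seed(seed)
    # A real system would prompt a code model with the problem statement.
    return f"def solve(x):\n    return x + {random.randint(0, 3)}\n"

def passes_tests(source: str, tests: List[Tuple[int, int]]) -> bool:
    """Run the candidate and check it against example input/output pairs."""
    namespace: dict = {}
    try:
        exec(source, namespace)                      # run the generated code
        solve = namespace["solve"]
        return all(solve(x) == y for x, y in tests)
    except Exception:
        return False                                 # crashes count as failures

def solve_with_sampling(problem: str, tests: List[Tuple[int, int]],
                        n: int = 100) -> Optional[str]:
    """Sample n candidates and return the first one that passes all tests."""
    for seed in range(n):
        candidate = generate_candidate(problem, seed)
        if passes_tests(candidate, tests):
            return candidate
    return None

if __name__ == "__main__":
    # Toy problem: the tests say the program should return x + 1.
    print(solve_with_sampling("add one to x", tests=[(1, 2), (5, 6)]))
```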
- And then like there's also a smaller version, a Nano version. - Yeah. - What goes into the decision to make like a really small
version of this model? So what is it useful for? - There's roughly three high-level sizes. So there's Nano, which is for on device. So think your phone,
your laptop, like things where you're gonna run
the model on your device. Pro, which is the kinda standard model, which is a good balance for like performance and capability. So when you're using Bard
or the developer API, the default model's quite good, it's Pro. And then Ultra is the
highest-performing model for when you need something that pushes those capabilities. For Nano, like I said, it's
really motivated by on-device. So we've been doing AI on
phones for quite a while. We also see a lot of opportunities on laptop and desktop computing. And so a lot of platforms, we believe, are gonna have those models deployed as part of the platform. So when you get your phone,
you're gonna have a model. And so I don't know if you
saw the Circle to Search work that we launched with Samsung. That was powered by Nano. We've done a lot of work
with Pixel powered by Nano, and so it's just kind
of like, just similar to the previous point, it
just becomes a building block that when you have a platform
that you start to use.
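As a rough illustration of how the Nano/Pro/Ultra split shows up from the developer API side, here is a minimal sketch with the google-generativeai Python SDK. The tier-to-model mapping is an assumption for illustration: Nano is served on-device rather than through this API, and the Ultra identifier below is hypothetical.

```python
# Minimal sketch of choosing a Gemini tier through the developer API
# (google-generativeai SDK). The model identifiers are illustrative
# assumptions; Nano runs on-device and is not called through this API.
import google.generativeai as genai

TIERS = {
    "pro": "gemini-pro",      # default cloud model: balanced cost and quality
    "ultra": "gemini-ultra",  # assumed identifier for the highest-end tier
}

def ask(prompt: str, tier: str = "pro") -> str:
    """Send a text prompt to the chosen tier and return the reply text."""
    genai.configure(api_key="YOUR_API_KEY")          # placeholder key
    model = genai.GenerativeModel(TIERS[tier])
    return model.generate_content(prompt).text

if __name__ == "__main__":
    print(ask("Summarize what a multimodal model is in one sentence."))
```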
- Yeah, I mean, are you getting a sense yet what use cases you think
will be like enduring? What use cases people will be willing to pay for years out from now? And what may not be. What are you seeing
that is just like not up to the task right now? - Yeah, for Gemini specifically? - Yeah, just for generative AI. - Yeah, I mean, it's still very early. I mean, I think everyone
is still kind of supply and capacity constrained,
so we're still at a point where like, I think we're very much like in the exploration phase in terms of figuring out
like where the frontier is and what works and what doesn't. I mean, coding is an example where you have an early adopter, software developers, a kind
of easy-to-use interface, a chatbot, and a free product. And so it wasn't a surprise
that coding kind of popped out as one of the main use cases because it's a direct productivity boost. The monthly subscription
price is quite reasonable, especially for people in that profession. - So coding will endure. - Yeah, coding for sure will endure, well, not just endure,
I think will evolve. Like right now, if you look at
how people are using coding, they're learning how to write code. Like if you think of like,
well, coding's a big category, so there are people who
don't know how to code, who are using Bard to learn
how to code in the first place. There are other people who
maybe do know how to code, but don't know how to... A backend developer who doesn't know how to do front-end development and who is kind of doing
something they wouldn't be able to do otherwise. And then there are people
who are, like I said, like what we're doing
internally using the model as a building block to build
an amazing coding system that like competes at
an international level on coding competitions is
kind of an extreme example. And so I think that pattern of, I think we're gonna see that
kinda range of uses for coding, we're gonna see that same pattern for other use cases as well. So I think coding will expand over time. So you can imagine like porting. These are all relatively
straightforward examples really. Like I have a coding description, like a competition problem,
and I generate a solution, which is again, quite impressive that we're exceeding
human performance there. But you can imagine saying like, hey, here's a 50-year-old COBOL code base, port it to Java. - That was actually something
we were talking about at the drinks. So that's a very relevant example. - And then explain this part to me. I think there's a bug here. What is it? If you think about what it would mean to have like an
intelligent agent in coding, again, with these kind of short prompts in and short generated code out, we're really just scratching the surface of what you could do for coding. - Right. I mean, are there any things that you thought were promising, but like, we're not quite there yet, like we need a few more? - I mean, there's a lot of
stuff that still doesn't work. If you think about Bard, we're integrated quite heavily in search where like, we annotate the
output with links from search, and it's really a change in paradigm, but we're still very much in this, if you think of like the issue with like hallucinations, for example. We're not in a situation where you can just trust
the model output, right? Like where we're still grounding
it on public information in the web or providing you links or you can almost think of it as like a research assistant. So rather than you going out and saying, oh, hey, I
wanna explore something, I'm gonna go to a bunch of links, open a bunch of tabs,
and then synthesize it. The model does that for you. But at the end of the day,
I'm still gonna wanna know what the source of the information is and I might go there. And so it's more of like a change in the UX paradigm for some of those. If you think of like travel
planning, for example, which is one of the ways I use it. It's more of a change in
like the user experience of travel planning but the end of the day, I'm still kind of
interacting with the web. It's now I've got like an
agent-centric user interface to it versus I'm the agent with a bunch of tabs clicking around. You could imagine a world where you could do a
complete travel experience where it does all the research, you don't need the sources,
books flights for you, asks your feedback,
proactively prompts you when your flight changes
to rejigger things. I mean, you can imagine,
just use travel planning as an example where we go well
beyond where we are today. - Sure. I wanna go back to something you said, you said we're still supply constrained. How does that influence the way you develop things like Gemini? How do you think about costs and how do you think about chips? It seems like it's not going
away any time soon, right? Sort of constraint. - Yeah, well, I think we're
just so early in the process. I mean, we're just so early
in the process in terms of figuring out where the technology works and I think it'll take, I don't think we'll be supply
chain constrained forever. It's just that, I
remember we were thinking through the transformer and BERT, and I remember when I joined Google, I was going on a bike ride with one of the infrastructure engineers who optimized transformers
quite a bit to the point where now I think BERT's been mentioned by Sundar in our earnings
calls two or three times. And like, every time
you type a search query or many times, you're hitting
a BERT model on the backend. And so we took something
that like didn't exist, was a research innovation, and then we did a huge
amount of infrastructure work to deploy that to like
multi-billion user products. So like better document understanding, better query understanding. And so this process of like
inventing a piece of research, figuring out how it could be applicable in a product context with
all the various constraints that that has, and then kind of optimizing the infrastructure and scaling it, that's a
journey we've been on before. And so while the models are still quite expensive computationally to develop and run, like this is the kind of problem we know how to solve. - Right. How confident do you
feel that you can sort of achieve the AI roadmap
given the current constraints? - In terms of our ability to build and deploy the models? - Given the constraints
around chips especially. - Oh, around chips. Yeah, I mean, we definitely, we've done a ton of work to
make our model development and training a lot more efficient. So if you look at like what
we can train given the number of chips we have, that's an area where we've been doing a ton
of work in terms of efficiency. And part of that is just because
we wanna constantly train, like there's so much
progress in the research that you're constantly retraining because you're enabling
new model techniques. So like Gemini, for example, Gemini Ultra is a far bigger model than the previous largest model we built, but it's far cheaper to serve. - [Miles] By what magnitude? - A multiple, single digits, but a multiple. It's more performant by a multiple, but it's also cheaper
to serve by a multiple. And so it wasn't kind of one breakthrough that led us to do that. It's just a bunch of sustained innovation in the model architecture,
improvements over years. And so both in terms of like
sizing the compute to make sure that we can constantly build these models, we figure that out. We're still working on how
to deploy them at scale. But we've launched a lot of freely available
products now that have, if you're a Google One subscriber, you can go try Gemini Ultra yourself. So we found a way to kind of deploy them to a reasonable scale, and
we're gonna keep increasing from there, kind of just
like we did with BERT. - Right, I think Sundar told
me around October last year that if there's one thing
that keeps him up at night, it is the chip situation. Does it keep you up at night too? - I mean, it's certainly a constant, I mean, we were in such a regime
where humans were expensive and machines were cheap,
like in the kind of PC era. And so it's interesting to
go back to this world now where like, the machines
are quite expensive. I wouldn't say the humans are
cheap, but the machines are... I mean, just the sheer amount of compute that goes into these, it's pretty wild. So it's not something that keeps me... I mean, we're fortunate to, like, we've developed our own chips in house, the tensor processing unit
for years, for generations. We deploy it in our own data centers. So we have many fewer constraints in terms of we kind of own our pipeline of compute. - Are you leaning more into that over time versus GPUs and other sources? - I mean, we're leaning into both. Like, I think that we are leaning into our own capacity quite heavily because we have a pretty
insatiable appetite in terms of our products and then
making them accessible. But also NVIDIA's a great partner of ours, we use their chips internally
for some of our workloads, and then they're a key
part of I think any kind of cloud platform. So it's an and for us. Yeah. - Sure. Okay. I wanna switch
subjects very quickly. So it's also almost a year since Google's two big AI teams, Brain, which you were a part
of, and DeepMind merged. And what do you think,
how is it going so far? - Yeah, well, Brain and DeepMind had really similar agendas in terms of the types of things
that we wanted to achieve. And so that really made
things a lot easier. There's alignment on the
kind of the research aims and the product goals and whatnot. And so that's really made
so much of the things that would be... There was no kind of big
culture integration point. We were always both part of Google, and so there was kind of shared DNA,
like people who would move between the two groups. And so actually, like the paper that kind of started this whole revolution was called "A neural
conversational model." And this was pre-transformer, but it kind of articulated this whole idea of using a deep neural
network to build a chatbot. And that was co-authored by Oriol and Quoc, who are kind of senior researchers, both at DeepMind and Brain
who now sit on the same team. And so it's kind of like
in some of the areas, we've actually brought
back together people who collaborated a long time ago. So that part's been great. I mean, I would have to say, like, I work quite a bit with people in London. I definitely wake up
earlier than I used to. So there's that, but
it's been pretty smooth. - So if both teams were working on sort of similar research
agendas, it seems to me like there would be a
lot of overlap, right? And so how do you deal with that overlap? I mean, I guess there was maybe
duplication of work before, so now are the teams just twice as big working on the same issues or how do you allocate
resources between these? - Yeah, well, I mean, that
was part of the motivation is the compute. If you look at the amount of compute that we needed in the
legacy of kind of Brain and DeepMind teams to
achieve our objectives, it was eye watering and we realized that if we joined forces, we'd effectively be getting
twice as much compute. And in kind of the area that
we work, that's a huge deal. So that was definitely quite beneficial in terms of like being
able to join forces. The work is so dynamic, so there are certain areas
like generative media where we had probably like five or six different teams
building text-to-image models, but with different techniques. And when those models were
less computationally expensive to train, mostly because it was early and the quality wasn't as good, you could have five or six teams pursuing different approaches. Now we're at the point where
like some of the models, we have a pretty good understanding of how to build a high-quality model. And so we've merged the teams, we've merged the compute budgets, and then we've basically
kind of divided up the labor and some of the people have gone on to work on video instead of imagery, so that it's not that
so much that the teams are twice as big, it's
just that we've taken four or five different models,
now we're building one model, and some of the people are
now working on video or music or other areas. And so it hasn't been difficult, like there's so much to work on right now that it hasn't been... Compute is definitely much more of a challenge than where to put people. - Right. I think there was a moment
after ChatGPT came out where people were wondering,
well, where's Google? And like, how did Google
not put out something like this first, and to some, Google looked slow. What kind of an impact do you think
the merger has had on the speed of deploying
things like Gemini? - Yeah, I mean, by design, the goal is to accelerate. So we have a shared agenda. We have a shared set of compute resources. If you just look at, like, we actually started Gemini before the merger as a joint... This is actually why it was called Gemini. - How did that start? Like how did the decision to start Gemini come about? - I mean, we basically looked at, we had this, we both wanted to build the world's best
multimodal foundation model. And so it was kind of a natural... DeepMind had a whole series
of really great innovations, like Chinchilla and the
scaling laws and a set of models that they had built. On the Brain and Google Research side, we most recently had been building PaLM, but when we looked at a lot
of the architectural advances, the investments were pretty complementary. So we were using the scaling, the Chinchilla scaling laws
quite heavily in building PaLM. And then a lot of the ideas for PaLM became the basis for Gemini. We both had multimodal LLMs
in development as well, PaLI and Flamingo, Penguin, et cetera. And so it was pretty natural to put those teams together. And then we've executed, I mean, it's been less than a year and we've already done probably
hundreds of launches in that time. I mean, Gemini was pretty
much built in that timeframe and is now shipped and widely available. So it's been... There are some areas where
coordination will slow things down, but there are other areas where
like the model training now follows the sun because
we have both time zones. So there's some things
that have gotten sped up in ways that were unexpected. - Wait, I'm sorry, can you repeat that? The model training follows the? - Well, we pretty much have
people online all the time. So if you have these like
large model training runs, for most hours of the day, someone in the model team is working just because of the geographic
distribution of the team. - And previously what? You would be working in Mountain View and you would go to sleep
and what would happen? - Well, if you have a bug, someone would fix it in the morning, and there was just kind of the normal work rhythm
of a single time zone. Whereas now, pretty
much the sun never sets on Gemini development. - Okay.
(Miles laughing) I know you're working on
the next version of Gemini. I think Sundar already said that. How far along are you in that process? - I can't share the specifics, but we will share more soon, but we're very excited
about what's in the pipe. - What's your ambition for
the next version of Gemini? - Yeah, I mean, we laid
out the kind of vision of a multimodal foundational
model back in December when we launched Gemini v1. And there are a lot of ways where you can imagine better
performance on the kind of dimensions that we outlined, like multimodal
understanding and generation, all of the kind of various benchmarks. And so that's obviously
been one area of focus, but the other is AGI is a
big part of our mission. And so if you think about memory
and planning and reasoning and if you think about
all of the capabilities, if these models are gonna
become part of systems that are effectively intelligent agents, then you can think of
all of those capabilities that we start to need to develop as well, not just kind of being better across the existing dimensions, but kinda getting the
models to do new things. And so that's an area where
we're pretty hard at work too. - Right, and people talk about agents, models that can take
action on your behalf. Do we still need more technological
work until we get there, or is it another question that we still haven't answered yet? - Yeah, I mean, we see
kind of glimpses of that, like with Bard, we have
something called extensions, which is, well, now Gemini, using tools, and so we've gone from a world where it's just kind of a
prompt input, model output to now the model, Gemini can
use tools on your behalf, for example, can use Google
Flights or other services. And so that's an early step in tool use, and tool use is like
one of the components that you need for building an agent. But if you think of
memory and personalization and all of the attributes,
you can start to see the kind of early versions
of those in Gemini. And so that's an area where I think, like, I don't think it's
gonna be a kind of a black or white moment when all of
a sudden an agent pops up and you're like, ah, you've woke up and now you have like a new coworker. I think it's gonna happen incrementally as the capabilities advance. So we're still gonna see some
pretty dramatic improvements, like there'll be new types of
tasks that the model can do that it couldn't do previously. And that will happen. And I think will be kind of some of the most exciting parts of these these new models. - Do you think it'll
be one universal agent or assistant in the
workplace, for instance? Like, it resides on my desktop and it does all these different tasks across all these different dimensions? Or will it be more like
specific and embedded or even like industry specific? - Yeah, I think it's gonna be both. I mean, I think there's gonna be both. I think the technology
is going to, like I said, we're optimizing the technology and thinking through
how to make it deployable. And so it's gonna just
show up in more places as a building block. So I mean, you already kind
of see this in Gmail or Docs, or like, there are areas where
now Nano is in your phone, which is where developers are gonna start using this capability. So like in the Recorder
app, like in the Pixel 8, it can automatically
summarize your recording. You don't even think
about it, but it's like, previously a human would've done it. If you rewind, not that many decades ago, that would've been something a human does. And now just in your pocket
for free, you have something that like just using the
tensor chip in your phone will summarize a two-hour-long recording. And we don't even think
of that as an agent, even though that would've been something that maybe a human would've done before. So I think there are a
lot of areas like that where you're building
out a marketing campaign or you're creating a new deck and we're starting to generate images, and you're like, oh, actually,
I would've gone to a team to develop those images and
now Slides just did it for me. So I think there are gonna
be ways where the technology just kind of incrementally appears and makes our products better. But I think there are
also gonna be use cases where you do have a first class agent that you're using to co-develop
an idea, where it's more of an explicit part of the user interface, where you're, like, going and
you're asking a question or it's proactively prompting you. And like travel, for example, might be a good example
where you're working with a Gemini app to
travel plan something. And it's just kind of
an emergent capability that Gemini will do for you. - Sure. I have one more question. So people wanna think
about audience questions. What about search? You just rolled out Gemini in
the Google app on your iPhone, you can toggle right
between search and Gemini, which to me seems like
something pretty new for Google, which is so closely
associated with search. Do you think people will be using models like Gemini much like
they use search today? - Yeah, I mean, and this is part of why there are kind of two different product experiences is that they do do different things. So if you look at, so for
example, people using Gemini for software development, we had a lot of people with
coding use cases on search, but people weren't going to... And they were kind of learning
about coding and whatnot, but we didn't see search as a place where people
were going to write code. And that's definitely
something we see in Gemini or writing emails. So I think there are a lot of use cases that are new and obviously,
like the product form factor is different and it's a kind of conversational user interface. But I think there's a lot. On the flip side, like we
have a very long tradition of putting new LLM technology into search. And I use the BERT example where now we don't even
think about better document understanding, better query
understanding and SGE, or Search Generative Experience, is just the evolution of that. It's like, we'll put Gemini in search. - Search Generative
Experience for the people. - Yeah, exactly, which is
putting Gemini in search. And there are a lot of places
where we'll do that now and because the technology
is expensive to deploy, kind of like BERT was originally, we could only do it in some places, but again, as we deploy it more widely, as we get scale benefits
from having deployed it over so many services,
the unit costs come down. I think there are a lot of
places where it'll just, kind of as in the previous example, the product will just get
better in a lot of these ways and we'll just think of it as search, but like search has evolved
quite a bit over the last 10 to 15 years, but it's just kind of evolved incrementally in place. And so I think we'll continue to evolve and innovate in that kind
of product form factor while we push on the Gemini app, which is a pretty different
product form factor. - And they might sit side by side like they do in the app today. - Yeah, exactly.
- Interesting. Cool. I wanna take it to the audience. Yes, this gentleman. - Larry Fitzpatrick, OneMain Financial. I'm really interested in
the aspect of reasoning in large language models. Today getting reasoning out
of large language models is a bit of a contortion with
various prompting strategies, chain of thought, tree
of thought, et cetera. And then Meta's CICERO program married a separate strategy
engine with a language model to create a game that played Diplomacy at a pretty high level, right? So I wanna ask about your forecast. Do you think language models and generative AI will learn reasoning or do you think that's a
separate endeavor that needs to be married together
with these language models? - Yeah, that's a great question. I mean, to some extent,
some of your examples, like chain of thought and tree of thought, I would say those are examples
where the model is learning reasoning. But to your point, like I think a lot of the most advanced
systems will be the coupling of a model plus a separate system. And so AlphaFold and AlphaCode
being good examples of that where we take a model as a building block to solve more domain-specific problems. So I do think we'll see
progress on reasoning just because that is
something that we would expect to get better both through the... Not just in terms of like
the prompting strategies and kind of being better able
to pull that performance out of the model, but also
as a native capability, that is something that gets
better as the models scale up. But I think a lot of the
kind of biggest advances are also gonna be domain specific in that way that we apply a model to like playing a game, for example, or a coding competition. Like in those environments, you can get really powerful
things when you combine a model plus a discrete system. But yeah, I do think we're
gonna see the model... Actually, if you use Gemini Ultra, like on the Gemini app, the
reasoning is pretty wild. Like we're already quite good at a number, it beats me at a number of problems. So it's an area we've
made a lot of progress in the last two years. We collectively, the kind
of researchers in this area.
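For readers unfamiliar with the prompting strategies the question refers to, here is a minimal chain-of-thought sketch with the google-generativeai SDK: the prompt simply asks the model to write out intermediate steps before its final answer. The model name and API key are placeholders, and the prompt wording is only one illustrative way to elicit step-by-step reasoning.

```python
# Minimal chain-of-thought prompting sketch: ask the model to show its
# intermediate reasoning before the final answer. Model name and API key
# are placeholders for illustration.
import google.generativeai as genai

def chain_of_thought(question: str) -> str:
    genai.configure(api_key="YOUR_API_KEY")          # placeholder key
    model = genai.GenerativeModel("gemini-pro")
    prompt = (
        "Think step by step and write out your reasoning, then give the "
        "final answer on its own line prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )
    return model.generate_content(prompt).text

# Example usage:
# print(chain_of_thought("A train leaves at 9:40 and arrives at 11:05. "
#                        "How long is the trip?"))
```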
I think we might have time for one more question. Sure. Here. - Hi, I'm Ted Suji,
coming from Tokyo, Japan, working for Nippon Houston. Very, very simple question. Assuming that GPT-4 is the benchmark, what's the difference
between GPT-4 and Gemini? - Yeah, so like I said, Gemini
is available in three sizes. So Ultra is the largest,
most capable model, which you can think of as
being a GPT-4 class model. So in 30 of the 32 benchmarks, we're ahead, but... - [Audience Member] But that's Gemini Pro. - Sure. (audience member speaking indistinctly) Yeah, so Gemini Pro is a smaller, more efficient model. And so depending on the product
application you would use, and again, then the deployment factor, you would use Nano
versus Pro versus Ultra. So if you wanted the most
capable, highest-end model for like complex tasks,
that would be Ultra, which would be likely what
you're using GPT-4 for. (audience member speaking indistinctly) - [Audience Member] In which point you are superior to GPT-4. - Yeah, if you go to the Gemini website, we kind of break down a
set of academic benchmarks and you can see, you can kind
of dig into the benchmarks and look at natural language understanding and reasoning and the
various kind of scores in those benchmarks. There are also external
benchmarks where people put the models through their paces in like a chatbot competition context, but it's effectively better
performance on the kind of core capabilities that
you see in those benchmarks. - Is there any feedback
you're getting from customers who specifically are
testing it versus GPT-4? - Oh yeah, the fun part about
making Gemini Ultra accessible through the app is I now
get all these side-by-sides of GPT-4 and Ultra. And we see a lot about the areas where people find is Gemini better. I also see ones where GPT-4 is better. And so we look at those
and learn a lot from them. - Are there any themes emerging? - Yeah, I'm trying to think of
some of the top-level themes. I mean, I think reasoning is an area where Gemini Ultra is consistently better. I mean, I think there's some areas where, because the model's only been out and we've only been
RLHFing on human feedback for less time, where it's worse. So it's probably not
at a steady state yet.