- I think the last time we
talked was almost a year ago, and it was right before
Google launched Bard, and now we're here and Sundar
says it's the Gemini era. You just announced a bunch of
new products around Gemini, including a subscription version. So tell these people what
they should know about Gemini. What is this Gemini era that we're in now? - First of all, I can't
believe it's been a year. It feels like it's been longer. So Gemini is the most capable model that we've built at
Google and Google DeepMind in collaboration with
other parts of Google, like Google Research. And it's a multimodal model. And as you mentioned, it's
available through an app and through Enterprise and developer endpoints in Google Cloud. And then we're integrating
it into our products and services as well. So Search Generative Experience
and Workspace and so on. - So, I mean, I've noticed
you like really leaning into the multimodal part of this. Like what makes it so special, these multimodal capabilities? Like, what can you do now that you couldn't really do previously? - I mean, the world is multimodal, so I think that's... In the most limited sense,
you can do a lot with kind of text in and text out. But even just adding code,
which a lot of people think of as a modality, was huge. Coding is one of the top use cases. Image understanding and image generation have been very popular. We've added that to the app. But part of it's also just the... If you think about the kind of concepts, when we learn concepts, we don't always learn
them in a single modality. And so if you think about what
you can learn from a video or in a book or code,
it's, a lot of times, intelligence comes from
mixing those modalities. So it's not just being
able to generate something in a modality, but when
you train the model, having it learn across
modalities is quite powerful. - So like, that sounds all
like very cool and abstract, but like what's an example of like how a business would use that? - Yeah, well, I mean,
there's a lot of... So most businesses have a lot of assets across various modalities. So imagine like you're
building a marketing campaign and you're doing sentiment analysis, that's an area where
you might want to reason about both your concrete images or videos or media in conjunction with text. And so there's some use cases where it's almost kinda
natively multimodal. Or imagine you are a sports broadcaster and you've got a large video corpus and you say like, "Hey, show
me the most interesting part of this video, or show me
when the quarterback... I think there was a game when the quarterback throws a touchdown in the second inning." - [Miles] That was yesterday, right? - Yeah. And which games were that? And then it can just say, "Oh,
these are these three games." So I think yeah, exactly. Too soon for those of us in San Francisco. (audience laughing) But there are a lot of those things where like, I think years
from now, if you think of just like querying video
with natural language, that's something that's still
actually really hard today. Like queries of video
have gotten a lot better. There are a lot of people
who use YouTube as a search engine, for example. But there's so much
more we can do in terms of like natural language
interfaces to video that I think once we have
it, we're gonna wonder how we lived without it.
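To make the image-plus-text prompting described above a little more concrete, here is a minimal sketch using the google-generativeai Python SDK. The model name, API key placeholder, and file name are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch of a multimodal (image + text) prompt with the
# google-generativeai SDK. Model name, API key, and file path are
# illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")              # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")   # vision-capable model

banner = Image.open("campaign_banner.png")           # hypothetical asset
response = model.generate_content([
    banner,
    "Describe the sentiment this banner conveys and suggest one change "
    "to make the tone more upbeat.",
])
print(response.text)
```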
- Right. Like, can you give us even an example? Like what are some early customers saying? Like, I know you've been
testing this most powerful one with some business customers. Like what are some sort of frequent use cases that you're seeing? - Yeah, I mean, a lot of the
early uses now are trying to test like the frontiers
of the capability. So people who had use cases that didn't work before, like, for example, code generation, and kind of want the
next level of performance or are pushing on advanced reasoning. There are various
dimensions in the model that perform better as you scale it, and one of them is its
reasoning capabilities. And one of the other interesting areas where we see people apply the model is you can use these models
directly, prompt them, get some output, but one of
the most promising use cases for these models is actually using them as building blocks for other systems. So if you think of
something like AlphaCode, AlphaCode is built with Gemini,
but it's a whole system. - And briefly, AlphaCode is just a model for code generation? - Exactly, AlphaCode is a system we built for coding competitions. And so you can think of it as kind of like an AGI programmer. And so you take a
complex problem statement and then generate code to solve it, specifically that kind of
format of coding competition and AlphaCode was built using Gemini and they had kind of major
performance improvements. So I think there's gonna be a lot of systems rather than just
thinking about like, hey, what can you do with the model, kind of prompt in, prompt out, which is obviously
interesting and compelling, a lot of the interesting systems are gonna just be systems that presume models like
Gemini as a building block.
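To sketch what "using the model as a building block" can look like in practice, here is a simplified sample-and-filter loop in the spirit of systems like AlphaCode: generate many candidate programs, then keep one that passes the tests. The generate_candidate helper is a hypothetical stand-in for a call to a code-generation model; this illustrates the pattern only, not the actual AlphaCode implementation.

```python
# Simplified "model as a building block" pattern: sample many candidate
# programs from a model, then filter them by running them against tests.
# generate_candidate() is a hypothetical stand-in for a real model call.
import random
from typing import List, Optional, Tuple

def generate_candidate(problem: str, seed: int) -> str:
    """Hypothetical model call; returns candidate source code as a string."""
    random.seed(seed)
    # A real system would prompt a code model with the problem statement.
    return f"def solve(x):\n    return x + {random.randint(0, 3)}\n"

def passes_tests(source: str, tests: List[Tuple[int, int]]) -> bool:
    """Run the candidate and check it against example input/output pairs."""
    namespace: dict = {}
    try:
        exec(source, namespace)                      # run the generated code
        solve = namespace["solve"]
        return all(solve(x) == y for x, y in tests)
    except Exception:
        return False                                 # crashes count as failures

def solve_with_sampling(problem: str, tests: List[Tuple[int, int]],
                        n: int = 100) -> Optional[str]:
    """Sample n candidates and return the first one that passes all tests."""
    for seed in range(n):
        candidate = generate_candidate(problem, seed)
        if passes_tests(candidate, tests):
            return candidate
    return None

if __name__ == "__main__":
    # Toy problem: the tests say the program should return x + 1.
    print(solve_with_sampling("add one to x", tests=[(1, 2), (5, 6)]))
```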
- And then like there's also a smaller version, a Nano version. - Yeah. - What goes into the decision to make like a really small
version of this model? So what is it useful for? - There's roughly three high-level sizes. So there's Nano, which is for on device. So think your phone,
your laptop, like things where you're gonna run
the model on your device. Pro, which is the kinda standard model, which is a good balance for like performance and capability. So when you're using Bard
or the developer API, the default model's quite good, it's Pro. And then Ultra is the
highest-performing model for when you need something that pushes those capabilities. For Nano, like I said, it's
really motivated by on-device. So we've been doing AI on
phones for quite a while. We also see a lot of opportunities on laptop and desktop computing. And so a lot of platforms, we believe, are gonna have those models deployed as part of the platform. So when you get your phone,
you're gonna have a model. And so I don't know if you
saw the Circle to Search work that we launched with Samsung. That was powered by Nano. We've done a lot of work
with Pixel powered by Nano, and so it's just kind
of like, just similar to the previous point, it
just becomes a building block that when you have a platform
that you start to use.
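As a rough illustration of how the Nano/Pro/Ultra split shows up from the developer API side, here is a minimal sketch with the google-generativeai Python SDK. The tier-to-model mapping is an assumption for illustration: Nano is served on-device rather than through this API, and the Ultra identifier below is hypothetical.

```python
# Minimal sketch of choosing a Gemini tier through the developer API
# (google-generativeai SDK). The model identifiers are illustrative
# assumptions; Nano runs on-device and is not called through this API.
import google.generativeai as genai

TIERS = {
    "pro": "gemini-pro",      # default cloud model: balanced cost and quality
    "ultra": "gemini-ultra",  # assumed identifier for the highest-end tier
}

def ask(prompt: str, tier: str = "pro") -> str:
    """Send a text prompt to the chosen tier and return the reply text."""
    genai.configure(api_key="YOUR_API_KEY")          # placeholder key
    model = genai.GenerativeModel(TIERS[tier])
    return model.generate_content(prompt).text

if __name__ == "__main__":
    print(ask("Summarize what a multimodal model is in one sentence."))
```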
- Yeah, I mean, are you getting a sense yet what use cases you think
will be like enduring? What use cases people will be willing to pay for years out from now? And what may not be. What are you seeing
that is just like not up to the task right now? - Yeah, for Gemini specifically? - Yeah, just for generative AI. - Yeah, I mean, it's still very early. I mean, I think everyone
is still kind of supply and capacity constrained,
so we're still at a point where like, I think we're very much like in the exploration phase in terms of figuring out
like where the frontier is and what works and what doesn't. I mean, coding is an example where you have an early adopter, software developers, a kind
of easy-to-use interface, a chatbot, and a free product. And so it wasn't a surprise
that coding kind of popped out as one of the main use cases because it's a direct productivity boost. The monthly subscription
price is quite reasonable, especially for people in that profession. - So coding will endure. - Yeah, coding for sure will endure, well, not just endure,
I think will evolve. Like right now, if you look at
how people are using coding, they're learning how to write code. Like if you think of like,
well, coding's a big category, so there are people who
don't know how to code, who are using Bard to learn
how to code in the first place. There are other people who
maybe do know how to code, but don't know how to... A backend developer who doesn't know how to do front-end development and who is kind of doing
something they wouldn't be able to do otherwise. And then there are people
who are, like I said, like what we're doing
internally using the model as a building block to build
an amazing coding system that like competes at
an international level on coding competitions is
kind of an extreme example. And so I think that pattern of, I think we're gonna see that
kinda range of uses for coding, we're gonna see that same pattern for other use cases as well. So I think coding will expand over time. So you can imagine like porting. These are all relatively
straightforward examples really. Like I have a coding description, like a competition problem,
and I generate a solution, which is again, quite impressive that we're exceeding
human performance there. But you can imagine saying like, hey, here's a 50-year-old COBOL code base, port it to Java. - That was actually something
we were talking about at the drinks. So that's a very relevant example. - And then explain this part to me. I think there's a bug here. What is it? If you think about what it would mean to have like an
intelligent agent in coding, again, with these kind of short prompts in and short generated code out, we're really just scratching the surface of what you could do for coding. - Right. I mean, are there any things that you thought were promising, but like, we're not quite there yet, like we need a few more? - I mean, there's a lot of
stuff that still doesn't work. If you think about Bard, we're integrated quite heavily in search where like, we annotate the
output with links from search, and it's really a change in paradigm, but we're still very much in this, if you think of like the issue with like hallucinations, for example. We're not in a situation where you can just trust
the model output, right? Like where we're still grounding
it on public information in the web or providing you links or you can almost think of it as like a research assistant. So rather than you going out and saying, oh, hey, I
wanna explore something, I'm gonna go to a bunch of links, open a bunch of tabs,
and then synthesize it. The model does that for you. But at the end of the day,
I'm still gonna wanna know what the source of the information is and I might go there. And so it's more of like a change in the UX paradigm for some of those. If you think of like travel
planning, for example, which is one of the ways I use it. It's more of a change in
like the user experience of travel planning but the end of the day, I'm still kind of
interacting with the web. It's now I've got like an
agent-centric user interface to it versus I'm the agent with a bunch of tabs clicking around. You could imagine a world where you could do a
complete travel experience where it does all the research, you don't need the sources,
books flights for you, asks your feedback,
proactively prompts you when your flight changes
to rejigger things. I mean, you can imagine,
just use travel planning as an example where we go well
beyond where we are today. - Sure. I wanna go back to something you said, you said we're still supply constrained. How does that influence the way you develop things like Gemini? How do you think about costs and how do you think about chips? It seems like it's not going
away any time soon, right? Sort of constraint. - Yeah, well, I think we're
just so early in the process. I mean, we're just so early
in the process in terms of figuring out where the technology works and I think it'll take, I don't think we'll be supply
chain constrained forever. It's just that, I
remember we were thinking through the transformer and BERT, and I remember when I joined Google, I was going on a bike ride with one of the infrastructure engineers who optimized transformers
quite a bit to the point where now I think BERT's been mentioned by Sundar in our earnings
calls two or three times. And like, every time
you type a search query or many times, you're hitting
a BERT model on the backend. And so we took something
that like didn't exist, was a research innovation, and then we did a huge
amount of infrastructure work to deploy that to like
multi-billion user products. So like better document understanding, better query understanding. And so this process of like
inventing a piece of research, figuring out how it could be applicable in a product context with
all the various constraints that that has, and then kind of optimizing the infrastructure and scaling it, that's a
journey we've been on before. And so while the models are still quite expensive computationally to develop and run, like this is the kind of problem we know how to solve. - Right. How confident do you
feel that you can sort of achieve the AI roadmap
given the current constraints? - In terms of our ability to build and deploy the models? - Given the constraints
around chips especially. - Oh, around chips. Yeah, I mean, we definitely, we've done a ton of work to
make our model development and training a lot more efficient. So if you look at like what
we can train given the number of chips we have, that's an area where we've been doing a ton
of work in terms of efficiency. And part of that is just because
we wanna constantly train, like there's so much
progress in the research that you're constantly retraining because you're enabling
new model techniques. So like Gemini, for example, Gemini Ultra is a far bigger model than the previous largest model we built, but it's far cheaper to serve. - [Miles] By what magnitude? - A multiple, single digits, but a multiple. It's more performant by a multiple, but it's also cheaper
to serve by a multiple. And so it wasn't kind of one breakthrough that led us to do that. It's just a bunch of sustained innovation in the model architecture,
improvements over years. And so both in terms of like
sizing the compute to make sure that we can constantly build these models, we figure that out. We're still working on how
to deploy them at scale. But we've launched a lot of freely available
products now that have, if you're a Google One subscriber, you can go try Gemini Ultra yourself. So we found a way to kind of deploy them to a reasonable scale, and
we're gonna keep increasing from there, kind of just
like we did with BERT. - Right, I think Sundar told
me around October last year that if there's one thing
that keeps him up at night, it is the chip situation. Does it keep you up at night too? - I mean, it's certainly a constant, I mean, we were in such a regime
where humans were expensive and machines were cheap,
like in the kind of PC era. And so it's interesting to
go back to this world now where like, the machines
are quite expensive. I wouldn't say the humans are
cheap, but the machines are... I mean, just the sheer amount of compute that goes into these, it's pretty wild. So it's not something that keeps me... I mean, we're fortunate to, like, we've developed our own chips in house, the tensor processing unit
for years, for generations. We deploy it in our own data centers. So we have many fewer constraints in terms of we kind of own our pipeline of compute. - Are you leaning more into that over time versus GPUs and other sources? - I mean, we're leaning into both. Like, I think that we are leaning into our own capacity quite heavily because we have a pretty
insatiable appetite in terms of our products and then
making them accessible. But also NVIDIA's a great partner of ours, we use their chips internally
for some of our workloads, and then they're a key
part of I think any kind of cloud platform. So it's an and for us. Yeah. - Sure. Okay. I wanna switch
subjects very quickly. So it's also almost a year since Google's two big AI teams, Brain, which you were a part
of, and DeepMind merged. And what do you think,
how is it going so far? - Yeah, well, Brain and DeepMind had really similar agendas in terms of the types of things
that we wanted to achieve. And so that really made
things a lot easier. There's alignment on the
kind of the research aims and the product goals and whatnot. And so that's really made
so much of the things that would be... There was no kind of big
culture integration point. We were always both part of Google, and so there was kind of shared DNA,
like people who would move between the two groups. And so actually, like the paper that kind of started this whole revolution was called "A neural
conversational model." And this was pre-transformer, but it kind of articulated this whole idea of using a deep neural
network to build a chatbot. And that was co-authored by Oriol and Quoc, who are kind of senior researchers, both at DeepMind and Brain
who now sit on the same team. And so it's kind of like
in some of the areas, we've actually brought
back together people who collaborated a long time ago. So that part's been great. I mean, I would have to say, like, I work quite a bit with people in London. I definitely wake up
earlier than I used to. So there's that, but
it's been pretty smooth. - So if both teams were working on sort of similar research
agendas, it seems to me like there would be a
lot of overlap, right? And so how do you deal with that overlap? I mean, I guess there was maybe
duplication of work before, so now are the teams just twice as big working on the same issues or how do you allocate
resources between these? - Yeah, well, I mean, that
was part of the motivation is the compute. If you look at the amount of compute that we needed in the
legacy of kind of Brain and DeepMind teams to
achieve our objectives, it was eye watering and we realized that if we joined forces, we'd effectively be getting
twice as much compute. And in kind of the area that
we work, that's a huge deal. So that was definitely quite beneficial in terms of like being
able to join forces. The work is so dynamic, so there are certain areas
like generative media where we had probably like five or six different teams
building text-to-image models, but with different techniques. And when those models were
less computationally expensive to train, mostly because it was early and the quality wasn't as good, you could have five or six teams pursuing different approaches. Now we're at the point where
like some of the models, we have a pretty good understanding of how to build a high-quality model. And so we've merged the teams, we've merged the compute budgets, and then we've basically
kind of divided up the labor and some of the people have gone on to work on video instead of imagery, so that it's not that
so much that the teams are twice as big, it's
just that we've taken four or five different models,
now we're building one model, and some of the people are
now working on video or music or other areas. And so it hasn't been difficult, like there's so much to work on right now that it hasn't been... Compute is definitely much more of a challenge than where to put people. - Right. I think there was a moment
after ChatGPT came out where people were wondering,
well, where's Google? And like, how did Google
not put out something like this first, and to some, Google looked slow. What kind of an impact do you think
the merger has had on the speed of deploying
things like Gemini? - Yeah, I mean, by design, the goal is to accelerate. So we have a shared agenda. We have a shared set of compute resources. If you just look at, like, we actually started Gemini before the merger as a joint... This is actually why it was called Gemini. - How did that start? Like how did the decision to start Gemini come about? - I mean, we basically looked at, we had this, we both wanted to build the world's best
multimodal foundation model. And so it was kind of a natural... DeepMind had a whole series
of really great innovations, like Chinchilla and the
scaling laws and a set of models that they had built. On the Brain and Google Research side, we most recently had been building PaLM, but when we looked at a lot
of the architectural advances, the investments were pretty complementary. So we were using the scaling, the Chinchilla scaling laws
quite heavily in building PaLM. And then a lot of the ideas for PaLM became the basis for Gemini. We both had multimodal LLMs
in development as well, PaLI and Flamingo, Penguin, et cetera. And so it was pretty natural to put those teams together. And then we've executed, I mean, it's been less than a year and we've already done probably
hundreds of launches in that time. I mean, Gemini was pretty
much built in that timeframe and is now shipped and widely available. So it's been... There are some areas where
coordination will slow things down, but there are other areas where
like the model training now follows the sun because
we have both time zones. So there's some things
that have gotten sped up in ways that were unexpected. - Wait, I'm sorry, can you repeat that? The model training follows the? - Well, we pretty much have
people online all the time. So if you have these like
large model training runs, for most hours of the day, someone in the model team is working just because of the geographic
distribution of the team. - And previously what? You would be working in Mountain View and you would go to sleep
and what would happen? - Well, if you have a bug, someone would fix it in the morning, and there was just kind of the normal work rhythm
of a single time zone. Whereas now, pretty
much the sun never sets on Gemini development. - Okay.
(Miles laughing) I know you're working on
the next version of Gemini. I think Sundar already said that. How far along are you in that process? - I can't share the specifics, but we will share more soon, but we're very excited
about what's in the pipe. - What's your ambition for
the next version of Gemini? - Yeah, I mean, we laid
out the kind of vision of a multimodal foundational
model back in December when we launched Gemini v1. And there are a lot of ways where you can imagine better
performance on the kind of dimensions that we outlined, like multimodal
understanding and generation, all of the kind of various benchmarks. And so that's obviously
been one area of focus, but the other is AGI is a
big part of our mission. And so if you think about memory
and planning and reasoning and if you think about
all of the capabilities, if these models are gonna
become part of systems that are effectively intelligent agents, then you can think of
all of those capabilities that we start to need to develop as well, not just kind of being better across the existing dimensions, but kinda getting the
models to do new things. And so that's an area where
we're pretty hard at work too. - Right, and people talk about agents, models that can take
action on your behalf. Do we still need more technological
work until we get there, or is it another question that we still haven't answered yet? - Yeah, I mean, we see
kind of glimpses of that, like with Bard, we have
something called extensions, which is, well, now Gemini, using tools, and so we've gone from a world where it's just kind of a
prompt input, model output to now the model, Gemini can
use tools on your behalf, for example, can use Google
Flights or other services. And so that's an early step in tool use, and tool use is like
one of the components that you need for building an agent. But if you think of
memory and personalization and all of the attributes,
you can start to see the kind of early versions
of those in Gemini. And so that's an area where I think, like, I don't think it's
gonna be a kind of a black or white moment when all of
a sudden an agent pops up and you're like, ah, you've woke up and now you have like a new coworker. I think it's gonna happen incrementally as the capabilities advance. So we're still gonna see some
pretty dramatic improvements, like there'll be new types of
tasks that the model can do that it couldn't do previously. And that will happen. And I think will be kind of some of the most exciting parts of these these new models. - Do you think it'll
be one universal agent or assistant in the
workplace, for instance? Like, it resides on my desktop and it does all these different tasks across all these different dimensions? Or will it be more like
specific and embedded or even like industry specific? - Yeah, I think it's gonna be both. I mean, I think there's gonna be both. I think the technology
is going to, like I said, we're optimizing the technology and thinking through
how to make it deployable. And so it's gonna just
show up in more places as a building block. So I mean, you already kind
of see this in Gmail or Docs, or like, there are areas where
now Nano is in your phone, which is where developers are gonna start using this capability. So like in the Recorder
app, like in the Pixel 8, it can automatically
summarize your recording. You don't even think
about it, but it's like, previously a human would've done it. If you rewind, not that many decades ago, that would've been something a human does. And now just in your pocket
for free, you have something that like just using the
tensor chip in your phone will summarize a two-hour-long recording. And we don't even think
of that as an agent, even though that would've been something that maybe a human would've done before. So I think there are a
lot of areas like that where you're building
out a marketing campaign or you're creating a new deck and we're starting to generate images, and you're like, oh, actually,
I would've gone to a team to develop those images and
now Slides just did it for me. So I think there are gonna
be ways where the technology just kind of incrementally appears and makes our products better. But I think there are
also gonna be use cases where you do have a first class agent that you're using to co-develop
an idea, where it's more of an explicit part of the user interface, where you're, like, going and
you're asking a question or it's proactively prompting you. And like travel, for example, might be a good example
where you're working with a Gemini app to
travel plan something. And it's just kind of
an emergent capability that Gemini will do for you. - Sure. I have one more question. So people wanna think
about audience questions. What about search? You just rolled out Gemini in
the Google app on your iPhone, you can toggle right
between search and Gemini, which to me seems like
something pretty new for Google, which is so closely
associated with search. Do you think people will be using models like Gemini much like
they use search today? - Yeah, I mean, and this is part of why there are kind of two different product experiences is that they do do different things. So if you look at, so for
example, people using Gemini for software development, we had a lot of people with
coding use cases on search, but people weren't going to... And they were kind of learning
about coding and whatnot, but we didn't see search as a place where people
were going to write code. And that's definitely
something we see in Gemini or writing emails. So I think there are a lot of use cases that are new and obviously,
like the product form factor is different and it's a kind of conversational user interface. But I think there's a lot. On the flip side, like we
have a very long tradition of putting new LLM technology into search. And I use the BERT example where now we don't even
think about better document understanding, better query
understanding and SGE, or Search Generative Experience, is just the evolution of that. It's like, we'll put Gemini in search. - Search Generative
Experience for the people. - Yeah, exactly, which is
putting Gemini in search. And there are a lot of places
where we'll do that now and because the technology
is expensive to deploy, kind of like BERT was originally, we could only do it in some places, but again, as we deploy it more widely, as we get scale benefits
from having deployed it over so many services,
the unit costs come down. I think there are a lot of
places where it'll just, kind of as in the previous example, the product will just get
better in a lot of these ways and we'll just think of it as search, but like search has evolved
quite a bit over the last 10 to 15 years, but it's just kind of evolved incrementally in place. And so I think we'll continue to evolve and innovate in that kind
of product form factor while we push on the Gemini app, which is a pretty different
product form factor. - And they might sit side by side like they do in the app today. - Yeah, exactly.
- Interesting. Cool. I wanna take it to the audience. Yes, this gentleman. - Larry Fitzpatrick, OneMain Financial. I'm really interested in
the aspect of reasoning in large language models. Today getting reasoning out
of large language models is a bit of a contortion with
various prompting strategies, chain of thought, tree
of thought, et cetera. And then Meta's CICERO program married a separate strategy
engine with a language model to create a game that played Diplomacy at a pretty high level, right? So I wanna ask about your forecast. Do you think language models and generative AI will learn reasoning or do you think that's a
separate endeavor that needs to be married together
with these language models? - Yeah, that's a great question. I mean, to some extent,
some of your examples, like chain of thought and tree of thought, I would say those are examples
where the model is learning reasoning. But to your point, like I think a lot of the most advanced
systems will be the coupling of a model plus a separate system. And so AlphaFold and AlphaCode
being good examples of that where we take a model as a building block to solve more domain-specific problems. So I do think we'll see
progress on reasoning just because that is
something that we would expect to get better both through the... Not just in terms of like
the prompting strategies and kind of being better able
to pull that performance out of the model, but also
as a native capability, that is something that gets
better as the models scale up. But I think a lot of the
kind of biggest advances are also gonna be domain specific in that way that we apply a model to like playing a game, for example, or a coding competition. Like in those environments, you can get really powerful
things when you combine a model plus a discrete system. But yeah, I do think we're
gonna see the model... Actually, if you use Gemini Ultra, like on the Gemini app, the
reasoning is pretty wild. Like we're already quite good at a number, it beats me at a number of problems. So it's an area we've
made a lot of progress in the last two years. We collectively, the kind
of researchers in this area.
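For readers unfamiliar with the prompting strategies the question refers to, here is a minimal chain-of-thought sketch with the google-generativeai SDK: the prompt simply asks the model to write out intermediate steps before its final answer. The model name and API key are placeholders, and the prompt wording is only one illustrative way to elicit step-by-step reasoning.

```python
# Minimal chain-of-thought prompting sketch: ask the model to show its
# intermediate reasoning before the final answer. Model name and API key
# are placeholders for illustration.
import google.generativeai as genai

def chain_of_thought(question: str) -> str:
    genai.configure(api_key="YOUR_API_KEY")          # placeholder key
    model = genai.GenerativeModel("gemini-pro")
    prompt = (
        "Think step by step and write out your reasoning, then give the "
        "final answer on its own line prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )
    return model.generate_content(prompt).text

# Example usage:
# print(chain_of_thought("A train leaves at 9:40 and arrives at 11:05. "
#                        "How long is the trip?"))
```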
I think we might have time for one more question. Sure. Here. - Hi, I'm Ted Suji,
coming from Tokyo, Japan, working for Nippon Houston. Very, very simple question. Assuming that GPT-4 is the benchmark, what's the difference
between GPT-4 and Gemini? - Yeah, so like I said, Gemini
is available in three sizes. So Ultra is the largest,
most capable model, which you can think of as
being a GPT-4 class model. So in 30 of the 32 benchmarks, we're ahead, but... - [Audience Member] But that's Gemini Pro. - Sure. (audience member speaking indistinctly) Yeah, so Gemini Pro is a smaller, more efficient model. And so depending on the product
application you would use, and again, then the deployment factor, you would use Nano
versus Pro versus Ultra. So if you wanted the most
capable, highest-end model for like complex tasks,
that would be Ultra, which would be likely what
you're using GPT-4 for. (audience member speaking indistinctly) - [Audience Member] In which point you are superior to GPT-4. - Yeah, if you go to the Gemini website, we kind of break down a
set of academic benchmarks and you can see, you can kind
of dig into the benchmarks and look at natural language understanding and reasoning and the
various kind of scores in those benchmarks. There are also external
benchmarks where people put the models through their paces in like a chatbot competition context, but it's effectively better
performance on the kind of core capabilities that
you see in those benchmarks. - Is there any feedback
you're getting from customers who specifically are
testing it versus GPT-4? - Oh yeah, the fun part about
making Gemini Ultra accessible through the app is I now
get all these side-by-sides of GPT-4 and Ultra. And we see a lot about the areas where people find is Gemini better. I also see ones where GPT-4 is better. And so we look at those
and learn a lot from them. - Are there any themes emerging? - Yeah, I'm trying to think of
some of the top-level themes. I mean, I think reasoning is an area where Gemini Ultra is consistently better. I mean, I think there's some areas where, because the model's only been out and we've only been
RLHFing on human feedback for less time, where it's worse. So it's probably not
at a steady state yet.