AIDAN 00:00
It was quite extraordinary, quite extraordinarily convenient: that simply by scraping more data off the web - not necessarily clean data, messy data, just web data, you're taking in everything and there's tons of junk out there - but taking in a very noisy, messy, massive dataset, making the model bigger, throwing some more chips at it, what came out the other side was something that understood language in a way I personally thought we were decades from. CRAIG 00:55
We're talking this week to Aidan Gomez, who helped develop the transformer algorithm,
which lies at the heart of generative AI and powers large language models, such as GPT-4.
Aidan now leads a startup, Cohere, a platform that offers users access to pre-built LLMs, as well as allowing users to create their own LLMs. But first, I want to give a shout out to our
sponsor, and encourage anyone with a business to take advantage of a deal from Oracle, which
is offering a full NetSuite implementation with no down payment and no interest for six months.
NetSuite is a cloud-based business management software for enterprise resource planning,
financial management, customer relationship management, and e-commerce. To take advantage of the offer, go to netsuite.com/eyeonai. Now let's get back to Aidan. AIDAN 02:20
I'm Aidan. I am the CEO and co-founder of Cohere. I started the company with Nick [Frosst] and Ivan [Zhang] about three and a half, four years ago. Before that, I was kind of the perpetual intern at Google Brain, during my undergrad and then later my PhD. I started down in the Bay Area, in Mountain View. AIDAN
I was part of the team that created the transformer. And it was incredibly exciting.
You know, it took the world by storm, I think, certainly, to my surprise, and I think everyone on
the team was quite taken aback by its popularity. Before Google, and also during Google, I was an undergrad at U of T [University of Toronto]. I grew up in rural Ontario, Canada, in a maple forest. And so I'm the world's most Canadian man. Yeah, that's me. CRAIG 03:25
And so you were at U of T studying with Geoff Hinton? I guess he was probably kind of retired from teaching by then. AIDAN 03:38
He was definitely not teaching, but he was still at the university. This is before the Vector Institute was created. And so, yeah - I didn't really get into deep learning until after second year. And when I started looking into it, I became obsessed, and I was just reading papers night and day. I would fall asleep with a research paper sitting on my bedside; in between sets at the gym, you know, I'd have a stack of papers that I was reading through. And I kept seeing this name. And his affiliation was U of T, which was where I was. And so I reached out to Geoff - this is before Google. I'd been reading his papers; at that point I was studying, you know, ReLUs and MLPs, just the simplest pieces of the deep learning stack. And I was like, you know, why do you have these functions that are
just flat and then up? I think that they should be periodic. And so I emailed him with an idea
being like, Hey, why did you make this decision? I think they should be periodic. There
should be some regularity and it should be bounded so that you know, it doesn't
go to infinity if we get a large input. And to my surprise, he responded, and
he actually explained the decision. And so that was pretty amazing. That was
my first interaction with Geoff. And then when I came back from Google in Mountain View to Toronto, Geoff said, hey, come work with me in the Toronto Brain office. And that was where I met my co-founder, Nick. CRAIG 05:31
So you worked on the transformer algorithm with a team in Mountain View at Google. Google Brain, was it? Yeah, Google Brain. So can you explain that periodic versus stable idea? Which algorithm were you talking about? AIDAN 05:54
Yeah, I mean, it's not very important, because I was wrong. It doesn't really matter. I think it's more just to Geoff's credit, the fact that he responded to a second-year undergrad with, you know, a wacky idea, earnestly. This guy was literally at the top of his field, yet took time for me.
It's interesting. So, in deep learning and neural networks, we have these neurons, and these neurons fire. There's some function that determines their firing: generally some threshold below which they don't fire, they stay dormant, and above it they fire. And when they're firing, they fire linearly, proportional to the input intensity they're getting. So if the input intensity is high, the output intensity is high. But that leads to potentially unstable behavior: if, for whatever reason, there's some sort of blow-up, some burst of signal coming in, then you'll get a huge burst out, and that'll propagate and make things more and more noisy. That leads to instability; it makes training complicated. So my proposal was, instead of firing linearly proportional to your inputs, have some sort of predictable, regular, periodic pattern, like a sine wave or something, so that you always know your output is bounded between some values. But that has not taken off, and we've since solved the training instability, the blow-ups, that type of thing. So that was just my first email to Geoff, I think six months into my study of deep learning.
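To make the contrast concrete, here is a rough sketch in Python of the two behaviors he's describing - a standard ReLU-style activation versus a bounded, periodic alternative (an illustration of the idea, not the actual proposal he emailed):

```python
import numpy as np

def relu(x):
    # Standard activation: dormant below the threshold, then firing
    # linearly in proportion to the input, so a huge input produces
    # a huge, potentially destabilizing output.
    return np.maximum(0.0, x)

def periodic_activation(x):
    # A bounded, periodic alternative: no matter how large the input,
    # the output stays within [-1, 1].
    return np.sin(x)

x = np.array([-2.0, 0.5, 100.0])
print(relu(x))                 # [  0.    0.5  100. ] - unbounded
print(periodic_activation(x))  # always within [-1, 1]
```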
CRAIG 08:02
Wow. That's impressive. And from a maple forest. AIDAN 08:11
Yeah, I love that. But going back - CRAIG 08:13
Then at Google Brain, what was the project that you were working on? What was the initial idea that led to the transformer? AIDAN 08:30
So I was on the infrastructure side. The original idea, the project I joined Google for - I was working with Lukasz Kaiser, and I think Lukasz operates half a decade to a decade ahead of his time, constantly. The project that I joined for was actually this paper called 'One Model To Learn Them All.' And the idea was: we're going to take every single dataset that machine learning researchers have compiled, and we're going to put it into one model. And that means it needs to be multimodal, because we have datasets for images, for audio, video, you know, text, everything. And so what we wanted to do was throw all the modalities in, as well as out. So you can consume video and, let's say, describe the video, or you can consume audio and transcribe it. But you can also take in some text and then produce audio, or you can just describe the video that you want and video comes out the other side. So it's fully multimodal on both the input and output side, and we just train on everything - truly everything we've come across. AIDAN
This now sounds kind of familiar, right? Because this is sort of the project roadmap that we're on right now with these large language models: we're throwing everything we have, the entire internet, at them, and now we're starting to add in every modality that we can. So that was what I joined for; that was a different project altogether. To support that project, we built this piece of software, this piece of infrastructure, because that model was going to be huge, and the data pipelines were going to have to be extraordinarily complex, so we needed something to suit that. What we did was create this library called Tensor2Tensor. It could distribute across arbitrary numbers of GPUs, like thousands and thousands and thousands. And it was very focused on autoregressive modeling, which is the type of modeling that the transformer is. AIDAN
And so at that time, I was sitting next to Noam [Shazeer], who was fiddling with autoregressive models, and in particular attention-based models. He was really interested in attention. And then we heard about a team over in Translate, which was being led by Jakob [Uszkoreit], which was also interested in attention-based autoregressive models. And so Lukasz convinced Noam and Jakob to come over and build it on our stack, build it on Tensor2Tensor. And they did. And over the next, I think, ten weeks, it was just a sprint to build this model. And the intensity just ramped up and ramped up, because the results we were getting were extraordinary. AIDAN
So I think this was - it wasn't the first, but it was one of the very early, extremely successful scaling projects: hyper-scalable architectures, massive data, massive model sizes, and massive GPU clusters just leading to extremely high performance. CRAIG 11:55
First of all, Tensor2Tensor - that's a framework or an orchestration layer? AIDAN 12:04
Yeah. So it was built on top of TensorFlow at the time, but it was basically just a library to support large distributed model training. It had all the latest tricks and hacks - learning rate schedules, initialization techniques - all this stuff built in. And so it let us experiment really rapidly. If I'm being honest, Tensor2Tensor was a mess. It was crazy, just all over the place. It supported everything; we were throwing every new paper that came out into it. It was a little bit chaotic, and there exist far, far better systems nowadays. But back then it did the job. It did the job; we were able to move insanely fast. And so I'm quite proud of it. CRAIG 13:13
And you were - attention was already something that was being talked about. A couple of questions about that process. What was your role? I mean, I'm a journalist; I imagine you guys sitting next to each other furiously coding. I mean, were you coding? Or is it more that you're in a room with a whiteboard trying to figure out the architecture, or is it something else? AIDAN 13:51
There was a lot of whiteboarding and diagrams, conceptually structuring these building blocks and putting them together, thinking about the architecture itself - there was a lot of that. And that was mainly done by Noam, Ashish, Niki, and Jakob. For me - I wasn't sleeping. I was working, like, 14-hour days: coding, building up the infrastructure, making it more robust, running experiments. So it was very much hands-on coding, and no one was sleeping. Everyone was just hacking, experimenting, running little tweaks, little ablations to see: if I add this, what changes? If I remove it, if I tweak it? Every single one of us was just messing with everything and trying to figure out what was the optimal configuration. And so that's how we got to that finished product. CRAIG 14:57
Yeah, and certainly the result now is leading to automatic code generation. Were you using any tools to speed up the writing of the code? AIDAN 15:14
At that time? Nothing existed. Truly, nothing existed. It was all - you wrote it yourself. Yeah. That came later, and it was powered by transformers. CRAIG 15:35
I've read the paper and certainly talked to a lot of people about transformers and their progeny. But can you explain, in as simple terms as you can muster, what the transformer algorithm is and what it does? And I'm just curious, too: if you were to send me the transformer algorithm, sort of the basic algorithm, is it a million lines of code? Is it 20 lines of code? I'm just curious what it looks like. AIDAN 16:21
Yeah, nowadays it's probably closer to 20 lines of code. Extremely, extremely simple. I think a big part of the beauty of the model, the architecture, was the fact that it was just so simple. It is among the simplest architectures that were going around at the time. It's built from the most basic layer, the layer that has existed for, I don't know how many years now, maybe over half a century. The basic layer is called an MLP; that's just what it's called, MLP. And really, the transformer - it's a simplification, but it's just some MLPs stacked on top of each other, plus an attention. CRAIG 17:20
NLP? You're saying, like, natural language processing? M? No? Okay. AIDAN 17:25
Yeah, yeah. The name doesn't matter: multi-layer perceptron. CRAIG 17:33
Multi-layer perceptron sounds like a deep neural net. But - AIDAN 17:38
Totally, yeah, that's the fundamental unit. And before transformers, there were these very complicated LSTM architectures, with gates and all of these confusing bits and bobs that made it work. With the transformer, all of that was torn away, and the layer became MLPs plus one attention. That was it. And it was beautiful that you could carve away so much stuff and leave something so simple that performed so well, that was so scalable. So the architecture is not this hyper-complex beast. It's actually just a very simple, scalable, compute-saturating thing.
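As a rough illustration of that simplicity, here is a sketch of a transformer block in Python - not the original code, and with details like layer norm, multiple heads, and positional encodings omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Each position builds a query, key, and value vector, then mixes in
    # information from the positions it attends to.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def mlp(X, W1, W2):
    # The half-century-old basic layer: a multi-layer perceptron.
    return np.maximum(0.0, X @ W1) @ W2

def transformer_block(X, p):
    # The core of the layer Aidan describes: an attention plus an MLP,
    # with residual connections around each.
    X = X + attention(X, p["Wq"], p["Wk"], p["Wv"])
    X = X + mlp(X, p["W1"], p["W2"])
    return X
```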
CRAIG 18:38
Well, explain what it does. So you have the multi-layer perceptron as the base. How do you create attention? AIDAN 18:53
How do you create attention? Yeah. So attention is
this idea that you want to relate parts of a sequence to other parts. It's a fundamental property: if you have a sequence of things, things in a list, in an order, there are going to be relationships between those things. Obviously that appears in language very, very strongly. You have adjectives, which are tied to nouns, and tons and tons of structures like this. And since we were developing this explicitly for language, we wanted the model to be able to represent those relationships quite easily. That's what attention does. Attention says: for this word in this sentence, I'm going to learn which other words in the sequence it's related to. And so for the sentence 'the brown dog,' you're going to want to learn that 'brown' refers to 'dog,' and maybe 'the' refers to 'dog.' You want to model those relationships, and attention enables you to do just that. And it's not that simple: it's not just that the model is learning adjective-noun relationships. It's learning far more complex stuff that we probably don't even have a language to describe, that we just do intuitively in our heads. That attention layer is the fundamental unit of learning relationships in sequences. And it turns out to be extraordinarily powerful.
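To make that concrete with his own example, here is a toy illustration with hypothetical numbers: learned attention weights for 'the brown dog' might look something like this, with each word distributing its attention over the others.

```python
# Hypothetical attention weights for "the brown dog".
# Row i is word i's attention distribution over all the words; each row
# sums to 1. "brown" puts most of its weight on "dog", and so does "the".
attention_weights = {
    "the":   {"the": 0.10, "brown": 0.15, "dog": 0.75},
    "brown": {"the": 0.05, "brown": 0.15, "dog": 0.80},
    "dog":   {"the": 0.30, "brown": 0.50, "dog": 0.20},
}
```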
CRAIG 20:37
And how then does that scale? Because I've spoken to Ilya [Sutskever] on the podcast, and he talks about seeing the paper and, like, the next day implementing it in what they were doing, which led to the GPT models. How does that scale into the large language models that we see today? AIDAN 21:12
In its earliest form, it was a very naive
scaling: just take the model and make it bigger. And the way that you do that is you add more neurons to the network, you add more layers, so it becomes a much taller, much more deeply stacked model. And you take a much larger dataset than the one we were considering, a much, much larger model than the one we were considering, and a much larger pool of compute. You plug those all together, and what came out the other side, I think, shocked virtually everyone. It was quite extraordinary, quite extraordinarily convenient: that simply by scraping more data off the web - not necessarily clean data, messy data, just web data, taking in everything, and there's tons of junk out there - but taking in a very noisy, messy, massive dataset, making the model bigger, throwing some more chips at it, what came out the other side was something that understood language in a way I personally thought we were decades from. Yeah, it was quite an extraordinarily convenient and exciting reality. CRAIG 22:42
So that led to BERT, is that right? AIDAN 22:49
That in particular - BERT predated... or maybe I have them in the wrong order. There's some order: there's GPT-1, which was the first of these scaled-up large language model papers; I think BERT predated GPT-1, I think. But BERT is a different thing, kind of a different beast. Instead of learning to generate language, it learns to represent it. And that's a subtle distinction. Now we're all paying attention to the generate side, because it's so visceral, right? You can talk to these things and they can write back to you. There's a very visceral human reaction to something that can speak to you. AIDAN
But there's another side to this whole thing, which is representing language in a numerical form. And that's extremely important; it's hard to overstate how significant that is. That was the first killer application of transformers. It was integrated into Google Search, and Google themselves described it as the most significant advance in search quality in - I think it was two decades, 20 years, basically Google's entire lifespan. So that was amazing. We got a model, a program, that was capable of representing language to be used downstream for applications like search and classification, et cetera - extremely, extremely faithfully, in a very high-utility way, in a way that boosted performance beyond what we really expected across pretty much any task you threw at it. Any time you wanted to use language for some downstream thing, putting a BERT model there, taking the representations from it, and running with those representations, you beat state of the art; you outperformed everyone else. So maybe BERT was the first seed of this idea: we can take a transformer, we can set it against a very simple task on a very diverse set of data, and what comes out is something that seems to get language - it just seems to get it. That's if I'm right that it predated GPT-1; I'm not sure that's true.
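A sketch of what 'taking the representations and running with them' can look like downstream - illustrative only, with embed() standing in for whatever BERT-style model produces the vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, documents, embed):
    # embed() maps text to a fixed-length vector. Search then reduces to
    # comparing the query's representation against each document's.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(doc)), doc) for doc in documents]
    return max(scored)[1]  # the document whose representation is closest
```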
CRAIG 25:36
You'll forgive me, I want to get to Cohere. But I'm a layman; my audience is somewhere in between me and you - they're fairly sophisticated. So you've got 20 lines of code. You feed it some data, let's say a sentence. Within the neurons - or the perceptrons of the multi-layer perceptron - it's relating one piece of data, one word, to another word. How is it doing that? Is it by feeding huge volumes of data that it begins to see patterns? Or within that 20 lines of code, is something incredible happening? Is it possible to explain that? AIDAN 26:46
I think it's maybe one line of code that leads to that behavior; the other 19 are support. I would say the one line is the objective. It's what you're asking the model to do with the data you're feeding through, this hyper-complex pool of data. And what does it mean to feed it through? Well, what you're actually doing, in the generative case - this is the GPT-style case - is asking it, given all the words up to a point in a sentence, to predict the next one. And that sounds simple. It sounds like stuff we've had for a while, like tab autocomplete. But no: that objective is horrendously complex. Because on the internet there are examples of translation, right? Like these forums online where people teach each other how to speak different languages, and someone asks, hey, how do I say 'the brown dog' in Spanish? Stop. And then the person responds: oh, you say it by - I don't know how to speak Spanish, but whatever it is, right? And so if you ask your model to model this, the only way for it to accurately model this is to know how to speak Spanish, because it's seeing the English part - hey, how do I translate 'the brown dog' into Spanish - stop, and now it needs to produce the Spanish translation. And so you can see how, just organically, by learning to generate sequences in order, you're forced to learn extremely complex behaviors: translation, classification, writing code. You know, at the top of a piece of code you'll have a function signature, a comment, a docstring saying: this function does X, Y, and Z, it takes these inputs of this structure and outputs the following. And then if you're going to model that code, you have to learn to program, because you're just given a function signature and a docstring that humans wrote for other humans to read. So I think one of the most beautiful things that falls out of this is that with this very, very simple structure - here's a ton of data, learn to generate it, learn to predict the next token - you think you're asking the model to do something quite simple and minimal. The reality is you're asking it to do an extraordinarily complex set of tasks. You're asking it to understand our culture, our language, the interactions between us. You're asking it to understand that data at the deepest level. And so what you get out the other side is a model that, you know, roughly does understand, does have the capacity to do all that stuff, does understand our culture. I think that's another one of these beautiful simplicities. Such a simple objective - pick the next word - and what falls out of that, what you're actually asking it to do, is so extraordinary.
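That one-line objective is conventionally implemented as next-token cross-entropy; a minimal sketch of the idea, assuming model() returns a probability distribution over the vocabulary:

```python
import numpy as np

def next_token_loss(model, tokens):
    # tokens: a list of integer token ids. For each position, the model
    # sees all the words up to that point and is penalized when it
    # assigns low probability to the actual next word. Minimizing this
    # is the whole training objective.
    loss = 0.0
    for t in range(1, len(tokens)):
        probs = model(tokens[:t])          # distribution over vocabulary
        loss -= np.log(probs[tokens[t]])   # surprise at the true next token
    return loss / (len(tokens) - 1)
```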
CRAIG 30:15
And when you're - so there were, what, five of you working side by side? How many people were working on the project? Weren't there five or six names on the paper? I think there were eight. Yeah. But in any case, was there a moment - or did you know, going in, just from whiteboarding, that, wow, this could work? Or was there a moment when you were running tests and began to see these extraordinary results and knew you were on to something amazing? AIDAN 31:01
Yeah, there were definitely moments where someone would come running over from their desk and be like, yo, come look. And they had just run the eval, and it was state of the art, beat everything that came before. And then we would all be like, next - okay, let's keep pushing. And the funny thing is, it came together so quickly. It was really over the span of three months. This wasn't a year-long effort or anything like that; it was just a super fast iteration pace. I don't know if there was a moment. I really don't think anyone fully grasped the significance, and that's mostly because the significance wasn't there at the time. The significance came from the fact that people adopted it. They could have adopted something else; they could have leaned into something entirely different. They chose the transformer, for whatever sort of memetic effects led to that. But they chose the transformer, they started investing - the community started investing tons of time in building infrastructure and support, all the way down to the hardware level, for this particular architecture. And that enabled us - us being the entire AI community - to consolidate on one architecture. And so I've said this before, and I feel quite confident almost everyone on the paper would agree: it could have been another model. Frankly, it could have been another model. The transformer just had the best support, and then the community reinforced that. The community made some sort of decision to consolidate on this architecture and really invest in it, and they made it a success. It could quite easily have been another architecture that similarly scaled up well and saturated compute well. CRAIG 33:20
You think there are other architectures out there that just haven't been discovered or explored, that could lead to such dramatic results? AIDAN 33:34
Absolutely - unequivocally, I think, definitely. They exist; they're out there. And with enough work and effort, maybe we could flip to another architecture. But we've already done half a decade of infrastructure development and software support and, you know, writing highly optimized kernels for the hardware, for transformers. So there's this resistance to move, and it would take a lot of community willpower to move away from the transformer. The only thing that would motivate that is some new substantial breakthrough at the architecture level. So I don't see that happening. But I also don't make the claim that the transformer architecture is somehow divine. Clearly, you need certain pieces. CRAIG 34:34
I mean, right. But presumably these large language models themselves could at some point suggest other architectures. AIDAN 34:48
Yeah, people have wanted to use models in that sort of feedback loop. I think that's definitely - we're already starting to see chip architectures being decided by models. And so the chips train the model, and the model, you know, decides the next generation of the chip, and there's this feedback loop. CRAIG 35:20
Who's doing that? AIDAN 35:23
Google, mostly - their v4 and v5 TPU chips were model-placed, model-designed. Yeah. So I think that's exciting. It happens on a super slow timescale, because it just takes so long to actually fabricate chips, push them out, verify them. So that happens at too slow a timescale. The stuff that you're describing, the architecture search projects - I would say those have actually, surprisingly, been quite low yield. And that's probably because humans have spent so much time on neural net architectures. They've explored that space so thoroughly, and done a pretty compelling job of it. And so when we threw models at it, the gains were always marginal, or they rediscovered stuff that we had discovered previously and kind of missed, and they just brought it to light, surfaced it again. So people have kind of tried that, but it seems like architecture space has actually been saturated. Or perhaps the methods used - this was also at Google - perhaps the methods used weren't the right ones. It's hard to say. But there was an effort to try to get models to produce new model architectures and have this self-improving feedback loop, and I would say that it largely fell flat. CRAIG 37:05
So you went then from Google - well, tell me about how you started Cohere. AIDAN 37:15
Yeah, so I spent the better part of three years bouncing around. I was in Mountain View for the transformer. Then I went to Toronto, and Geoff said, hey, come hang out at Google in Toronto. Then I graduated from undergrad and went to Oxford for my PhD. Jakob from the transformer paper had actually decided to leave Mountain View and go back home to Berlin, and he was like, yo, I'm going to set up a Brain office in Berlin. And I was like, hey, that's pretty close, like a 40-minute flight from London - let's work together. So then I was on a plane every two weeks to Berlin to see Jakob and work there. And eventually I just realized there was a revolution that had kind of been promised. Back when I was in Mountain View, just after we had released the transformer paper publicly, Noam immediately started working on language modeling and scaling the models up - he was actually deeply involved in the GPT-1 paper; he was helping OpenAI with it. And then I went back to Toronto, and I got an email from Lukasz, like, hey, have a look at this. In that email there was a Wikipedia article, and the title was 'The Transformer.' And I was like, oh hey, a Wikipedia article on this. I kept reading down, and then it was about a Japanese punk band, with these members, and this member had left, and I was just like, what the fuck, Lukasz, what is this? And he was like, the transformer wrote this - I just put in 'The Transformer' as the title, and it wrote everything else. And I was like, you're kidding. It was surreal. It was like, you know, you went to bed one night and models could barely spell, and you woke up the next morning and they were writing as fluently as a human - such a plausible story about a Japanese punk band called The Transformer. And I think that was the moment I realized: okay, in product space this unlocks something categorically different, something extraordinary. AIDAN
And I thought it was going to happen.
And I waited and I waited, and, you know, I was in my PhD, putting out new research and improving fundamental methods. And after three years there, nothing had changed; the world was the same. And Nick and Ivan, my co-founders - I think we all felt the same disappointment. Nothing had changed. We saw something magical three years ago, and nothing had changed; no one was talking about it. And so eventually that disappointment turned into resolve to do it ourselves. We decided, okay, let's leave, and let's go build Cohere to bring this to the world. This is before GPT-3, just after GPT-2, in 2019. And back then the mission was really just: (a) this is the most amazing technology that humans have ever created - let's model the web, let's build a model of the entire internet - and (b) let's put it into the hands of every single developer on Earth, let's inject it into every single product and create a new generation of magical product experiences. So that was really the seed. Yeah. CRAIG 41:29
And then, so Cohere is, at its core, a large language model? Or a suite of models for different vertical tasks? Describe what it is and how people use it. AIDAN 41:53
So at its core, yeah - we're an intelligence factory, building these big models and making them as usable and as useful as possible. There's a suite of models. We have both sides of that coin I was describing before: the generative and the representation sides - so both BERT-style representation and GPT-style generation. We have both of those, and we build them in-house. AIDAN
The way that we bring them to the world is that we partner with enterprises, and we solve what are really some of today's largest blockers for adoption, which are privacy blockers and data compliance blockers. If you're really going to put these large language models into useful applications at the forefront of your product, they're going to be touching data that's the most sensitive - user data, right, people's private data. And so there's a very, very high security bar. For us, one of the benefits of being independent - our competitors are mostly bound to one cloud provider; there's exclusivity there - being independent means we can play with everyone. And the enterprises that use us don't get vendor lock-in, so they're not trapped in one cloud provider. They can bounce between them, and we can deploy wherever they go. AIDAN
So for Cohere, one of our core efforts right now is making it so that these models can be deployed on any cloud provider, in situations where the data is the most sensitive, because that enables the most interesting and impactful applications. Otherwise, you kind of get what I've been seeing a lot of recently, which is superficial deployments of these models - not real, not product-changing, not fundamental shifts in infrastructure, but more like: here's my product, and I'm just tacking it on to the side, like a bolted-on experience. I think that makes a lot of sense, given the fact that this year everyone just kind of woke up, and so it's going to take a while to actually replace that with the thing that we want. So it makes sense. AIDAN
But really, the piece that's blocking this is the fact that there's not a lot of trust in some of our competitors, due to the fact that in the past they've trained on their users' data and they've disintermediated people. And so for us, we want to regain that trust and be the trusted partner for enterprises to actually bring large language models in, in a truly transformative way. So I think right now there's a product transformation that's kind of simmering under the water, because the whole world just woke up. Every single company now is trying to figure out: what does this mean? What does this technology mean for my product, my experience? What are my users, the consumers, going to expect from me? How do I not get left in the dust by my competitors, who are going to reinvent their product on the back of this technology? So they're starting to do the work. AIDAN
In 18 months, product space is going to look completely different, because right now everything is shifting behind the scenes. And so for Cohere, we really want to power that transformation and be a trusted partner to the largest enterprises and the best developers on Earth. CRAIG 45:49
And enterprises span the gamut of industrial verticals? Or are you focused on one industry? AIDAN 46:02
It's totally, totally horizontal. So it impacts
everything. I think you're going to be doing your banking with a conversational agent; you're going to be doing your shopping with a conversational agent. I think it's really hard to think of a particular vertical or industry that doesn't need to be changed by this, because consumer expectations are going to shift: when I show up to a new product, there's going to be an interface that I expect, which is language. It's an interface-level change. In the same way that if you're a product or a service you have to have a mobile app, because everyone's on their phones and that's how they want to interact with products and services - in the same way that the mobile transition led to everyone having to support the interface the consumer expected - everyone is going to have to support conversation and dialogue with an intelligent agent as an interface onto their products and services. So there's this resurfacing of product space that is literally happening right now. CRAIG 47:11
Is there an example, without naming names, that you can give that you think is going to blow everybody away? AIDAN 47:21
I mean, it's no secret that we're starting to see some very compelling assistant-like offerings. There were the promises of Siri and Google Assistant and Alexa that came ten years ago, or whatever it was, and those fell flat; I think the technology truly just was not there to support it. There is now the possibility of a truly general assistant. We actually have the technological bedrock to support that. It's emerged recently - it's a fairly recent development that that has been unlocked as a thing you could possibly build. Yeah. CRAIG 48:19
You know, I talked to Ilya about RLHF, reinforcement learning from human feedback - his way of kind of guiding the model toward more grounded responses. But I've talked to other people who say that's still speculative and takes a lot of time, and they're using vector databases, loading vector databases with authoritative data, and then the language model in effect is just the mouthpiece. It's not calling up the answers from its accumulated knowledge; it's referring to this vector database. How do you guys deal with hallucinations? AIDAN 49:25
Yeah - there's someone, Sara Hooker, at Cohere. She said this before, and I really like it: you have to distinguish between the hallucinations that you want, which are like creativity, and the hallucinations that you don't want. It's great when it hallucinates a story or a new joke - you want that, and so you don't want to beat that ability, that capability, out of the model. At the same time, you need ways to control it. So, for instance, if you're doing knowledge gathering or research, you definitely don't want anything made up. There's almost zero tolerance for hallucination. AIDAN
And so you kind of want a gradient, or a parameter that you can set, which might be the creativity parameter. And I think that's becoming increasingly possible.
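One common knob of this kind in today's systems is sampling temperature - an illustration of the idea, not a parameter Aidan names here:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    # Low temperature makes the model pick its most likely token almost
    # deterministically (good for factual answers); high temperature
    # flattens the distribution, allowing more surprising, "creative",
    # hallucination-prone choices (good for stories and jokes).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```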
Another really good way to get models to be more truthful is to actually force them to cite their work. So there's Patrick Lewis - he was the first author, at Meta,
on creating RAG; it's called retrieval-augmented generation. And so that's this idea that you have a model, and you have an external knowledge base, or maybe multiple external knowledge bases - maybe one's Google, one's your private emails, one's whatever. And what the model can do is go out and query these sources. So it can say, hey, the user just asked me about this; I think I should query Google. And then it gets back some documents from Google, or it gets back from your email whatever emails it's looking for. And now that it can read those, it can generate a response, and it can cite back to them. It can say: you asked me this, and I think this is the answer, because of this sentence inside of this document or this webpage. By forcing the model to learn to cite its sources, you get two things. One is that you can actually verify it, right? You can check that it's telling the truth. You click into that link, you read the thing, and you can say it lied - or you can say, oh no, it's right, that checks out. So one is you get it to cite sources. The other is that you reinforce into the model the behavior of not making claims without grounds for those claims. And so it starts to learn the scenarios where - you know, when I'm writing stories, I don't really need to cite sources; I just need to write, the user is happy and content, and I get a good reward. And in the scenario of, I'm doing research on a topic, can you tell me about X, it starts to learn: okay, shit, in this case I need a very rigorous bibliography, I need to be able to tie everything back, and if I mess up - if the user clicks through and sees an error or a hallucination - I'm going to get super strong negative feedback. And so it learns to differentiate between these scenarios. So I really do believe retrieval augmentation, along with human feedback, is going to be one of the key pieces of making these models more reliable, more grounded.
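A minimal sketch of the retrieval-augmented loop he describes - the function names and sources here are illustrative placeholders, not Cohere's or Meta's actual API:

```python
def answer_with_citations(question, sources, generate):
    # sources: dict mapping a source name (e.g. "google", "email") to a
    # search function returning documents relevant to a query.
    # generate: a language-model call that drafts an answer from a prompt.
    documents = []
    for name, search in sources.items():
        for doc in search(question):
            documents.append((name, doc))

    # The model reads the retrieved documents and must tie each claim
    # back to a numbered document, so the user can click through and verify.
    prompt = "Answer the question, citing the documents you rely on.\n"
    for i, (name, doc) in enumerate(documents):
        prompt += f"[{i}] ({name}) {doc}\n"
    prompt += f"Question: {question}\n"
    return generate(prompt)
```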
CRAIG 53:07
That's fascinating. I'm coming up to an hour - can I ask a few more questions? Yeah. I've got to ask this: the public release of ChatGPT has set off this debate about how dangerous these models can be. To everyone's surprise, Geoff has gone public saying some really dire things, which - you know, I don't know him like you do, but I've known him for a while, and it surprises me; I've never heard him speak that darkly about something. Do you have a view on that? That's one question. And then the other is this debate about sentience or self-awareness. I mean, you've had your fingers in the brains of these things. Do you think that sentience or self-awareness could really emerge? Or do you think that, you know, these are bits of code and it's all an illusion? AIDAN 54:40
There's a lot to say - we need another hour or two together to properly represent my beliefs around that question. On the first part, about Geoff: Geoff went through the same thing I think many of us in the field went through, where our timelines got pulled forward massively. You know, we thought we'd have models that could write compelling English in a few decades, and then suddenly it shows up a year later. And that throws you into this state of shock and uncertainty; you're quite caught off guard. He's spoken about this publicly, I think - that sentiment of surprise at the progress and rate of change. I remember having conversations with him myself where both of us were kind of like: these people who talk about AGI - what nonsense, haha. This was back when models could barely spell. But then you get surprised and shocked, and your uncertainty blows up. And sometimes that can have the effect of: okay, anything's possible; oh my God, I was so far off on that, so now I'm shooting my uncertainty up across everything - anything could be possible, a superintelligent god, okay, maybe even that. So, like a lot of folks, we're all reckoning with that and recalibrating, adjusting our own timelines and understandings of progress and pace. AIDAN
Geoff is extraordinarily thoughtful, and he's been thinking about this since at least the beginning of Cohere - so for at least the last three and a half years he's been thinking about this very, very deeply. So I think people should take him very seriously. I think there will be a lot of sensationalism and a lot of extrapolation from what he's saying. But if you actually listen to what he's actually saying, it's quite measured. He's saying: I'm highly uncertain about what can happen, and that means we should take this stuff seriously, because we just don't have certain bounds; we don't have certainty around the future. And so we should be taking all the different possibilities quite seriously - not saying that they're likely to happen, just saying that they can't be ruled out yet, so let's take them very seriously. I think there's a lot of journalistic text and headlines and clickbait and nonsense. But if you actually listen to Geoff, I think his take is quite measured and reasonable. CRAIG 57:47
And actually, I'd love to have you back on to talk at length about these things. But on the idea of sentience, or the illusion of sentience - I mean, you know more than almost anybody, having built these models, both what they're capable of and what's behind their expressions. It's a philosophical question about what sentience or consciousness is - whether our consciousness is just an emergent property of the neural activity of our brain, and largely an illusion. What would you say to all of that? AIDAN 58:54
Yeah, I would say I don't place a divinity on humanity. I think that consciousness is in the brain, and it is a physical process. And maybe - maybe consciousness is what computing feels like, what processing feels like. And if that's the case, it's really hard to argue that that same phenomenon couldn't be present in silicon. I think there has to be a leap, right, to say that the circuits in our brain, because they're human or because they're biological, have some sort of fundamental distinction. I think you really have to take a leap of faith there. And so, just being pragmatic and reductive - again, we need two hours to discuss this more completely - but just as a scientist, I think it'd be really hard for me to say there's no way these machines could become sentient. I just can't construct an argument for that. CRAIG 1:00:27
Yeah. Well, let's leave it there. But -
Can I get a promise that you'll come back, you know, in a few months, and we can go deep on that subject? AIDAN 1:00:43
Yeah, I'd love to. CRAIG 1:00:46
Okay. Aidan, this has been really fascinating; I'm delighted. And I'm sure you heard: at the MIT Tech Review conference, somebody asked Geoff - he was on virtually from the UK - whether he would divest himself of Cohere. And he said no, he's going to stay invested. So, yeah, that's a funny question. Okay, great. Well, I really appreciate your time, and we'll talk again. CRAIG
That's it for this episode. I want to thank Aidan
for his time. I also want to remind you to check out NetSuite, Oracle's business management software for enterprise resource planning, financial management, customer relationship management, and e-commerce, among other things. Go to netsuite.com/eyeonai to take advantage of this offer. CRAIG
And remember: the singularity may not be near, but AI is about to change your world. So pay attention.