[MUSIC PLAYING] DEEPTI SRIVASTAVA:
Let's get started. So I am very excited
to be here today to talk to you all
with Andrew Fikes, who is a VP and Engineering
Fellow at Google. And I'll let him
introduce himself, and then I'll go and talk
to you about who I am. ANDREW FIKES: OK. Well first off, welcome. I'm excited to see
everyone here today. I was promised an intimate
talk in an intimate setting with just a small
number of people. This is amazing. This is really exciting to me. I've worked in the distributed
database space for a long time, and seeing this many people that
are as excited about storage as I am-- which is a hard thing to do. I'm super geeky about it,
so I'm excited you're here. First off, just a
little bit about myself, I've been at Google 18 years. This is the only
job I've ever had. I came right out of school
and went into Google, so I've kind of grown up
in the Google culture. I've grown up sort of learning
things from the ground up. I was the tech lead
of the Bigtable team. We have our external
product, the Cloud Bigtable, which I worked on
for many, many years. Tech lead in the Spanner
team through most of its initial development,
and the last three years, I've served in a role we call
Area Tech Lead, which covers, in my case, both
storage and databases. And I cover basically
a tech lead role for everything from the disks
at the bottom to a little bit of the networking in
between all the way to the distributed
databases at the top. So I know quite a bit about
the distributed database space, and I know a little bit
about everything else. So hopefully, I'll be able
to answer a bunch of questions that you'll have today and
give you a little perspective about engineering at Google. DEEPTI SRIVASTAVA:
Thank you, Andrew. My name is Deepti Srivastava. I am the product manager
for Cloud Spanner. My entire tenure at Google
has been in Spanner, so it's a few more
years than three. Before that, I was at Oracle
working in the RAC database kernel. And before that
I was researching distributed systems. And I actually never thought
that databases were cool, because that's 30-year-old
technology, right? Who cares? ANDREW FIKES: I had
the same problem. I never took a database
class in college, because I never wanted
to work on databases. DEEPTI SRIVASTAVA:
Yeah, and here we are. ANDREW FIKES: And here we are. DEEPTI SRIVASTAVA: And it's
been a very exciting journey. I worked with Andrew to
launch Spanner internally-- and the Chrisses here. There are some Chrisses here too. To launch Spanner
internally as a service, and Spanner has been
running internally as a service managing a bunch
of critical infrastructure for over seven years. And it was my Noogler
project to launch it, so it's very exciting
for me personally to talk about the history and
journey of Spanner with Andrew. Before we go further,
there is a mic here. Unfortunately, we
have only one mic. So as you have questions,
if you could maybe make your way there, and
if you want to shout it out from where you are, we'll repeat your question without changing it and then answer. So let's get started. OK, so first, I get to do
my, what is Cloud Spanner? So for those who don't
know, Cloud Spanner is one of our database offerings
on Google Cloud Platform. It is basically the
same infrastructure as internal Spanner, and it's
exposed via the Cloud API, so that's the difference. Sort of the access
path is different. But it is essentially
a differentiated service on Google Cloud,
because it combines the benefits of
relational semantics with non-relational or
NoSQL horizontal scale. So we call it our no compromises database, in that you don't have to compromise between schemas and SQL and strong consistency and ACID transactions and the global scale that you get with Spanner.
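To make that concrete, here is a minimal sketch of what those combined properties look like from a client's point of view, assuming the google-cloud-spanner Python client; the instance, database, table, and column names are hypothetical placeholders, and the exact call signatures may vary by client library version.

```python
# Minimal sketch, assuming the google-cloud-spanner Python client.
# "my-instance", "my-db", and the Accounts table are hypothetical placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Strongly consistent SQL read over horizontally partitioned data.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT AccountId, Balance FROM Accounts WHERE AccountId = @id",
        params={"id": "alice"},
        param_types={"id": spanner.param_types.STRING},
    )
    for row in rows:
        print(row)

# ACID read-write transaction; the client retries the function on aborts.
def credit_account(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 50 WHERE AccountId = @id",
        params={"id": "alice"},
        param_types={"id": spanner.param_types.STRING},
    )

database.run_in_transaction(credit_account)
```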
So today's talk is really about how this awesome-- I love this product. I call myself a database
person now through my journey with Spanner-- so how did
this system come about, both the technical history
as well as some of the people history around it. And this is what we offer
essentially externally on the Google Cloud Platform. So Andrew, what
came before Spanner? ANDREW FIKES: So when I start
to tell the Spanner story, I think it's sort of
important to think about where Google was at that time-- so essentially where we were in
about 2007, 2008 as a company. Google historically
started as a search company and over time grew to be an
ads company, a mobile company, a geo company, I'm
probably forgetting a few different
things like that, and eventually a cloud company. But around that
2008 time, we really had kind of an interesting set
of infrastructure internally. We were a large
MySQL database shop on our ad side of the world. We ran about 90
database shards and had kind of a home-built query
system that worked across them. So we were dealing with
large scale distributed partitioned MySQL instances. We were also a very
Bigtable shop at that time. So we started
Bigtable, I want to say 2003, 2004, somewhere in there. And it was really focused
on the search landscape. We wanted to build a Bigtable
that had as its primary key the URL of a web page,
and then its columns would be the actual documents
and page rank and other sorts of things in the other columns. So we had a MySQL shop and we had a Bigtable shop. We were also, however, starting
to become an apps company. So we all know and love
and use G Suite today, but there was a
time when Gmail was a kind of a smaller product. And so as Google was starting to
become a calendar company, a-- DEEPTI SRIVASTAVA: Drive. ANDREW FIKES: --Drive company
and these sorts of things, we were starting
to get these needs of more complex
applications being built on top of our infrastructure. And so really that was kind of
the ecosystem we had at that time. DEEPTI SRIVASTAVA: Yeah. Do you want to touch
upon Megastore at all? ANDREW FIKES: Right. So Megastore was kind
of our first foray into something that
looked close to Spanner. And it was an
application library that was built on top of
Bigtable that actually did use Paxos, but provided
a much smaller data partition, what we call an entity group. And it was rolled out as a client library. So you had to link
it into your client, and you had to deal with client
semantics on top of Bigtable. DEEPTI SRIVASTAVA: Yeah, so
there's a paper on Bigtable, if you care. There's a paper on
Megastore, if you care. And as these products evolved, we can touch upon that journey. And if you thought that we
weren't going to get technical, like we have gotten technical. So what prompted the
need to build Spanner given we had awesome
Bigtable that engendered a bunch of other
NoSQL systems outside, and we had Megastore,
which was almost Spanner? ANDREW FIKES: Yeah, I think the
most obvious thing was scale, and scale with
these applications with things like Gmail,
where we were starting to think about, OK, Gmail
is in three data centers, for example. Well, what happens if Gmail
gets too big for a data center? How do we split it up? How do we manage the geographies
that are associated with users? And we had a large user base
on these applications that was developing in Europe
and in other countries, and we didn't want them to
have the same latency semantics as the people in the US. So we were really thinking
about this sharding concept, so we were really interested
in that geo concept. We were also starting
to really think about the importance
of consistency and what it actually means
to have a consistent system and how it impacts our
developers and our users and our productivity
and velocity. And so that was sort of
coming around in there. Interestingly enough,
right around this time, we decided that 90 shards
was enough MySQL shards, that we had to reshard it, and every resharding took a couple of years. And we kind of thought, if
we were going to do it again, maybe we should think
about whether or not we should do something different. So we really had the
growth of the apps and this interest in consistency,
replication, transactions, and those sorts of things,
because users actually do notice inconsistencies. And we also had this
growth on the ads, sort of more financial database
side, to just really be able to scale to the
enormous business that Google was developing. DEEPTI SRIVASTAVA: Yeah. Just to give you a sense-- so Andrew's being very modest. Like the ads database is how
Google used to make money, so it was really
important that we didn't spend two years plus
every time we had a reshard. And I don't know
how many people here remember when Gmail was like,
you could get five friends? Like if you had a
Gmail account, then you could get five friends
to get a Gmail account. So that scale of like going
from, whatever, some number of users to 7 billion accounts
or whatever it is today, that's truly like the
scale and the time that we're talking about here. So before we go into
that stuff deeper, did you get it right
the first time? ANDREW FIKES: Right. So one thing I
hear a lot is, how did you come up with
this great idea? Like you published these
papers, they're sort of seminal. Like what's the process? And I think what people
miss a lot externally is, we spent years
getting it wrong. So we actually started
thinking about Spanner and split off a chunk of people
from Bigtable to work on it. And I think we built-- and Chris can probably attest--
three or four different systems over 2 and 1/2 or three years
before we landed on the one that we ended up with. We had some horrible mistakes. One version of Spanner
actually supported recursive directories,
and we had RPC operations that could
do recursive descents on directories. And as you can kind of see,
that's really, really far away from the relational model. The good news is that
when the ads team came to us, we sort of got our
hands on a real customer that we were trying
to work with. And when people ask
me, how do you decide to build infrastructure,
I think it starts with a really good customer. And for us, that really good
customer was the ads database. DEEPTI SRIVASTAVA:
Yeah, customers first. So let's get into a
little bit of the details and talk about, how
is Spanner built, what are the building
blocks, if you could talk to us about that. We have an awesome picture
here that you can talk about, because I think that
picture's super cool. ANDREW FIKES: Wow. So one of the great things
about building infrastructure at Google is you start with a
really great set of resources. So this is a really
good example of what it's like to build
infrastructure in our cloud. These basically show
our network lines. One of the gifts that
we have as engineers is that our internal networks
within the data center are phenomenally good. I haven't really worried
about networking performance, networking loss for
a really long time. And as a storage
person, that means you can take a whole set of
things and just set it aside, and you can think about the
algorithms and things that lay on top. That same internal experience
we get across large spans, as well. We have three redundant
lines out of data centers. We have extremely
high availability. Being able to depend on that
level from your network really simplifies a lot of your
distributed system building. DEEPTI SRIVASTAVA: I
know it says it there, but anyway, the blue
dots are basically where GCP or Google Cloud has a presence, externally as well as-- the number indicates how many
availability zones there are, and how they're all connected
via Google Fiber, essentially. So to Andrew's
point, when customers ask me like, well, what
will the latency be like for this globally
distributed massive thing, I'm like, not very much. They're like, how
do you go over the WAN? And we're like, we don't have
to go over public internet for the most part. So that's super awesome. I wonder if you want to
touch upon Colossus before we go to the next thing. ANDREW FIKES: Yeah. So I think the other area
we've always really depended on is, we do have a very
large scale distributed file system that sits as the
basis of all of our distributed storage systems. We published a
paper on GFS, which was the very first
version of that, I think probably in 2002, 2003. The one we use now today is
actually called Colossus. And it basically occupies
many, many megawatts of power and footprint of space. And we really depend on that
for all of our write operations. And so that means the durability
of the data, for example, is something that we
can delegate down. It also means that we're
in an environment where we can disaggregate
compute from storage. It gives us the
ability to respond to hotspots on one server by
actually indirectly picking that piece of data up and
moving it to another server while leaving the
storage beneath. Fun little fact-- Colossus
is actually built recursively on top of Bigtable. So this is a place where we're
reusing our infrastructure in order to be able
to scale it out. DEEPTI SRIVASTAVA: The
amount of virtualization that's happening in our
infrastructure stack is pretty astounding. Because again, customers ask me,
what happens when a disk fails? And I'm like, I haven't
thought about a disk failure in I don't know how many years. ANDREW FIKES: Yeah. When I spoke earlier
about now having a lot of visibility into everything from the disk layer on up-- Google actually builds
its own disk appliances. And we spend an
enormous amount of time thinking about the availability
of a disk appliance, the cost of a disk appliance,
how fast we can execute RPCs against these things. We continually improve our
efficiency and effectiveness of those disk appliances, and
we deploy them at massive scale. And being able to have that
incoming pipe of people working on, even just the
building of raw components that you're building
these systems on, it's an absolute luxury. DEEPTI SRIVASTAVA: Yeah, I think
we're a hardware company too as much as a data center business. So this is the other part
that we basically win on, which is TrueTime. So walk us through this. ANDREW FIKES: Yeah, so I can't
be more excited to show you this most boring rack. I think one of the most
interesting comments I usually get when people read the Spanner
paper is, they're like, wow, atomic clocks. And then you're like, yes,
you can get them on eBay. They're not as expensive
as you think they are. This is actually our
latest generation, so this is rolling out
into the fleet now. The TrueTime team is
a really small group. This basically shows a
fairly standard configuration that we have within
a data center. There are four time
servers in here. Two are connected out to the
GPS antennas that you see. This provides a
GPS-calibrated signal. Two of the other
remaining servers are actually connected to
two atomic oscillators. There are two different
brands of oscillators here, so we get different failure
behaviors, actually, and the same with the
GPS cards, as well. And this is connected
into our fabrics. And what this does is it
provides a really accurate time signal with bounds. And this is sort of the
principle upon which Spanner is built, this idea of not
just saying a point in time, but an interval in time. And what can we do with a
very highly reliable, accurate interval of time? And that's what
TrueTime provides. And I can't speak exactly to where we are now, but it's clearly less than a millisecond of drift. And there you go.
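As a mental model of what an interval-based time signal gives you, here is a toy Python sketch along the lines of the TT.now()/TT.after()/TT.before() interface described in the Spanner paper; the uncertainty bound is an illustrative constant, not a real measurement, and this is not Google's implementation.

```python
# Toy model of an interval clock (illustrative only, not Google's TrueTime).
import time
from dataclasses import dataclass

ASSUMED_UNCERTAINTY_S = 0.001  # hypothetical ~1 ms bound, for illustration

@dataclass
class TTInterval:
    earliest: float  # no correct clock is earlier than this
    latest: float    # no correct clock is later than this

def tt_now() -> TTInterval:
    t = time.time()
    return TTInterval(t - ASSUMED_UNCERTAINTY_S, t + ASSUMED_UNCERTAINTY_S)

def tt_after(t: float) -> bool:
    # True only once t is guaranteed to have passed on every clock.
    return tt_now().earliest > t

def tt_before(t: float) -> bool:
    # True only while t is guaranteed not to have arrived yet.
    return tt_now().latest < t
```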
DEEPTI SRIVASTAVA: Yeah. This is one of the cornerstones that allows us to be the
differentiated technology in terms of external
consistency. And we'll get into
that, but I think it's important to
understand that it's not just about atomic
clocks and GPS, but about the hard
work that goes into building these systems and
the infrastructure around that. OK. So let's quickly touch upon,
given those building blocks, how does Spanner serve a request? What is the life of
a read and a write? I am very excited to
say that Andrew actually drew this by hand. He's good with crayons. ANDREW FIKES: Woohoo! [APPLAUSE] DEEPTI SRIVASTAVA:
And his writing is remarkably good, right? ANDREW FIKES: I went to
an engineering school. So yeah, my kids actually
helped me with this. It was great. We did it this weekend. It turns out that even
an eight-year-old can design a distributed system. Hers didn't look quite as good
as mine, but we will take it. So we do get a lot
of questions about, how does Spanner work,
what exactly happens when I issue a read or a write? So typically what I like
to walk people through is the life of a read
and the life of a write. At kind of a high level-- you can see it here-- I think it's easier to start
with the life of a write. We have a client. These are typically
in a VM, although they could be other places. They go to our load balancers. And from there, they go to a
front end, which sends them on to a span server. A span server is what actually stores-- what would be a good
as Paxos groups. These are basically
data containers that are spread across multiple
zones for availability. Commits have to go to two out
of three, or three out of five, or some number that
forms a quorum. As you can see here, the span server in Zone A will
take that write that it gets, and it will forward it
on to Zone B and Zone C. It will wait until two
of those come back, or two total come back,
both its local one and one of the others. And then it'll return
back to the client, and that's the point at
which it becomes durable. You can see referenced here Colossus, which I mentioned earlier; this
is our big distributed file system. Span servers, we
run tons of them. Internally, Spanner is
heavily, heavily used, so we're talking about
thousands to tens of thousands of these
running in the data center. Your data resides on
a fraction of those.
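To make the write path a little more concrete, here is a simplified, hypothetical Python sketch of the quorum step described above: the leader's zone persists the write, forwards it to its peers, and reports it durable once a majority has acknowledged. It is a conceptual illustration, not Spanner's actual Paxos implementation.

```python
# Conceptual sketch of a majority-quorum write; not Spanner's real code.

class Zone:
    """One replica of a Paxos group in a given availability zone."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def append_to_log(self, write):
        self.log.append(write)   # stands in for persisting to Colossus
        return True              # acknowledge once locally durable

def replicate_write(write, zones):
    """Return True once a majority of zones (leader included) have acked."""
    needed = len(zones) // 2 + 1          # 2 of 3, 3 of 5, and so on
    acks = 0
    for zone in zones:                    # real systems do this in parallel
        if zone.append_to_log(write):
            acks += 1
        if acks >= needed:
            return True                   # durable: safe to ack the client
    return False

zones = [Zone("A"), Zone("B"), Zone("C")]
assert replicate_write({"key": "k", "value": "x"}, zones)
```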
DEEPTI SRIVASTAVA: Before you move on to the reads, really quickly just to
level set, because we live and breathe this. So Spanner is a synchronously
replicated database. What that means is that
we have full copies of the data, database data, in
at least three, if not more, availability zones. The copies of data
are in Colossus, which is the distributed
file system that Andrew talked about. And because we have this
disaggregated storage and compute, span servers
here are the compute units that act on the data. So the questions
around, well, what happens if a server
goes down, do you have to reload
all the data in memory in a different server,
are kind of orthogonal, because the data is always
in the storage layer. And all these compute
units or span servers are doing is being responsible
for serving the data. And so when Andrew
talks about, a client sends a write,
what they're doing is, they're saying
write key with value x. And that means that
there's an operation that needs to happen in any
one of the compute units. This is really
important, really. Any one of the compute
units can pick it up. We use Paxos to figure
out who the leader is. We don't have a
concept of master. And there is a
paper out on Paxos. Essentially, it's a
group consensus protocol that allows you to choose
at-most-once semantics, essentially, for a given point
in time or a given operation. Does that summarize it OK? But I just wanted to
make sure we understand what we're talking about. ANDREW FIKES: Yeah. I think a few other things
that are happening here is, during writes, we do use
TrueTime to pick a timestamp. Those timestamps
are used to create an ordering, which we talk about
as being externally consistent. That means that, if you do
two successive operations, they will get
increasing timestamps. So an operation that happens
after another operation in real time will actually be ordered that way.
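A hedged sketch of how that ordering can fall out of an interval clock, following the commit-wait idea described in the Spanner paper and reusing the toy tt_now()/tt_after() helpers from the TrueTime sketch earlier: assign the commit a timestamp at the top of the current uncertainty interval, then wait out the uncertainty before acknowledging, so any later commit must pick a strictly larger timestamp.

```python
# Conceptual commit-wait sketch; reuses the toy tt_now()/tt_after() helpers
# from the earlier TrueTime sketch. Not Spanner's actual implementation.
import time

def commit(write, apply_fn):
    commit_ts = tt_now().latest       # a timestamp no clock has reached yet
    apply_fn(write, commit_ts)        # replicate and apply at commit_ts
    while not tt_after(commit_ts):    # commit wait: let the uncertainty pass
        time.sleep(0.0001)
    return commit_ts                  # only now do we ack the client, so any
                                      # commit that starts after this ack
                                      # observes a strictly later timestamp
```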
I think the other thing that's interesting from this diagram is really to think
about, where does high availability come from? And high availability
really comes from the fact that we are replicating data. So for example, if Zone C
falls off the face of the map-- earthquake, California,
that sort of thing-- we can still continue
to issue writes to Zone A or Zone B. If we
have a crashing bug in one of the span servers,
for example, the other ones can pick it up. And so you get high
availability by taking these components and their blast radiuses and what happens when they fail, and making sure that you have other things to pick them up. DEEPTI SRIVASTAVA: Yeah. Do you want to walk
through the read? ANDREW FIKES: Yeah. A read is also very simple. I drew it here showing that
the reads don't necessarily have to go to the
leaders in order to be strongly consistent. So a client might do a read. He might go to one of the other
replicas of the Paxos group. Now, when he gets
there, he basically picks a timestamp, which is what
we call now plus epsilon, which is typically a time at which
we know no writes that we could have seen were committed. That front end will
basically say, OK, pass it on to a span server. The span server will say,
OK, do I have all of the data up to that time? And if enough time has passed--
because we're talking order milliseconds, and
writes may be going on-- he will actually just
return the data then. If he doesn't have the data,
he'll reach out to the leader and say, dude, I hear you
have some data for me. And the span servers will
say, yes, here you go. Either here's the data
you need, or you actually have all the data, I
just haven't told you that you have all the data. And in that case, the read
will be performed locally.
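As a rough model of the replica-side check Andrew just walked through, here is a toy Python sketch: the replica serves the read at the chosen timestamp only once its applied state has caught up that far, and otherwise catches up from the leader first. The names and structure are hypothetical; this is not Spanner's actual protocol.

```python
# Conceptual sketch of a timestamp-bounded read at a non-leader replica.
# Illustrative only; not Spanner's real protocol.

class Replica:
    def __init__(self, leader=None):
        self.safe_ts = 0.0   # every write up to this timestamp is applied
        self.data = {}       # key -> (value, commit_ts)
        self.leader = leader

    def read(self, key, read_ts):
        if self.safe_ts < read_ts and self.leader is not None:
            # Catch up from the leader (or just learn we are already current).
            self.data.update(self.leader.updates_up_to(read_ts))
            self.safe_ts = read_ts
        value, _ = self.data.get(key, (None, None))
        return value         # served locally once safe_ts >= read_ts

class Leader(Replica):
    def updates_up_to(self, ts):
        return {k: v for k, v in self.data.items() if v[1] <= ts}
```

DEEPTI SRIVASTAVA: Yeah. So I think the point here that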
is different from other systems is that we are, in
almost all cases, serving the data from
the local replica. There are other systems
that ship buffers around when they're replicated. We're not shipping
buffers around. What we're saying
is, hey, what's the latest timestamp
that I can serve data at? Do I have it or not? And if the answer is yes, you
have it, then you go serve it, or you wait for the data to be
shipped as part of the Paxos protocol separately. So you're not actually shipping
megabytes and megabytes of buffers around. ANDREW FIKES: Yeah. And just to give you an
order on the timing here, you could imagine that zones
A, B, and C are actually all within a region. In that case, your
writes are on the order of five to 10 milliseconds,
because that's what it takes to create a quorum
across those data centers. For a read, it would
be much, much faster, because you're only reaching
out to one data center. So once we get through the
load balancer and everything like that, I think it's
in the three to five range from external cloud. DEEPTI SRIVASTAVA: Yeah. All right, so let's talk
about some of the aha moments where we-- well, you at least-- chose to become a database person from a distributed systems person. ANDREW FIKES: Yeah, so I
think people say, well, what's great about Spanner,
what should I take away? I think there are two things
that I typically reference. One is that, as a
distributed systems engineer, you're growing up, and everybody
tells you you cannot trust time. The clock on your server
is absolutely something you can't trust, right? And you develop sort of this
innate fear of that thing. It's like legendary. And you also probably experience
some really bad things when you decide to trust time anyways. Having the ability to
actually trust time, and we are now 10
years later, and I can tell you that I completely
trust our time system. It has made my life
so much easier. Being able to basically have
a source of global ordering that I don't actually have
to reach out and talk to, that I can reference locally,
is just a huge, huge, huge thinking shift. And it took us actually quite
a long time to get over that. When I said we did two or
three versions of this system, some of the earlier
ones actually worked more on logical clock
passing or figuring out how to weave these things through. And weaving timestamps
through your system actually has some really
bad effects, usually, on your client. You're like, OK, well, this
person can give me a timestamp. Now I can take this timestamp
and give it to someone else. That means that the application
actually has to see it. And so being able to trust
time and use that, in a way, actually has really good upper
level API effects as well. The second sort of aha moment
was that we spent a lot of time not implementing
transactions in Bigtable, for a variety of reasons. And the main reason is
that the availability of any one of our
Bigtable tablets was not sufficient
to support transactions. And so what Paxos and
groups bring us is they bring us highly
available participants. So the other aha
moment in here is you can do a lot with highly
available participants. You can trust them
to be coordinators. You can trust them
to hold locks. You can trust them to
do things like that. So it's really the combination
of both trusting time and having highly
available components upon which to build transactions that makes Spanner unique. DEEPTI SRIVASTAVA: Yeah,
I remember gettimeofday()
never put it in your code. But TrueTime along
with Paxos gives us that external consistency of-- and I love saying
this to customers, because they're like,
I don't believe you. But you can write from
anywhere in the world into a Spanner database, and
you can literally say, I wrote-- like I incremented my
bank account with $50-- and anywhere in
the world, somebody can read that timestamp and
actually get an answer for, it was plus 50, which
is an amazing thing to-- ANDREW FIKES: Have. Yeah, $50? DEEPTI SRIVASTAVA: Yes,
it's better than 0. OK, so let's move on
to a little bit more on the nostalgia piece,
which I find exciting. But we talked about
scaling systems, but you also have
to scale teams. ANDREW FIKES: Yeah, I
think every good project-- and I talked to
some people earlier with sort of small teams. Spanner, of course,
started with a small team. It started with four
people in an office. I think this picture is in 2012. It shows about
five or six people. We look heavily engaged in
a particular event probably, in terms of a server going down
or us debugging sort of things. It's kind of fun to look back. Chris is down here in the front. He looks much younger
in that photo. DEEPTI SRIVASTAVA: There
were fewer critical systems running on Spanner at the time. And I also want you to
talk about this one. It's super awesome. ANDREW FIKES: Yeah, so
this was my cube mate. This is Mike. He's a fan of Ascii art. These were our first 100 users. So every now and then
you see a $20 bill posted on the front of a
deli or something like that. This was our screen
as we watched internal users adopt Spanner. And when we hit 100 on
our lovely VT320 here, we had a little party. DEEPTI SRIVASTAVA: I was
actually there for this one. So I want to have people
look at this, and then this. ANDREW FIKES: Yeah, so a
couple years later, we're getting a little bigger. We added a few fish into the
mix, as any good team has. And a few of them I
think are still kicking. DEEPTI SRIVASTAVA: Andrew
and I commissioned this logo, because, of course,
we were a service, and we needed a product. And we needed a logo,
so we had this logo. And if you don't
understand what that is, it's obvious this is a
Spanner turning Google. ANDREW FIKES: It's a Spanner. DEEPTI SRIVASTAVA: And then-- ANDREW FIKES: Yeah,
so this is actually a couple of years
ago at this point. I think the key thing
to take away from this is that building a
distributed system that is as highly available
and actually has all of the features that
you all know and love is actually a relatively
complex problem. The other thing that I've
really grown to appreciate is that database problems
and systems problems are actually full
stack problems. And so the amount of complexity
that is in these systems now is substantially more. When we built GFS in
the early days, which was our first
distributed file system, we almost decided not to do it. Because we thought it was
too complex and nothing that complex would ever work. It turns out it's the simplest
thing we've ever built, and Spanner has many more
layers of complexity on it. I think the other thing
you can see in this picture is this enormous giant
wrench in the background. That actually is
used for trains. So if you ever need to
change a train wheel, we have a wrench for you. DEEPTI SRIVASTAVA: Yeah, we
actually have three of them, because we don't-- ANDREW FIKES: We have three? DEEPTI SRIVASTAVA: Yes,
we have three wrenches on three different sites. And I think the takeaway--
because we are highly available and replicated. So the thing for me
was-- because for me this has also been a journey. And building teams to have
that sort of rigor and mindset when we're doing infrastructure
that we have to have, like we make no
compromises, we don't cut corners, and the processes around that stuff-- like, they change, right? The process that works with five
people, which is nudging people to say, hey, fix this bug,
versus 50 people versus more than 50 people is different. And so it's really been
an exciting journey. And Andrew is--
and I am as well-- very excited about the culture
and the team cohesion and all those things. And they're real things. You really have to worry about
culture and respect and all that stuff as you grow people. And of course, in case
that wasn't obvious, we actually have Spanner people
distributed in multiple areas, too. ANDREW FIKES: We have
people in New York, the Seattle-Kirkland area. A good portion of
our operations staff is actually in Sydney,
where I think they're just starting to wake up. DEEPTI SRIVASTAVA:
In Boston now. ANDREW FIKES: And Boston. And managing a
large team, finding good chunks of work for them
is definitely a challenge. DEEPTI SRIVASTAVA: Yeah. OK, so let's go into the more
fun parts, which is, let's opinionate. So let's touch on
a few things that I know customers
have talked to me about, and I know that you're
also passionate about. So the first of them is the CAP theorem. So does Spanner break
the CAP theorem, Andrew? ANDREW FIKES: This
is a great question. For years, I hated
the CAP theorem. Then I started sharing an
office with Eric Brewer, and he turns out to be
an absolutely nice guy, so it's great. And I think, as I talk
it over with Eric-- DEEPTI SRIVASTAVA: Eric is
the author of the CAP theorem, by the way. Also wrote a paper
on how Spanner doesn't break the CAP theorem. ANDREW FIKES: As you talk
things through with Eric, his real point here is
the CAP theorem was really designed to help you think. And it's really also
designed for you to think about the extremes. So it's really about
100% situations. And in real life,
nothing is 100%. And so when you
really look-- earlier, when we talked about things
like the building blocks that we have, our network, file
systems, and various other sorts of things-- we're able to take things
like partitions, which might be really common in
most people's networks or most people's WANs, and
say, hey, can we think about these
things differently? Can we actually look at
the data and understand how often partitions
actually impact things? When a partition
happens, for example, does it also impact
the end application? Would it actually
impact a quorum? And so as you start
to look at these both from a mathematical
perspective, the probabilities, but also at the on-the-ground data, what you find
are actually much, much less likely than the
other sources of error in your system. And this is one of
the arguments we've made about TrueTime-- that its failures are less likely than CPU failures-- and we can
sort of make it here. It turns out that most sources
of unavailability in our system have things to do
with like users. I don't know how
many map reduces have been run in the world that
have taken down a system they weren't supposed to. They have things to do with
our SRE teams missing a comma, turning it into a period,
something like that. So it's those sources-- and
operator error is actually much, much lower
than user error-- those sorts of events happen
much more than partitions. And so once you
basically say, OK, let's assume partitions
are not going to be the source of
the unavailability, can you build a
system that actually has really good availability
and gives you consistency? And so that's sort
of where we land. DEEPTI SRIVASTAVA:
Yeah, so I think we trade off partition tolerance
for availability mostly. Cool. So the other
interesting thing here is, Spanner evolved from a
NoSQL, Bigtable world into a fully SQL database. And so how have your thoughts on NoSQL to NewSQL evolved? ANDREW FIKES: Yeah,
so many years ago, I was invited to give
a talk about NoSQL. I knew nothing about
NoSQL, so I went and read Wikipedia like a good person. And it said, a series
of distributed systems created from the
lineage of Bigtable. And I said, oh, OK,
I know Bigtable. Maybe I do know
something about NoSQL. Typically when people
talk about NoSQL, they're talking about a
horizontally scalable key value system. Typically, they also talk
about some properties around eventual consistency. I was, for years, a huge
fan of eventual consistency. I built Bigtable's first
replication system. It was eventually consistent. I was totally
convinced that anybody who wouldn't want an eventually
consistent system was wrong. I'm here to tell you I was
wrong by leaps and bounds. DEEPTI SRIVASTAVA:
We got it on tape. ANDREW FIKES: Yes. What you sort of learn over time
with an eventually consistent system is, you actually start to work with users. And you see users try
to build applications on top of an eventually
consistent system. Yes, it's kind of fun. We get to use that
part of our brain that creates a complicated
algorithm that only works under these certain
situations, and maybe the user sees it this way, or maybe the
user doesn't see it that way. But at the end of day, it's
just not really any fun. And our goal is
to build products. Our goal is not to build
complicated algorithms that fix little edge cases. We want to get a product
in front of our customers. We want to see that product
quickly and get feedback on it. And consistent systems
give you that property. I sort of had the same
idea around transactions. The very first
versions of Bigtable, for example, have
a complicated split and merge protocol in them. We kind of knew
what we were doing. We hand-rolled our
own transactions. I think we did it six times,
because we got it wrong time after time after time. Having primitives
around consistency in transactions
as building blocks are incredibly productive. They let you sort of leverage
those things onto other things. I think the other
thing that you find in this kind of NoSQL
versus NewSQL debate is really this idea of SQL. SQL's great. I think it's a
perfectly fine language. Tuples are great. I think types are
incredibly useful. We've seen systems on
the key value space adopt types more and more. As we start to
think about moving compute closer to the
data, which is a trend-- something databases have been doing for years-- you really need a
language to express that. You need to understand types
so you can drive it down to the processor level. And so SQLs a great language. I think where SQL tends to
get a bad rap is it's often associated with vertically
scaled systems, things that only fit in a box. Or they have challenges
around multi-tenancy. I think Spanner does
a much better job on the horizontal
scaling aspects, so that gives us some power. We've also always run in a
multi-tenant environment. Google has tens and twenties
and hundreds of thousands of databases internally,
all competing for the same resources,
so multi-tenancy has been a core component
of what we do for years. And so those sorts of
properties of SQL systems, we can look at them
slightly differently. DEEPTI SRIVASTAVA: Totally. So we talk about consistency,
we talked about SQL, NoSQL. How do you think it
kind of comes together to power Google's
cloud offerings? ANDREW FIKES: Yeah. Spanner is a workhorse. We use it in many, many
different ways at Google. We use it both for
things of scale, so things that are
petabytes that look kind of like traditional
indexing and batch workloads. We use it for very critical
high availability loads. I was just at a talk
right before this where they were doing performance benchmarking around VM boot-up times, and talking about the complexity of
programming the control plane. That control plane
is done in Spanner. So our ability to
spin things up quickly and have them highly available
is dependent on Spanner. We see it in sort of the
SaaS application space, where you might take something
like Gmail where it takes a whole bunch
of users, maps them into a single database, and
gives them power that way. And we also see it in very
traditional database workloads. We use it for some of our
capacity planning for example, our capacity delivery systems. And so we really
have a workhorse of a system that really handles everything from the very big to the very small throughout our system. DEEPTI SRIVASTAVA: We actually
have a lot of, a lot, a lot, a lot of small use
cases that are all using Spanner for its
manageability and scale insurance and all that stuff. ANDREW FIKES: It's also
probably a good time-- I know Deepti
mentioned it earlier-- which is tomorrow my colleague
Dennis from the GCS team, which is our Google
Cloud Storage, is going to present how GCS
uses Spanner for its metadata. So our really
large scale systems are actually built with
Spanner as their backbone. DEEPTI SRIVASTAVA: Yeah, that's
going to be Spanner Internals Part 2 tomorrow morning. So come watch us. But yeah, it's cool to see both
the GCSs and the Gmails as well as the Drives and a bunch of
other small internal systems, like supply chain
management, use Spanner. All right, so these are
some of the headlines for when we launched Spanner. There are more controversial
ones that I didn't put here. But I know both you and I have
thoughts on why we launched it externally, so why don't you-- ANDREW FIKES: Yeah, I think
one of the great things about why I'm excited about
Cloud is its new workloads. I mentioned why I get
excited about infrastructure. It's because there's
new challenges, new workloads, new customers. I was very excited to
see, externally, examples of workloads that I saw internally. I think Spanner's a great
building block for a lot of applications, being able
to get that out to you all, and have you work it into your
systems was a big win for me. DEEPTI SRIVASTAVA: Yeah, I think
it was a while before we could convince ourselves
that we wanted to be a public product,
primarily because we didn't think that externally people had
the same challenges that we had internally. And I remember doing a
whole initial program of going out there and
talking to customers to see whether this would
be a useful system for them. Because at the time-- this
was like three years ago, which is dog years
in Cloud world-- when we were trying to
launch Spanner, we were cognizant
that people don't want just another cool tech
because Google built it. Especially on the
databases side, people want something that
solves their problems, and is this something
that is going to solve their problem or not? So we actually wanted to
be cognizant of user issues and see if this could solve our
user and customer pain points. And after surveying
customers, we found that customers were coming
to this with a new internet age of everything
connected, everything always on, everything online. Customers wanted that
kind of scale insurance with horizontal scalability
and strong consistency and relational semantics. And so it was like a really good
time to be in the public eye. So thank you so much
for your time again. [MUSIC PLAYING]