Spanner Internals Part 1: What Makes Spanner Tick? (Cloud Next '19)

Captions
[MUSIC PLAYING]
DEEPTI SRIVASTAVA: Let's get started. So I am very excited to be here today to talk to you all with Andrew Fikes, who is a VP and Engineering Fellow at Google. And I'll let him introduce himself, and then I'll go and talk to you about who I am.
ANDREW FIKES: OK. Well first off, welcome. I'm excited to see everyone here today. I was promised an intimate talk in an intimate setting with just a small number of people. This is amazing. This is really exciting to me. I've worked in the distributed database space for a long time, and seeing this many people that are as excited about storage as I am-- which is a hard thing to do. I'm super geeky about it, so I'm excited you're here. First off, just a little bit about myself: I've been at Google 18 years. This is the only job I've ever had. I came right out of school and went into Google, so I've kind of grown up in the Google culture. I've grown up sort of learning things from the ground up. I was the tech lead of the Bigtable team. We have our external product, Cloud Bigtable, which I worked on for many, many years. I was tech lead on the Spanner team through most of its initial development, and for the last three years, I've served in a role we call Area Tech Lead, which covers, in my case, both storage and databases. And I cover basically a tech lead role for everything from the disks at the bottom, to a little bit of the networking in between, all the way to the distributed databases at the top. So I know quite a bit about the distributed database space, and I know a little bit about everything else. So hopefully, I'll be able to answer a bunch of questions that you'll have today and give you a little perspective about engineering at Google.
DEEPTI SRIVASTAVA: Thank you, Andrew. My name is Deepti Srivastava. I am the product manager for Cloud Spanner. My entire tenure at Google has been in Spanner, so it's a few more years than three. Before that, I was at Oracle working in the RAC database kernel. And before that I was researching distributed systems. And I actually never thought that databases were cool, because that's 30-year-old technology, right? Who cares?
ANDREW FIKES: I had the same problem. I never took a database class in college, because I never wanted to work on databases.
DEEPTI SRIVASTAVA: Yeah, and here we are.
ANDREW FIKES: And here we are.
DEEPTI SRIVASTAVA: And it's been a very exciting journey. I worked with Andrew-- and the Chrisses here; there are some Chrisses here too-- to launch Spanner internally as a service, and Spanner has been running internally as a service managing a bunch of critical infrastructure for over seven years. And it was my Noogler project to launch it, so it's very exciting for me personally to talk about the history and journey of Spanner with Andrew. Before we go further, there is a mic here. Unfortunately, we have only one mic. So as you have questions, if you could maybe make your way there; and if you want to shout it out from where you are, we'll repeat your question without changing it and answer. So let's get started. OK, so first, I get to do my "What is Cloud Spanner?" So for those who don't know, Cloud Spanner is one of our database offerings on Google Cloud Platform. It is basically the same infrastructure as internal Spanner, and it's exposed via the Cloud API, so that's the difference. Sort of the access path is different.
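To make that "access path" concrete, here is a minimal sketch of what talking to Cloud Spanner through the Cloud API can look like, using the Python client library (google-cloud-spanner). The project, instance, database, and table names are hypothetical, and error handling is omitted.

```python
# Minimal sketch: accessing Cloud Spanner through the Cloud API via the Python
# client library. Project, instance, database, and table names are hypothetical.
from google.cloud import spanner

client = spanner.Client(project="my-project")
instance = client.instance("my-instance")
database = instance.database("my-database")

# A strongly consistent read expressed in SQL.
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"):
        print(row)

# A read-write (ACID) transaction; the library retries the function on aborts.
def credit_account(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 50 WHERE AccountId = 1"
    )

database.run_in_transaction(credit_account)
```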
But it is essentially a differentiated service on Google Cloud, because it combines the benefits of relational semantics with non-relational or NoSQL horizontal scale. So we call it our no-compromises database, in that you don't have to compromise between schemas and SQL and strong consistency and ACID transactions and the global scale that you get with Spanner. So today's talk is really about how this awesome-- I love this product. I call myself a database person now through my journey with Spanner-- so how did this system come about, both the technical history as well as some of the people history around it. And this is what we offer essentially externally on the Google Cloud Platform. So Andrew, what came before Spanner?
ANDREW FIKES: So when I start to tell the Spanner story, I think it's sort of important to think about where Google was at that time-- so essentially where we were in about 2007, 2008 as a company. Google historically started as a search company and over time grew to be an ads company, a mobile company, a geo company-- I'm probably forgetting a few different things like that-- and eventually a cloud company. But around that 2008 time, we really had kind of an interesting set of infrastructure internally. We were a large MySQL database shop on our ads side of the world. We ran about 90 database shards and had kind of a home-built query system that worked across them. So we were dealing with large scale distributed partitioned MySQL instances. We were also very much a Bigtable shop at that time. So we started Bigtable, I want to say 2003, 2004, somewhere in there. And it was really focused on the search landscape. We wanted to build a Bigtable that had as its primary key the URL of a web page, and then its columns would be the actual documents and PageRank and other sorts of things in the other columns. So we had a MySQL shop and we had a Bigtable shop. We were also, however, starting to become an apps company. So we all know and love and use G Suite today, but there was a time when Gmail was kind of a smaller product. And so as Google was starting to become a calendar company, a--
DEEPTI SRIVASTAVA: Drive.
ANDREW FIKES: --Drive company and these sorts of things, we were starting to get these needs of more complex applications being built on top of our infrastructure. And so really that was kind of the ecosystem we had at that time.
DEEPTI SRIVASTAVA: Yeah. Do you want to touch upon Megastore at all?
ANDREW FIKES: Right. So Megastore was kind of our first foray into something that looked close to Spanner. And it was an application library built on top of Bigtable that actually did use Paxos, but provided a much smaller unit of data partitioning, what we call an entity group. And it was rolled out as a client library. So you had to link it into your client, and you had to deal with client semantics on top of Bigtable.
DEEPTI SRIVASTAVA: Yeah, so there's a paper on Bigtable, if you care. There's a paper on Megastore, if you care. And as these products evolved, we can touch upon that journey. And if you thought that we weren't going to get technical, like, we have gotten technical. So what prompted the need to build Spanner, given we had the awesome Bigtable that engendered a bunch of other NoSQL systems outside, and we had Megastore, which was almost Spanner?
ANDREW FIKES: Yeah, I think the most obvious thing was scale, and scale with these applications with things like Gmail, where we were starting to think about, OK, Gmail is in three data centers, for example.
Well, what happens if Gmail gets too big for a data center? How do we split it up? How do we manage the geographies that are associated with users? And we had a large user base on these applications that was developing in Europe and in other countries, and we didn't want them to have the same latency semantics as the people in the US. So we were really thinking about this sharding concept, and we were really interested in that geo concept. We were also starting to really think about the importance of consistency and what it actually means to have a consistent system and how it impacts our developers and our users and our productivity and velocity. And so that was sort of coming around in there. Interestingly enough, right around this time, we decided that 90 shards was enough MySQL shards, that we had to reshard it, and every resharding took a couple of years. And we kind of thought, if we were going to do it again, maybe we should think about whether or not we should do something different. So we really had the growth of the apps and this interest in consistent replication, transactions, and those sorts of things, because users actually do notice inconsistencies. And we also had this growth on the ads, sort of more financial database side, to just really be able to scale to the enormous business that Google was developing.
DEEPTI SRIVASTAVA: Yeah. Just to give you a sense-- so Andrew's being very modest. Like, the ads database is how Google used to make money, so it was really important that we didn't spend two-plus years every time we had to reshard. And I don't know how many people here remember when Gmail was like, you could get five friends? Like, if you had a Gmail account, then you could get five friends to get a Gmail account. So that scale of going from, whatever, some number of users to 7 billion accounts or whatever it is today-- that's truly the scale and the time that we're talking about here. So before we go into that stuff deeper, did you get it right the first time?
ANDREW FIKES: Right. So one thing I hear a lot is, how did you come up with this great idea? Like, you published these papers, they're sort of seminal. Like, what's the process? And I think what people miss a lot externally is, we spent years getting it wrong. So we actually started thinking about Spanner and split off a chunk of people from Bigtable to work on it. And I think we built-- and Chris can probably attest-- three or four different systems over two and a half or three years before we landed on the one that we ended up with. We had some horrible mistakes. One version of Spanner actually supported recursive directories, and we had RPC operations that could do recursive descents on directories. And as you can kind of see, that's really, really far away from the relational model. The good news is that when the ads team came to us, we sort of got our hands on a real customer that we were trying to work with. And when people ask me, how do you decide to build infrastructure, I think it starts with a really good customer. And for us, that really good customer was the ads database.
DEEPTI SRIVASTAVA: Yeah, customers first. So let's get into a little bit of the details and talk about how Spanner is built and what the building blocks are, if you could talk to us about that. We have an awesome picture here that you can talk about, because I think that picture's super cool.
ANDREW FIKES: Wow. So one of the great things about building infrastructure at Google is you start with a really great set of resources.
So this is a really good example of what it's like to build infrastructure in our cloud. These basically show our network lines. One of the gifts that we have as engineers is that our internal networks within the data center are phenomenally good. I haven't really worried about network performance or network loss for a really long time. And as a storage person, that means you can take a whole set of things and just set them aside, and you can think about the algorithms and things that lay on top. We get that same experience across large spans as well. We have three redundant lines out of data centers. We have extremely high availability. Being able to depend on that level of reliability from your network really simplifies a lot of your distributed system building.
DEEPTI SRIVASTAVA: I know it says it there, but anyway, the blue dots are basically where GCP, or Google Cloud, has a presence externally; the numbers indicate how many availability zones there are; and they're all connected via Google fiber, essentially. So to Andrew's point, when customers ask me, well, what will the latency be like for this globally distributed massive thing, I'm like, not very much. They're like, how do you go over WAN? And we're like, we don't have to go over the public internet for the most part. So that's super awesome. I wonder if you want to touch upon Colossus before we go to the next thing.
ANDREW FIKES: Yeah. So I think the other area we've always really depended on is, we do have a very large scale distributed file system that sits as the basis of all of our distributed storage systems. We published a paper on GFS, which was the very first version of that, I think probably in 2002, 2003. The one we use today is actually called Colossus. And it basically occupies many, many megawatts of power and a huge footprint of space. And we really depend on that for all of our write operations. And so that means the durability of the data, for example, is something that we can delegate down. It also means that we're in an environment where we can disaggregate compute from storage. It gives us the ability to respond to hotspots on one server by indirectly picking that piece of data up and moving it to another server while leaving the storage beneath in place. Fun little fact-- Colossus is actually built recursively on top of Bigtable. So this is a place where we're reusing our infrastructure in order to be able to scale it out.
DEEPTI SRIVASTAVA: The amount of virtualization that's happening in our infrastructure stack is pretty astounding. Because again, customers ask me, what happens when a disk fails? And I'm like, I haven't thought about a disk failure in I don't know how many years.
ANDREW FIKES: Yeah. When I spoke earlier about having visibility into everything from the disk layer on up-- Google actually builds its own disk appliances. And we spend an enormous amount of time thinking about the availability of a disk appliance, the cost of a disk appliance, how fast we can execute RPCs against these things. We continually improve the efficiency and effectiveness of those disk appliances, and we deploy them at massive scale. And being able to have that incoming pipeline of people working on even just the raw components that you're building these systems on-- it's an absolute luxury.
DEEPTI SRIVASTAVA: Yeah, I think we're a hardware company too, as much as a data center business. So this is the other part that we basically win on, which is TrueTime. So walk us through this.
ANDREW FIKES: Yeah, so I couldn't be more excited to show you this most boring rack. I think one of the most interesting comments I usually get when people read the Spanner paper is, they're like, wow, atomic clocks. And then you're like, yes, you can get them on eBay. They're not as expensive as you think they are. This is actually our latest generation, so this is rolling out into the fleet now. The TrueTime team is a really small group. This basically shows a fairly standard configuration that we have within a data center. There are four time servers in here. Two are connected out to the GPS antennas that you see. This provides a GPS-calibrated signal. The two remaining servers are actually connected to two atomic oscillators. There are two different brands of oscillators here, so we get different failure behaviors, actually, and the same with the GPS cards as well. And this is connected into our fabrics. And what this does is it provides a really accurate time signal with bounds. And this is sort of the principle upon which Spanner is built, this idea of not just saying a point in time, but an interval in time. And what can we do with a very highly reliable, accurate interval of time? That's what TrueTime provides. And I can't speak exactly to where we are now, but it's clearly less than a millisecond of drift. And there you go.
DEEPTI SRIVASTAVA: Yeah. This is one of the cornerstones that allows us to be the differentiated technology in terms of external consistency. And we'll get into that, but I think it's important to understand that it's not just about atomic clocks and GPS, but about the hard work that goes into building these systems and the infrastructure around them. OK. So let's quickly touch upon, given those building blocks, how does Spanner serve a request? What is the life of a read and a write? I am very excited to say that Andrew actually drew this by hand. He's good with crayons.
ANDREW FIKES: Woohoo!
[APPLAUSE]
DEEPTI SRIVASTAVA: And his writing is remarkably good, right?
ANDREW FIKES: I went to an engineering school. So yeah, my kids actually helped me with this. It was great. We did it this weekend. It turns out that even an eight-year-old can design a distributed system. Hers didn't look quite as good as mine, but we will take it. So we do get a lot of questions about, how does Spanner work, what exactly happens when I issue a read or a write? So typically what I like to walk people through is the life of a read and the life of a write. At kind of a high level-- you can see it here-- I think it's easier to start with the life of a write. We have a client. These are typically in a VM, although they could be other places. They go to our load balancers. And from there, they go to a front end, which sends them on to a span server. A span server is what actually stores-- what would be a good description of this-- slices of your data. So we refer to them as Paxos groups. These are basically data containers that are spread across multiple zones for availability. Commits have to go to two out of three, or three out of five, or some number that forms a quorum. As you can see here, the span server in Zone A will take the write that it gets, and it will forward it on to Zone B and Zone C. It will wait until two total come back-- its local one and one of the others. And then it'll return to the client, and that's the point at which the write becomes durable.
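For readers who want to see the shape of that write path in code, here is a deliberately simplified, hypothetical sketch (not Spanner's actual implementation): the leader of a Paxos group picks a TrueTime-style commit timestamp, replicates the log entry to the replicas in the other zones, waits for a majority quorum, and performs a commit wait before acknowledging the client. The EPSILON_SECONDS bound, the truetime_now() helper, and the replica append_log() method are assumptions for illustration, loosely modeled on the published Spanner paper.

```python
# Hypothetical sketch of a quorum write in a Paxos group spread across zones.
# Not Spanner's real code; names and structure are illustrative only.
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

EPSILON_SECONDS = 0.001  # assumed TrueTime uncertainty bound (talk: under a millisecond)

def truetime_now():
    """Return an (earliest, latest) interval rather than a single instant."""
    t = time.time()
    return (t - EPSILON_SECONDS, t + EPSILON_SECONDS)

class PaxosGroupLeader:
    """Leader replica of one Paxos group whose replicas live in different zones."""

    def __init__(self, local_replica, remote_replicas):
        self.replicas = [local_replica] + list(remote_replicas)  # e.g. zones A, B, C
        self.quorum = len(self.replicas) // 2 + 1                 # 2 of 3, 3 of 5, ...

    def write(self, key, value):
        # Pick a commit timestamp no earlier than any clock reading TrueTime allows.
        _, commit_ts = truetime_now()

        # Replicate the log entry to all replicas in parallel (local one included).
        pool = ThreadPoolExecutor(max_workers=len(self.replicas))
        futures = [pool.submit(r.append_log, key, value, commit_ts)
                   for r in self.replicas]

        # Wait only until a quorum has acknowledged; stragglers finish in the background.
        acked, pending = 0, set(futures)
        while acked < self.quorum and pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            acked += sum(1 for f in done if not f.exception())
        pool.shutdown(wait=False)

        if acked < self.quorum:
            raise RuntimeError("no quorum reached; the write is not durable")

        # Commit wait: don't acknowledge until commit_ts is guaranteed to be in the
        # past everywhere, so any later transaction gets a strictly larger timestamp.
        earliest, _ = truetime_now()
        while earliest < commit_ts:
            time.sleep(commit_ts - earliest)
            earliest, _ = truetime_now()

        # Only now is the write durable and acknowledged back to the client.
        return commit_ts
```

The commit wait at the end is what makes the timestamps usable for ordering: by the time the client hears about the write, its timestamp is already in the past for every observer.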
You can see referenced here Colossus, which I mentioned earlier-- this is our big distributed file system. Span servers, we run tons of them. Internally, Spanner is heavily, heavily used, so we're talking about thousands to tens of thousands of these running in the data center. Your data resides on a fraction of those.
DEEPTI SRIVASTAVA: Before you move on to the reads, really quickly, just to level set, because we live and breathe this. So Spanner is a synchronously replicated database. What that means is that we have full copies of the database data in at least three, if not more, availability zones. The copies of the data are in Colossus, which is the distributed file system that Andrew talked about. And because we have this disaggregated storage and compute, the span servers here are the compute units that act on the data. So the questions around, well, what happens if a server goes down, do you have to reload all the data in memory on a different server, are kind of orthogonal, because the data is always in the storage layer. And all these compute units, or span servers, are doing is being responsible for serving the data. And so when Andrew talks about a client sending a write, what they're doing is, they're saying: write this key with value x. And that means that there's an operation that needs to happen in one of the compute units. This is really important: any one of the compute units can pick it up. We use Paxos to figure out who the leader is. We don't have a concept of a master. And there is a paper out on Paxos. Essentially, it's a group consensus protocol that gives you at-most-once semantics for a given point in time or a given operation. Does that summarize it OK? I just wanted to make sure we understand what we're talking about.
ANDREW FIKES: Yeah. A few other things that are happening here: during writes, we do use TrueTime to pick a timestamp. Those timestamps are used to create an ordering, which we talk about as being externally consistent. That means that, if you do two successive operations, they will get increasing timestamps. So an operation that happens after another operation by the concept of time will actually happen that way. I think the other thing that's interesting from this diagram is really to think about, where does high availability come from? And high availability really comes from the fact that we are replicating data. So for example, if Zone C falls off the face of the map-- earthquake, California, that sort of thing-- we can still continue to issue writes to Zone A or Zone B. If we have a crashing bug in one of the span servers, for example, the other ones can pick it up. And so you get high availability by taking these components, their blast radiuses, and what happens when they fail, and making sure that you have other things to pick them up.
DEEPTI SRIVASTAVA: Yeah. Do you want to walk through the read?
ANDREW FIKES: Yeah. A read is also very simple. I drew it here showing that the reads don't necessarily have to go to the leaders in order to be strongly consistent. So a client might do a read. He might go to one of the other replicas of the Paxos group. Now, when he gets there, he basically picks a timestamp, which is what we call now plus epsilon, which is typically a time at which we know no writes that we could have seen were committed. That front end will basically say, OK, pass it on to a span server. The span server will say, OK, do I have all of the data up to that time?
And if enough time has passed-- because we're talking order milliseconds, and writes may be going on-- he will actually just return the data then. If he doesn't have the data, he'll reach out to the leader and say, dude, I hear you have some data for me. And the leader will say, yes, here you go-- either, here's the data you need, or, you actually have all the data, I just haven't told you that you have all the data. And in that case, the read will be performed locally.
DEEPTI SRIVASTAVA: Yeah. So I think the point here that is different from other systems is that we are, in almost all cases, serving the data from the local replica. There are other systems that ship buffers around when they replicate. We're not shipping buffers around. What we're saying is, hey, what's the latest timestamp that I can serve data at? Do I have it or not? And if the answer is yes, you have it, then you go serve it, or you wait for the data to be shipped as part of the Paxos protocol separately. So you're not actually shipping megabytes and megabytes of buffers around.
ANDREW FIKES: Yeah. And just to give you an order on the timing here, you could imagine that Zones A, B, and C are actually all within a region. In that case, your writes are on the order of five to 10 milliseconds, because that's what it takes to create a quorum across those data centers. A read would be much, much faster, because you're only reaching out to one data center. So once we get through the load balancer and everything like that, I think it's in the three to five millisecond range from external cloud.
DEEPTI SRIVASTAVA: Yeah. All right, so let's talk about some of the aha moments where we-- well, you at least-- chose to become a database person from a distributed systems person.
ANDREW FIKES: Yeah, so I think people say, well, what's great about Spanner, what should I take away? I think there are two things that I typically reference. One is that, as a distributed systems engineer, you're growing up, and everybody tells you you cannot trust time. The clock on your server is absolutely something you can't trust, right? And you develop sort of this innate fear of that thing. It's like legendary. And you also probably experience some really bad things when you decide to trust time anyway. Having the ability to actually trust time-- and we are now 10 years later, and I can tell you that I completely trust our time system-- has made my life so much easier. Being able to basically have a source of global ordering that I don't actually have to reach out and talk to, that I can reference locally, is a huge, huge, huge shift in thinking. And it took us actually quite a long time to get over that. When I said we did two or three versions of this system, some of the earlier ones actually worked more on logical clock passing, or figuring out how to weave these things through. And weaving timestamps through your system actually has some really bad effects, usually, on your client. You're like, OK, well, this person can give me a timestamp. Now I can take this timestamp and give it to someone else. That means that the application actually has to see it. And so being able to trust time and use that actually has really good upper-level API effects as well. The second sort of aha moment was that we spent a lot of time not implementing transactions in Bigtable, for a variety of reasons.
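Returning to the read path described above, here is a matching hypothetical sketch of the check a non-leader replica can perform: pick a read timestamp of "now plus epsilon," serve locally if the replica has applied everything up to that timestamp, and otherwise ask the leader to catch it up. The safe_time field and the wait_until_safe() and read_at() calls are illustrative assumptions, not Spanner's real interfaces.

```python
# Hypothetical sketch of a strong read served by a (possibly non-leader) replica.
# Names and structure are illustrative only.
import time

EPSILON_SECONDS = 0.001  # assumed TrueTime uncertainty bound, as above

class ReplicaSpanServer:
    """A replica serving strong reads from its local copy of the data."""

    def __init__(self, group_leader, local_storage):
        self.leader = group_leader     # leader of this Paxos group
        self.storage = local_storage   # local view of the data, backed by Colossus
        self.safe_time = 0.0           # advanced as committed Paxos entries are applied

    def strong_read(self, key):
        # Pick a read timestamp ("now plus epsilon") past any write we could have seen.
        read_ts = time.time() + EPSILON_SECONDS

        if self.safe_time < read_ts:
            # Not provably caught up: ask the leader. It either ships the missing
            # log entries or confirms we already have everything up to read_ts.
            self.safe_time = self.leader.wait_until_safe(read_ts)

        # Serve the read locally -- no data buffers are shipped from other replicas.
        return self.storage.read_at(key, read_ts)
```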
And the main reason is that the availability of any one of our Bigtable tablets was not sufficient to support transactions. And so what Paxos groups bring us is highly available participants. So the other aha moment in here is, you can do a lot with highly available participants. You can trust them to be coordinators. You can trust them to hold locks. You can trust them to do things like that. So it's really the combination of both trusting time and having highly available components upon which to build transactions that makes Spanner unique.
DEEPTI SRIVASTAVA: Yeah, I remember gettimeofday() was an evil thing and you never put it in your code. But TrueTime along with Paxos gives us that external consistency of-- and I love saying this to customers, because they're like, I don't believe you. But you can write from anywhere in the world into a Spanner database, and you can literally say, I wrote-- like, I incremented my bank account with $50-- and anywhere in the world, somebody can read at that timestamp and actually get an answer for, it was plus 50, which is an amazing thing to--
ANDREW FIKES: Have. Yeah, $50?
DEEPTI SRIVASTAVA: Yes, it's better than 0. OK, so let's move on to a little bit more of the nostalgia piece, which I find exciting. We talked about scaling systems, but you also have to scale teams.
ANDREW FIKES: Yeah, I think every good project-- and I talked to some people earlier with sort of small teams-- Spanner, of course, started with a small team. It started with four people in an office. I think this picture is from 2012. It shows about five or six people. We look heavily engaged in some particular event, probably a server going down or us debugging something. It's kind of fun to look back. Chris is down here in the front. He looks much younger in that photo.
DEEPTI SRIVASTAVA: There were fewer critical systems running on Spanner at the time. And I also want you to talk about this one. It's super awesome.
ANDREW FIKES: Yeah, so this was my cube mate. This is Mike. He's a fan of ASCII art. These were our first 100 users. So every now and then you see a $20 bill posted on the front of a deli or something like that. This was our screen as we watched internal users adopt Spanner. And when we hit 100 on our lovely VT320 here, we had a little party.
DEEPTI SRIVASTAVA: I was actually there for this one. So I want to have people look at this, and then this.
ANDREW FIKES: Yeah, so a couple years later, we're getting a little bigger. We added a few fish into the mix, as any good team has. And a few of them, I think, are still kicking.
DEEPTI SRIVASTAVA: Andrew and I commissioned this logo, because, of course, we were a service, and we needed a product. And we needed a logo, so we had this logo. And if you don't understand what that is, it's obviously a spanner turning Google.
ANDREW FIKES: It's a spanner.
DEEPTI SRIVASTAVA: And then--
ANDREW FIKES: Yeah, so this is actually a couple of years ago at this point. I think the key thing to take away from this is that building a distributed system that is this highly available and actually has all of the features that you all know and love is actually a relatively complex problem. The other thing that I've really grown to appreciate is that database problems and systems problems are actually full-stack problems. And so the amount of complexity that is in these systems now is substantially more.
When we built GFS in the early days, which was our first distributed file system, we almost decided not to do it, because we thought it was too complex and nothing that complex would ever work. It turns out it's the simplest thing we've ever built, and Spanner has many more layers of complexity on it. I think the other thing you can see in this picture is this enormous giant wrench in the background. That actually is used for trains. So if you ever need to change a train wheel, we have a wrench for you.
DEEPTI SRIVASTAVA: Yeah, we actually have three of them, because we don't--
ANDREW FIKES: We have three?
DEEPTI SRIVASTAVA: Yes, we have three wrenches on three different sites. And I think the takeaway-- because we are highly available and replicated. So the thing for me was-- because for me this has also been a journey. Building teams to have the sort of rigor and mindset that we have to have when we're doing infrastructure-- like, we make no compromises, we don't cut corners-- and the processes around that stuff, it changes, right? The process that works with five people, which is nudging people to say, hey, fix this bug, versus 50 people, versus more than 50 people, is different. And so it's really been an exciting journey. And Andrew is-- and I am as well-- very excited about the culture and the team cohesion and all those things. And they're real things. You really have to worry about culture and respect and all that stuff as you grow people. And of course, in case that wasn't obvious, we actually have Spanner people distributed in multiple areas, too.
ANDREW FIKES: We have people in New York, the Seattle-Kirkland area. A good portion of our operations staff is actually in Sydney, where I think they're just starting to wake up.
DEEPTI SRIVASTAVA: In Boston now.
ANDREW FIKES: And Boston. And managing a large team, finding good chunks of work for them, is definitely a challenge.
DEEPTI SRIVASTAVA: Yeah. OK, so let's go into the more fun parts, which is, let's opinionate. So let's touch on a few things that I know customers have talked to me about, and that I know you're also passionate about. So the first of them is the CAP theorem. So does Spanner break the CAP theorem, Andrew?
ANDREW FIKES: This is a great question. For years, I hated the CAP theorem. Then I started sharing an office with Eric Brewer, and he turns out to be an absolutely nice guy, so it's great. And I think, as I talk it over with Eric--
DEEPTI SRIVASTAVA: Eric is the author of the CAP theorem, by the way. He also wrote a paper on how Spanner doesn't break the CAP theorem.
ANDREW FIKES: As you talk things through with Eric, his real point here is the CAP theorem was really designed to help you think. And it's really also designed for you to think about the extremes. So it's really about 100% situations. And in real life, nothing is 100%. And so when you really look-- earlier, when we talked about things like the building blocks that we have, our network, file systems, and various other sorts of things-- we're able to take things like partitions, which might be really common in most people's networks or most people's WANs, and say, hey, can we think about these things differently? Can we actually look at the data and understand how often partitions actually impact things? When a partition happens, for example, does it also impact the end application? Would it actually impact a quorum?
And so as you start to look at these both from a mathematical perspective-- the probabilities-- and from the on-the-ground data, what you find is that partitions are actually much, much less likely than the other sources of error in your system. And this is one of the arguments we've made about TrueTime-- that it fails less often than the CPUs it runs on-- and we can sort of make the same argument here. It turns out that most sources of unavailability in our system have to do with things like users. I don't know how many MapReduces have been run in the world that have taken down a system they weren't supposed to. They have to do with things like our SRE teams missing a comma, turning it into a period, something like that. So it's those sources-- and operator error is actually much, much lower than user error-- those sorts of events happen much more than partitions. And so once you basically say, OK, let's assume partitions are not going to be the source of the unavailability, can you build a system that actually has really good availability and gives you consistency? And so that's sort of where we land.
DEEPTI SRIVASTAVA: Yeah, so I think we trade off partition tolerance for availability, mostly. Cool. So the other interesting thing here is, Spanner evolved from a NoSQL, Bigtable world into a fully SQL database. And so how have your thoughts on NoSQL versus NewSQL evolved?
ANDREW FIKES: Yeah, so many years ago, I was invited to give a talk about NoSQL. I knew nothing about NoSQL, so I went and read Wikipedia like a good person. And it said, a series of distributed systems created from the lineage of Bigtable. And I said, oh, OK, I know Bigtable. Maybe I do know something about NoSQL. Typically when people talk about NoSQL, they're talking about a horizontally scalable key-value system. Typically, they also talk about some properties around eventual consistency. I was, for years, a huge fan of eventual consistency. I built Bigtable's first replication system. It was eventually consistent. I was totally convinced that anybody who wouldn't want an eventually consistent system was wrong. I'm here to tell you I was wrong by leaps and bounds.
DEEPTI SRIVASTAVA: We got it on tape.
ANDREW FIKES: Yes. What you sort of learn over time with an eventually consistent system is, you actually go and start to work with users. And you see users try to build applications on top of an eventually consistent system. Yes, it's kind of fun. We get to use that part of our brain that creates a complicated algorithm that only works under these certain situations, and maybe the user sees it this way, or maybe the user doesn't see it that way. But at the end of the day, it's just not really any fun. And our goal is to build products. Our goal is not to build complicated algorithms that fix little edge cases. We want to get a product in front of our customers. We want to see that product quickly and get feedback on it. And consistent systems give you that property. I sort of had the same idea around transactions. The very first versions of Bigtable, for example, had a complicated split and merge protocol in them. We kind of knew what we were doing. We hand-rolled our own transactions. I think we did it six times, because we got it wrong time after time after time. Having primitives around consistency and transactions as building blocks is incredibly productive. They let you sort of leverage those things into other things. I think the other thing that you find in this kind of NoSQL versus NewSQL debate is really this idea of SQL. SQL's great.
I think it's a perfectly fine language. Tuples are great. I think types are incredibly useful. We've seen systems in the key-value space adopt types more and more. As we start to think about moving compute closer to the data, which is a trend, and something databases have been doing for years, you really need a language to express that. You need to understand types so you can drive it down to the processor level. And so SQL's a great language. I think where SQL tends to get a bad rap is it's often associated with vertically scaled systems, things that only fit in a box, or that have challenges around multi-tenancy. I think Spanner does a much better job on the horizontal scaling aspects, so that gives us some power. We've also always run in a multi-tenant environment. Google has tens and twenties and hundreds of thousands of databases internally, all competing for the same resources, so multi-tenancy has been a core component of what we do for years. And so those sorts of properties of SQL systems, we can look at them slightly differently.
DEEPTI SRIVASTAVA: Totally. So we talked about consistency, we talked about SQL and NoSQL. How do you think it all comes together to power Google's cloud offerings?
ANDREW FIKES: Yeah. Spanner is a workhorse. We use it in many, many different ways at Google. We use it both for things of scale, so things that are petabytes and look kind of like traditional indexing and batch workloads. We use it for very critical high-availability loads. I was just at a talk right before this on performance benchmarking around VM boot-up times, which talked about the complexity of programming the control plane. That control plane is done in Spanner. So our ability to spin things up quickly and have them highly available is dependent on Spanner. We see it in sort of the SaaS application space, where you might take something like Gmail, where it takes a whole bunch of users, maps them into a single database, and gives them power that way. And we also see it in very traditional database workloads. We use it for some of our capacity planning, for example, our capacity delivery systems. And so we really have a workhorse of a system that handles everything from the very big to the very small throughout our systems.
DEEPTI SRIVASTAVA: We actually have a lot-- a lot, a lot, a lot-- of small use cases that are all using Spanner for its manageability and scale insurance and all that stuff.
ANDREW FIKES: It's also probably a good time to mention-- I know Deepti mentioned it earlier-- that tomorrow my colleague Dennis from the GCS team, which is our Google Cloud Storage, is going to present how GCS uses Spanner for its metadata. So our really large scale systems are actually built with Spanner as their backbone.
DEEPTI SRIVASTAVA: Yeah, that's going to be Spanner Internals Part 2 tomorrow morning. So come watch us. But yeah, it's cool to see both the GCSs and the Gmails, as well as the Drives and a bunch of other small internal systems, like supply chain management, use Spanner. All right, so these are some of the headlines from when we launched Spanner. There are more controversial ones that I didn't put here. But I know both you and I have thoughts on why we launched it externally, so why don't you--
ANDREW FIKES: Yeah, I think one of the great things about Cloud-- why I'm excited about it-- is new workloads. I mentioned why I get excited about infrastructure. It's because there are new challenges, new workloads, new customers.
I was very excited to see, externally, examples of workloads that I had seen internally. I think Spanner's a great building block for a lot of applications, and being able to get that out to you all and have you work it into your systems was a big win for me.
DEEPTI SRIVASTAVA: Yeah, I think it was a while before we could convince ourselves that we wanted to be a public product, primarily because we didn't think that externally people had the same challenges that we had internally. And I remember doing a whole initial program of going out there and talking to customers to see whether this would be a useful system for them. Because at the time-- this was like three years ago, which is dog years in Cloud world-- when we were trying to launch Spanner, we were cognizant that people don't want just another cool tech because Google built it. Especially on the database side, people want something that solves their problems-- so is this something that is going to solve their problem or not? So we actually wanted to be cognizant of user issues and see if this could solve our user and customer pain points. And after surveying customers, we found that customers were coming to this with a new internet age of everything connected, everything always on, everything online. Customers wanted that kind of scale insurance, with horizontal scalability and strong consistency and relational semantics. And so it was a really good time to be in the public eye. So thank you so much for your time again.
[MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 18,522
Rating: 4.9173555 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: nvlt0dA7rsQ
Length: 41min 42sec (2502 seconds)
Published: Thu Apr 11 2019