Building Software Systems At Google and Lessons Learned

Captions
Stanford University okay yeah me okay okay um welcome to I guess this is III eighty but it's also been sort of overridden with our distinguished lecture series today's speaker is Jeff Dean of Google and you know Jeff you know I don't want you to pause time and describing all his accomplishments but I'll say you did get his degree at University of Washington but since his adviser was a graduate of here we can claim him with a grandchild of our institution and we certainly do that and he initially landed after his degree at Dec research lab worked there for a while and then made the smart move to jump over to Google in the early days where he was an instrumental and building a lot of the systems that Google uses to make money basically their search ad language translation all these things and along the way he was involved with some very influential pieces of infrastructure including BigTable MapReduce and protocol buffers and so with that introduction I'll let Jeff describe them to you thank you metal um so welcome thank you for having me I will try to speak louder um so the plan for today is I'm going to connect my microphone I'm going to talk about the evolution of various different kinds of systems at Google one is how our computing hardware has evolved over about the last decade or so and another thing is about how our core web search and retrieval systems have evolved what we've done in order to scale those systems as traffic and index sizes have grown and so on I'll talk a little bit about some of the infrastructure software we've built that also underlies a bunch of Google products and then the last little bit of the talk I'm going to talk about some of the techniques that we've developed to build high-performance and reliable systems that have kind of cropped up in the process of building some of these systems and sort of general patterns that seem useful across a wide variety of kinds of systems this is joint work with a huge number of people at Google I've worked on all the systems I'm describing but there were many many people involved and I always appreciate my my colleagues so when I started working on web searching 1999 at Google these are some metrics that I think you can use to evaluate different dimensions of a retrieval system one is how many documents are indexing and that scale is increased by about a factor of a thousand from 1999 to today the second thing that is another important dimension is how many queries you have to handle on a given day and that's also grown by a factor of a thousand ok I have a quick experiment how many of you use Google regularly in 1999 now this may be a biased audience but how about 2002 how about now okay yeah well so our traffic is grown but I don't see a thousand more Hamsun I deserve another important thing that is uh yeah pretty vital for improving your ranking quality is often you want to keep additional information about around about each document in your index and use that to you drive more sophisticated ranking algorithms and the amount of information we keep in our index today is about three times as much as it was then per document one of the things when I was making this slide it was kind of surprising is that the metric that's growing the most or improved the most is actually our update latency in 1999 we were basically updating our index once a month if we were lucky and once every couple months if something horrible went wrong and now we have portions of our index we can update within a matter of seconds from calling the page 
and so that's a pretty substantial improvement the other important factor for users is how quickly do you get your responses I was measured at the server side not doesn't include client side network latency but basically you know we've had about a 5x improvement in there so the difficulty in kind of engineering or retrieval system is in some sense the product of all these things because you're dealing with larger indices you're trying to handle more queries with more information per document and you're trying to update it more often and you're trying to do it faster now one thing that's really helped us a lot is that we've been able to use more machines and faster machines since 1999 we've upgraded our hardware a bit and that's given us about a thousand x improvement in sort of computational the other kind of cool thing about working on search is that a lot of the stuff kind of happens behind the scenes and we don't necessarily change the user interface every time we change kind of how the guts of our search system work and so over the last 11 years we've rolled out about seven very significant revisions to how our search system work works and often these have been rolled out without users really realizing we've made major fundamental changes underneath the covers other than perhaps you know seeing a larger index or responses faster or things like that new UI kind of looks the same ok so I will start at the beginning so as you obviously know Larry and Sergey we're grad students here and we're doing research on how to build sort of search engines for web pages and using the link structure of the web in order to do interesting ranking experiments well apparently their advisors were very mean and would not buy them enough computers and so they apparently survived by going down to the loading dock and volunteering to set up other other research groups machines and then living off the float for a while where they would actually use the machines for a little while one problem of this approach is you end up with kind of a heterogeneous mix of Suns and IBM's and all kinds of crazy things rather than a nice homogeneous cluster which you might prefer from a software standpoint but the research product essentially looked like this you have a system that takes in a query received by a front-end web server the index is actually divided into a bunch of partitions by document so some of each of these partitions has some of the documents across the whole system you send a request to the index serving system to each partition it computes the best results for its subset of the documents and sends results back and the doc servers are used to generate the actual title and snippets you're going to show once you've decided which of the documents that you actually want to put on a results page so the basic principles of a system designed like this given a query each index server can return you a doc ID which is just a numeric identifier for document and score for that document and gives you a bunch of those pairs and the cost of the index serving portion of the system is essentially order number of queries times the number of documents in the index that you're searching on the doc server side the the goal is given a doc ID and a query you want to generate a title and a snippet one of the early things Google did that was different from search engines at the top search engines of the time is that the snippet was query dependent so given the search results we'd actually use the words in the query to decide what the 
summary of the document that they would show as opposed to just showing the first 30 words of the document or something that actually is a really important innovation but it does mean that the snippet is query dependent so you can't sort of pre compute what are the snippets you want to show for every document you have to do it for every doc ID document cross query pair and the cost of the doc servers is really just order number of queries because for every query you're just going to show up to 10 results or 20 results or something and so from a performance standpoint in a search engine the cost is dominated by the index serving portion the doc serving portion really doesn't matter at all which is why you won't hear me talk about it too much okay so now that Google was a real company we decided we should build our own hardware and we were going to live off the commodity hardware curve that was driving you know a lot of the you know plummeting prices for desktop computers so we actually decided that we would build our own hardware from these commodity components the commodity components were great because they were really low price for what you got they didn't have a lot of fancy sort of server level features that you might see in higher end machines but they were pretty cheap so we actually bought components assembled them ourselves this was our first attempt before we hired mechanical engineers they were affectionately known as cork boards because we actually have trays here now there are several lessons from this particular design one is each tray had four computers on it and they all share the power supply to save some little bit of money but it kind of induces additional failure modes that you really don't like when the power supply fails I had four reset switches for the so you can an operator could reset the each individual computer um and I had a thin layer of cork to insulate the motherboards from the metal tray that they were sitting on so the system in 1999 looked pretty similar to the original research project accepted and grown a bit actually the first thing I worked on when I showed up at Google's I said we need to add system so I said okay let's find add system so I'm not going to talk much about in the rest of the talk but there are all kinds of interesting features in the add system you can view the advertising system as really another form of information retrieval with some additional kinds of constraints like budgets for advertisers and cost-per-click metrics and so on but I'm not going to really dwell on it in this talk and in order to add more search capacity in a system you essentially take the index data and you replicate it so that you have a whole bunch of machines that can deal with index shard 0 and a whole bunch of machines that can deal with index shard 1 and you send the request you pick one replica did I actually so you pick one of the replicas for each shard and you send the request there the other thing we added obviously was cash servers this is a sort of a no-brain no-brainer thing to do so cashing in web search is a very useful thing to do we actually cache both the results of index results and dock requests hit rates you typically see are maybe thirty to sixty percent depending a lot on a lot of different factors one is how often you flush the cache or update the index and have to invalidate the cache the other is kind of the mix of query traffic you know if it's a data center in Europe it'll see a wider variety of languages and so that will mean you'll 
typically get fewer cache hits than data center in the US where you see mote more English queries and fewer other language queries how much you personalize the search results and how that affects what you can cache so the main benefits of the caching system are obviously performance because a few machines dedicated to caching maintaining a cache on disk do the work that hundreds or thousands of machines in your backends or systems are actually trying to do so a few machines do 50 percent of your work and then the other thousand do the other 50 percent you also get much lower query latency so if you get cache hits you just basically f31 to seek to read the data in the cache and you return the users you don't have to do any of this distributed distribution of our pcs to discharge and so on and also the queries that hit in the cache tend to be both popular because obviously they hit the cache someone else has already issued this query and expensive you know they tend to be shorter queries single word queries that tend to have a lot of results and you end up having lots of documents that get retrieved for these queries and lots of documents to score um one of the things to be aware of is that there's a big latency spike and a big capacity drop when you flush the cache where you update the index so we also had to carefully staged when we would roll out a new index when we did it once a month to make sure we didn't do it peak traffic times and so on so I'm not going to talk too much about our indexing system except maybe in a little bit in the later versions of it but in 1988 8 98 and 99 it was basically a simple batch indexing system it didn't the initial versions didn't really even have check pointing so you just kind of try to take a whole bunch of double raw documents on disk take all the words from them sort them and invert them and you would essentially then end up with data structures that you could use to serve the index chards we didn't actually have checksumming at the raw data and the machines we bought at that time consumer class machines typically didn't have not only didn't they have ECC they didn't even have parity in their memory so a rather frustrating thing when you're sorting a terabyte of data without any parity is that it ends up mostly sorted and if you try it again it ends up mostly sorted a slightly different way so it's especially bad for merge sorts because you and you flip a bit and then all of a sudden you ignore like all the data downstream of that particular input file you're merging so colleague of mine Mike into the this to programming with adversarial memory and that so we developed one of the early kind of modules we developed was a file abstraction for design for small records that actually appended little check sums and could resynchronize with resynchronization patterns when it detected corruption this is still in use today for a variety of things and it's producible so as we got more mature we kind of got more comfortable with our computer design we ended up building computers with cases and all the connections on the front which was a good idea so basically this is kind of a design where we still had pretty dense computational power per rack so we actually were much denser than any other sort of data center user at that time we had a single power cord and this thing we'll never connection per machine that's all you had to plug in and at the time we were not running our own data centers we were in hosting centers that charged us by the square foot and not 
any other factors, which was kind of a curious business model, but what did we know? So our incentives were to pack as many machines as we possibly could into those square feet, and we often had to help them a little bit with some cooling. As a consequence we actually got pretty good at moving out of bankrupt hosting providers and into other ones; you can get pretty efficient at it: you have the racks all ready to go, you wheel them in, you cable together the top-of-rack switches, and away you go. One of the important things in this period, roughly 1999 to 2001, is that we were really growing two of those dimensions at the same time. The index grew by a factor of 20 or so over that period, and at the same time we were getting 15 to 20 percent traffic increases per month and signing big deals. We signed a deal with Yahoo to provide their search service in July 2000, and basically our traffic doubled overnight as a result of that deal: July 5th or whatever it was, we turn on the spigot and now we have to handle twice as much traffic. So we did a lot of work on the performance of the index serving system. We were deploying more machines as fast as we could, but it was still pretty challenging, and we couldn't operationally deploy them as fast as we needed, so we basically had to come up with lots of software improvements over that period. We'd roll out a new index, then work on software improvements for a while, then roll out the new index with some new server software that was hopefully higher performance, and so on. Over this period this is the way things went: we wanted to increase our index size, because we believed index size was a very important way of increasing search quality. Basically, you want to be the place where, if you do a search and there's a page on the web about it, you find that page, and in order to do that you have to have a very large index. So we were constantly increasing the size of the index. When you increase the size of the index, you have to keep partitioning it more finely in order to keep response times within a reasonable range; if you tried to have just one or two index partitions, your latency would be too high. So essentially, as you're increasing the index size you're adding more and more index partitions, and as traffic is growing you're adding more and more replicas: more shards, a bunch more shards, more replicas, more shards. Now, a rather large problem with sharding the index this way is that you end up doing a seek on every index shard for every term in the query, and seeks are not the most highly performant operations one can do, so you end up being very disk-seek limited and not really utilizing all the disk bandwidth you have. There are a lot of tricks you can do: you can build auxiliary data structures on disk that let you get more information per seek by pre-intersecting terms that commonly appear together in queries, and you can try to compress the index data so you have to read less from disk, which helps because in the time you're not seeking you have less stuff to read. But essentially it becomes more and more problematic. Now, one thing that happens as you work on index compression and add more and more machines is that you eventually realize: hmm, if I look at all these machines we're using to serve this index, I could actually hold one copy of the index in memory across all of them.
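To put some rough numbers on the seek-limited problem just described (before the move to the in-memory index), here is a back-of-the-envelope sketch. All the constants (10 ms per seek, one seek per query term per shard, one disk per shard replica) are my own illustrative assumptions, not figures from the talk.

```cpp
// Back-of-the-envelope: why a disk-based, document-sharded index is seek-limited.
// All constants are illustrative assumptions.
#include <cstdio>

int main() {
  const double seek_ms = 10.0;                     // one disk seek ~10 ms
  const double seeks_per_sec = 1000.0 / seek_ms;   // ~100 seeks/sec/disk
  const int terms_per_query = 3;                   // e.g. [circle of life]
  const int disks_per_replica = 1;                 // shard replica served off one disk

  // Every query hits every shard and does ~1 seek per term on that shard,
  // so a single shard replica can sustain at most:
  double qps_per_replica =
      seeks_per_sec * disks_per_replica / terms_per_query;
  std::printf("max ~%.0f queries/sec per shard replica (seek-bound)\n",
              qps_per_replica);

  // Serving 1000 QPS therefore needs ~30 replicas of *every* shard,
  // no matter how much disk bandwidth is left unused.
  std::printf("replicas needed for 1000 QPS: ~%.0f\n", 1000.0 / qps_per_replica);
  return 0;
}
```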
So in 2001 that's exactly what we did: we moved to a completely in-memory index system, and we've had that pretty much ever since for all of our index servers. This was basically the design we came up with, and there's another layer of distribution in there, what we call balancers. The front-end web server talks to one balancer per shard; the balancer then talks to each machine in the shard, because there's now only one replica of any given piece of index data, so it has to talk to all the machines within the shard, gets the results, combines them, and sends the response back to the web server. We still have cache servers and doc servers, and those work pretty much the same way. There are a lot of really good things about this. First, you get a really big increase in throughput, because you're no longer doing seeks, you're just reading from in-memory data structures. You also get a very big decrease in query latency, because you're not waiting for mechanical disks, especially at the tail. As for really expensive queries in a disk-based indexing system, this was our canonical example that caused us all kinds of headaches. We were looking through query logs trying to find which queries were really expensive, and this one was just orders of magnitude worse than the rest of the queries in the sample we were looking at: "circle of life" as a phrase. The problem with this query is that all three of these words are relatively common, "of" is extremely common, and they hardly ever occur as a phrase on the web. That's just the worst thing you can possibly have in a retrieval system, because you're essentially streaming through all this data for "circle" and for "of" and for "life" looking for one rare document that has them next to each other as a phrase. I think we did something like 30 gigabytes of I/O for this query. In the in-memory system it's very fast, because the seeks don't really cost you much: you just move to a different position in the posting list in memory and away you go. Now, when you have the index in memory, there are two main issues that really bite you as you first start to deploy such a system. One is variance. A query is now going to touch thousands of machines, not just dozens, so things like randomized cron jobs can cause you trouble. Our operations folks had decided that we were going to have cron jobs that run every five minutes on each machine to do these kinds of housekeeping things, and that it would be good to randomize those across different machines, so that not all the machines were running their cron jobs at the same five-minute marks but were spaced out a bit. It turns out that for an in-memory index this is entirely the opposite of what you actually want, because at any given point, if you've randomized things, at least one machine is doing this cron-job activity and so has less CPU, or is busy churning its disk, or is sending stuff over its network. Whereas if you just fixed it at five-minute intervals, so that every machine had a little hiccup every five minutes, that would actually be better: most of the time all the machines would be running at full throttle, and there'd be a slight hiccup that affects a relatively small number of queries, rather than some queries being affected throughout the whole five-minute period.
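A quick calculation of that point about randomized cron jobs, under made-up but plausible numbers (1,000 machines touched per query, a one-second housekeeping hiccup on each machine every 300 seconds); the exact figures are my assumptions, not the talk's.

```cpp
// Fraction of queries that see at least one "hiccuping" machine when every
// query fans out to N machines. Illustrative numbers only.
#include <cmath>
#include <cstdio>

int main() {
  const int machines_per_query = 1000;
  const double hiccup_seconds = 1.0;    // cron-job busy time
  const double period_seconds = 300.0;  // cron runs every 5 minutes
  const double p = hiccup_seconds / period_seconds;  // prob. a machine is busy

  // Randomized cron offsets: machines hiccup independently, so almost every
  // query catches at least one slow machine.
  double frac_random = 1.0 - std::pow(1.0 - p, machines_per_query);

  // Synchronized cron: all machines hiccup in the same one-second window, so
  // only queries landing in that window are affected.
  double frac_synced = p;

  std::printf("randomized:   %.1f%% of queries hit a slow machine\n",
              100.0 * frac_random);
  std::printf("synchronized: %.1f%% of queries hit a slow machine\n",
              100.0 * frac_synced);
  return 0;
}
```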
The other big problem you have is availability. You used to have many replicas in the disk-based index system, but now you only have one replica. One thing that happens is that machines fail periodically, and you'd like it to be the case that cnn.com doesn't suddenly drop off the face of the world when that happens. So for really important documents, you can just pick, say, the top-PageRank documents and replicate them across multiple machines, so that you don't completely lose those important documents. The other thing that happens is what we call queries of death, where when you receive a request and start doing some processing for it, some bug in your software suddenly causes the process to abort: it segfaults, or you didn't allocate enough space in some buffer for a really long query and you've never encountered this before. Obviously your system gets more and more robust the more of these you see, but it's never going to be completely prevented. And if you're going to send the same request to a thousand machines and it's a query of death, really, really bad things happen: they all crash, and your whole data center kind of goes belly-up for many minutes at a time as it recovers. The solution is that you can send what we call a canary request first, to one machine. If you get a response: excellent, you're free to send it to all of them. If it fails, well, depending on how ambitious you're feeling, you could say okay, I give up, or you can try another machine, because it could have just been a coincidence that the RPC was sent to a machine that crashed for some other reason, not because of that query. If it crashes the second time, you're pretty sure this is not going to be a good thing to send to all your backends, so you can just reject that request. The nice thing is you then crash only a few servers, not thousands of them, and it's also a good idea to log what the query was, so that you can investigate further. (In answer to a question: we didn't really have a static blacklist; we had kind of a dynamic one in memory that we kept in the balancers, so you'd only crash a few backends, and then if the user hit reload you'd reject it even before sending it anywhere. And no, not usually: it's just some bug in some new release you've rolled out. Remember, we're rolling out new releases really fast, and you've replayed all of last month's logs, or a large fraction of them, to make sure it doesn't crash, but you're always getting new kinds of queries and things like that.)
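A minimal sketch of the canary-request pattern just described. The RPC layer here is faked with a callback type, and names like SendToBackend and the two-strike policy are illustrative assumptions, not Google's actual interface.

```cpp
// Sketch of a "canary request": try the query on one backend (at most twice)
// before fanning it out to every backend in the shard.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum class BackendResult { kOk, kCrashed };

// Stand-in for the real RPC to one index-serving backend.
using SendToBackend =
    std::function<BackendResult(size_t backend, const std::string& query)>;

// Returns true if the query was fanned out, false if it was rejected
// as a likely "query of death".
bool ExecuteWithCanary(const std::vector<size_t>& backends,
                       const std::string& query, const SendToBackend& send) {
  const int kMaxCanaryAttempts = 2;  // one retry: the first crash may be coincidence
  for (int attempt = 0; attempt < kMaxCanaryAttempts; ++attempt) {
    size_t canary = backends[attempt % backends.size()];
    if (send(canary, query) == BackendResult::kOk) {
      // Canary survived: safe (enough) to send to everyone else.
      for (size_t b : backends) {
        if (b != canary) send(b, query);
      }
      return true;
    }
    // Canary crashed; maybe it was that machine, maybe it was the query.
  }
  // Two crashes in a row: reject, and log the query for investigation.
  return false;
}
```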
So we kind of caught our breath a bit and then redesigned things a fair amount in 2004. This was the first time we were really able to rethink things from scratch, and essentially we unified the index servers and doc servers. We now have a multi-level tree for query distribution, generalized from what we had before, where we had a fixed number of levels, and we have leaf servers that are able to handle both index and doc requests in memory. There's a thing called the repository manager sitting on the side that deals with the fact that the index is made up of a whole bunch of shards, thousands of shards, and as a new shard becomes available we just switch that one shard. So an index switch is no longer this big monolithic event; it's a very gradual process that's happening all the time. The cache servers cache partial results, and then you issue requests for just the pieces of the index that are new since that cached entry. At that point we were also able to clean up a lot of the abstractions that had evolved over time. One of the things we wanted in this system was the ability to do easy experiments. When you have a new ranking idea or a new ranking algorithm you want to try, you often need access to some new kind of information that you want to pre-compute for every document in the index, and the old way of doing that was to build a test index with the new information baked in, which was a very painful process just to try out a ranking idea. So we wanted it to be easy to experiment: attach new data to an existing index, roll out a new ranking algorithm, try it on a fraction of the traffic, and if things look good roll it out to a larger fraction. We also wanted it to be higher performance, because we were getting more queries all the time, and we'd actually gone a little too far in terms of our index compression schemes: we were using very densely encoded bit-level encodings, because the original scheme was designed for things to be on disk. Ironically, the fact that it was so small is what made it fit in memory in the first place, but once you have a system where you care a lot about throughput, you actually want to relax how tightly you compress things to make them a little faster to decode. You're no longer streaming the data off disk, where you don't really care about CPU usage because the CPUs are so much faster than the disk; now you do care about CPU usage. So this slide is basically just saying we came up with a new format, and I'll skip ahead. The new format has a single flat position space, unlike the old format, which had a two-level identifier: a document ID followed by the word position within that document. The new format needs to be compact — we can't really afford a 32-bit value for every word occurrence in the index — and we wanted it to be very fast to decode. One of the great things about retrieval systems is that you get to work on all kinds of fun encoding formats, so I'll take a brief diversion. Given that we wanted it to be fast, we knew we weren't going to use bit-level encoding formats, but something byte-aligned. A very simple thing you can do when you want variable-length integers is to use seven bits out of every byte to encode seven bits of the value, and then a continuation bit to say whether there's another byte that follows with another seven bits of data. So numbers between 0 and 127 you can encode with one byte, those that need seven more bits take two bytes, and so on, and the continuation bit tells you whether a byte is part of the previous number's encoding or starts a new number. One problem with this is that there's a lot of shifting and masking and looking at continuation bits in order to actually decode a number. So one thing you can do instead is encode a number between 0 and 3 that tells you how many bytes the full encoding of the number occupies, and then use the other 30 bits to actually encode the value: you limit yourself to 30-bit numbers instead of 32, steal two bits, and then 0 means it's a one-byte number, 1 means it's a two-byte number, and so on. This is actually faster — you have fewer branches and less shifting and masking to decode things — if you're willing to limit yourself to 30-bit numbers.
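A minimal sketch of the first scheme just described, the 7-bits-plus-continuation-bit varint, with a comment noting the 2-bit length-prefix variant. This is generic varint code, not Google's actual index format.

```cpp
// Classic byte-aligned varint: 7 bits of payload per byte, high bit set when
// another byte follows. Values 0..127 take one byte, 128..16383 two, etc.
#include <cstdint>
#include <vector>

void PutVarint32(uint32_t v, std::vector<uint8_t>* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<uint8_t>(v) | 0x80);  // continuation bit set
    v >>= 7;
  }
  out->push_back(static_cast<uint8_t>(v));  // final byte, continuation bit clear
}

// Decodes one varint starting at p; returns a pointer past the last byte read.
// Note all the shifting, masking, and branching on the continuation bit --
// exactly the per-byte work the talk complains about.
const uint8_t* GetVarint32(const uint8_t* p, uint32_t* v) {
  uint32_t result = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    uint8_t byte = *p++;
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;
  }
  *v = result;
  return p;
}

// The second scheme in the talk instead spends 2 bits on a length code
// (0..3 => 1..4 bytes) and keeps only 30 bits of value, so decoding is a
// length lookup plus a masked load -- fewer branches, smaller value range.
```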
What we actually have in our index, of course, is a whole stream of numbers that we want to encode, so we can afford to say that the task we'll set for ourselves is to encode a group of four numbers at a time. Essentially we take that two-bit length encoding from the second scheme on the previous slide and pull those bits out into a one-byte prefix for the group of four numbers. So now we have a single byte that packs together the length information for each of the four numbers in the group, and then the rest of the data is in the remaining 4 to 16 bytes of the group, so the whole group is between 5 and 17 bytes, and you just encode the numbers that way. One nice thing about this is that decoding is pretty fast: you load the prefix byte, use that value to look up in a 256-entry table, and it tells you exactly where the offsets of the numbers are in the rest of the group and what masks you should use to decode them. Furthermore, you get a lot of instruction-level parallelism: once you've done that lookup, you can essentially decode all four of those numbers in parallel, and you don't take any branch mispredictions. It's roughly a factor of two faster than either of the other schemes. So, you know, there are lots of different encoding formats; it's all fun. (In answer to a question about leaving the numbers raw and uncompressed: oh, that would probably be quite a bit faster, but it's a memory trade-off. If you were just streaming from memory you could do it at gigabytes per second, but it would cost you probably three to four x in memory.)
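A sketch of the group scheme just described: one tag byte carrying four 2-bit lengths, followed by 4 to 16 data bytes. For clarity this version reassembles each value a byte at a time; a production-style decoder would use the tag byte to index a 256-entry table of offsets and masks and do unaligned loads, which is where the extra speed and instruction-level parallelism come from. Not Google's actual code.

```cpp
// "Group varint": encode four uint32s as 1 tag byte + 4..16 data bytes.
// Tag byte = four 2-bit fields, each holding (length - 1) of one value.
#include <cstdint>
#include <vector>

void EncodeGroup(const uint32_t v[4], std::vector<uint8_t>* out) {
  size_t tag_pos = out->size();
  out->push_back(0);  // placeholder for the tag byte
  uint8_t tag = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t x = v[i];
    int len = 1;
    while (x >= 256) { x >>= 8; ++len; }  // bytes needed, 1..4
    tag |= static_cast<uint8_t>(len - 1) << (2 * i);
    for (int b = 0; b < len; ++b)
      out->push_back(static_cast<uint8_t>((v[i] >> (8 * b)) & 0xFF));
  }
  (*out)[tag_pos] = tag;
}

// Decodes one group; returns a pointer just past the group (5..17 bytes total).
const uint8_t* DecodeGroup(const uint8_t* p, uint32_t v[4]) {
  uint8_t tag = *p++;
  for (int i = 0; i < 4; ++i) {
    int len = ((tag >> (2 * i)) & 0x3) + 1;  // 1..4 bytes
    uint32_t x = 0;
    for (int b = 0; b < len; ++b) x |= static_cast<uint32_t>(*p++) << (8 * b);
    v[i] = x;
  }
  return p;
}
```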
So then the system continued to evolve. One of the major changes we made in 2007 was the notion of universal search. Previously we would just search web documents when you went to google.com, and you had to go to these other properties we'd developed over time, like books.google.com and news.google.com, to search those other kinds of corpora. A problem with that is that in some cases the user doesn't necessarily even know these other corpora exist, or they're certainly not going to take the time to go to each one of them and perform their search on all the different properties. So we'd like to search all the information we have when they go to google.com, and in order to do that we essentially took all these different properties and said, okay, we're going to search them at full web traffic levels, and we have an indexing service that sits underneath and is able to roll out new index updates to these different corpora and switch things over. The main issues here are performance — obviously most of the corpora, since they'd been their own isolated properties, were used to traffic levels ten to a hundred times lower than what they were going to get if we searched them with all web traffic, and a lot of them had ranking functions that weren't designed as efficiently as they could be, so we had to do a lot of work there — and then a bunch of information-retrieval issues about how you decide which corpora are relevant to a given query. One thing we found is that it's very hard to predict, just from the query, which corpora are going to give you good results; what we found easier is to issue the query to all the corpora, get the scores back, and then use that information to decide which corpora are most relevant. It's a much harder problem to do it from just a couple of words in the query. And then there are lots of interesting UI issues about how you organize the results from different corpora: should you interleave them, should you have separate sections for different kinds of things? It turns out that for different kinds of documents the answer is that we do both, depending on the kind of document. Okay, so I'm going to switch gears a bit and talk about how our underlying system infrastructure has evolved. This is a fairly modern incarnation of our rack design. We're back to not having cases on our machines — you get better airflow, as it turns out — but they look better than the corkboards, don't they? Essentially these are still commodity-class machines; they have a rack-level switch that connects to a central switch, and they're all running Linux plus a bunch of in-house software I'll describe. Now, this is a list of things that go wrong that I collected from one of our operations people. Things get a little bit better after the first year of a cluster, because it's broken in a little, but these are the kinds of things you have to deal with when you're building software on top of this kind of computing platform. Obviously things like individual machine failures or disk drive failures you would expect, but there are a lot of things that affect multiple machines at the same time, like a whole switch dying. That's actually not so bad, because those machines just drop off the face of the map; but if a switch gets slightly flaky and starts dropping some packets but not all, so you can still sort of talk to the machines but it's just very slow, that's actually worse from a software standpoint, because the machines look alive but are only half there. That's quite painful. In terms of the long-distance links that connect our data centers, these are all actual reasons that have caused our long-distance links to die — things you would not expect, and things that are not taught in software and hardware reliability courses. Should I tell you about the dead horses? It turns out horse graves are very deep and will actually sever fiber that you've buried. And drunken hunters: this was in Oregon, and apparently there wasn't much to hunt, so some hunters saw some nice things mounted on poles across the valley and decided it would be interesting to try to shoot them. So in this kind of environment you really have to have the reliability and availability come from the software and not from the hardware, and even if you were to spend more money and buy more reliable hardware, at the scale we're operating that hardware is still going to fail, so you still need that reliability and availability to come from the software. I'd actually much rather have three times as many machines that are less reliable, because you get a lot more computing per dollar that way. Okay, but assuming you have a lot of machines and you're running in this environment, there are a few things you'd like to be able to do. One is you'd like to be able to store data persistently with high availability — that means not on a single machine, obviously, and maybe not even on a single rack, given the problems we saw previously — and you'd like high read and write bandwidth to the data you've stored. You'd also like the ability to run large-scale computations reliably, without having to deal with individual machine failures yourself.
Let me briefly talk about the Google File System, which was developed at Google in 2003. Essentially it's a file system optimized for the kinds of workloads we have, which were very large files — typically whole collections of things like documents we've crawled — so the files were hundreds of megabytes or gigabytes in size, not tiny 5 KB files. The design we came up with was that a master manages the file system metadata: file names, and information about which servers are storing different pieces of each file. The data itself is spread across lots of machines in the data center running a process we call the chunkserver, and clients that actually want to read data from a file talk directly to the appropriate chunkserver. So clients talk to the master to figure out where the data is, but then they read directly from the chunkservers. Files are broken into chunks of roughly 64 megabytes, and because we want to tolerate machine failures, we replicate chunks across multiple machines: multiple identical replicas of the same chunk, typically three. In the example I've shown, C0 is replicated on both chunkserver 1 and chunkserver N, C5 is replicated on three of the chunkservers illustrated, and so on, so a client actually has three possible locations it can pick from to read the data. A cluster in our environment is somewhere between five and twenty thousand machines, typically with one or a handful of hardware configurations — the different configurations might be some machines with one disk and some machines with twelve disks, but other than that they're pretty much identical. Different clusters built at different times will have slightly different generations of our hardware, so one built two years ago might have older processors that are a little bit slower than one built today, but within a cluster it's typically pretty homogeneous. As I said, they're all running on this commodity hardware, they're all running Linux, they're all running chunkserver processes, and somewhere within the cluster there's a GFS master. We also have a cluster scheduling system so that you can run jobs and tasks: each machine runs a scheduling daemon that the scheduling master talks to, to coordinate task start-up and find out when machines have failed or come back up, and so on. We also run something I'm not really going to talk about much, a distributed lock service, so that different processes running on these machines can talk to the lock service and grab distributed locks or share small amounts of configuration information among themselves. And then on top of that we run a whole bunch of different kinds of user-level jobs: one job might be the index servers in our web search system, another might be a big batch production job to rebuild the index. Okay. So one of the problems you have when you have lots of data is that it takes a long time to do anything with it. For example, if you have 20 billion web pages, that's going to be around 400 terabytes of data; if you try to do anything with all of it on one computer, it will basically take you several months, and that's obviously not very reasonable, so you have to parallelize the computation somehow. The good news is that if you parallelize, you can actually get pretty decent response times for fairly data-intensive tasks: if I want to do something with all the web pages and I'm able to parallelize that across a thousand machines, then I can do it in, you know, three hours instead of three months.
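The "months versus hours" claim is easy to check with the kind of back-of-the-envelope arithmetic discussed later in the talk; the ~30 MB/s sequential disk read rate is my assumption for hardware of that era.

```cpp
// How long does it take just to read 20 billion pages (~400 TB)?
#include <cstdio>

int main() {
  const double total_bytes = 400e12;  // ~20 KB/page * 20 billion pages
  const double read_rate = 30e6;      // ~30 MB/s sequential, one disk (assumed)
  const double one_machine_days = total_bytes / read_rate / 86400.0;
  const double thousand_machines_hours = total_bytes / (1000 * read_rate) / 3600.0;

  std::printf("one machine:   ~%.0f days (several months)\n", one_machine_days);
  std::printf("1000 machines: ~%.1f hours\n", thousand_machines_hours);
  return 0;
}
```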
The bad news is that there are a lot of issues that make this difficult: you have to communicate between the different pieces of your parallelized job, coordinate who's going to do what, recover from machine failures somehow, and status reporting is pretty important, because I'm sitting there impatiently waiting for my three-hour job to finish, and so on. And the bad news, part two, is that you're going to repeat a lot of this for every different problem you solve, in slightly different ways. So in 2003 my colleague Sanjay Ghemawat and I were working on rewriting our indexing system to try to make it easier to incorporate new kinds of information into it. One of the things about an indexing system is that it starts with raw page contents on disk and then goes through a whole bunch of phases to compute intermediate data structures that you're eventually going to bake into either the index-serving system or the doc-serving system, and over time you keep adding more and more of these phases to compute other kinds of derived information that you either know is useful in your ranking algorithm or, if you're experimenting, think might be useful to have available. Prior to this period, each phase was essentially a handwritten parallel computation: we hand-parallelized it across a bunch of different chunks of input, and we had handwritten checkpointing code to deal with fault tolerance — when a machine crashed, you would revert to the last checkpoint that machine had saved, restart the computation from there, and roll forward. Eventually we squinted at all these different phases in our indexing system and said, you know, a lot of these look pretty similar: they're extracting something from some input, transforming it in some way, and producing some output. So we came up with a programming model called MapReduce that allows you to express, relatively simply, in a couple of functions, what it is you're trying to do to the input data, how you want to transform it, and what output to produce. The nice thing is that for computations expressed this way, you can hide a lot of the messy details that were previously intermingled with the actual simple computation you're trying to do, put all that gunk in a library, let the library deal with those issues, and use that library for all kinds of different computations. A typical problem you're trying to solve with MapReduce is: you want to read a lot of data, extract something you care about with a map function, the library internally shuffles and sorts it, and then you apply a reduce function that says how you want to combine the data generated in the map phase. That outline is basically the MapReduce programming model; the outline stays the same, and you write the map and reduce functions for your particular problem.
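The "typical" introductory example usually used to teach this model is word counting. Here is a minimal sketch of the two user-supplied functions in that style; the tiny driver in main() stands in for the library's split/shuffle/sort and is not Google's actual MapReduce API.

```cpp
// Word count in the MapReduce style: the user writes Map and Reduce; the
// driver in main() stands in for the library's split, shuffle, and sort.
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;

// Map: called once per input record (here, a line of text).
void Map(const std::string& line, std::vector<KV>* out) {
  std::istringstream words(line);
  std::string w;
  while (words >> w) out->push_back({w, "1"});  // one count per occurrence
}

// Reduce: called once per distinct key with all the values for that key.
void Reduce(const std::string& word, const std::vector<std::string>& counts) {
  std::printf("%s\t%zu\n", word.c_str(), counts.size());
}

int main() {
  std::vector<std::string> input = {"the quick brown fox", "the lazy dog"};
  // "Shuffle": group intermediate values by key (the real library does this
  // across thousands of machines).
  std::map<std::string, std::vector<std::string>> groups;
  for (const std::string& line : input) {
    std::vector<KV> pairs;
    Map(line, &pairs);
    for (const KV& kv : pairs) groups[kv.first].push_back(kv.second);
  }
  for (const auto& g : groups) Reduce(g.first, g.second);
  return 0;
}
```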
I'm going to walk through an example that's maybe a little more interesting than the typical examples used to introduce MapReduce. This is actually a real example from our production Maps system. Our input in this case is a whole bunch of lists of geographic features — roads in the world — and we want to end up generating map tiles where we've rendered all those roads at some resolution. I'm ignoring multiple resolutions here; let's say there's just one resolution we care about. So we need to somehow collect together all the data we need for rendering each tile in the world, starting from this input data. We can assign a numeric key to each tile in the world; in this case I'm showing two tiles, tile 0 and tile 1 — this is actually Seattle, if you're familiar with Seattle — and there are four features shown, including I-5, which intersects both of these tiles. My map function gets called on each of these geographic features, figures out which tiles that particular feature intersects, and emits information about that feature keyed by each tile it intersects. So when it's called on I-5, it determines that I-5 intersects both tile 0 and tile 1, and it emits the data it needs for I-5 under both keys, generating that set of key-value pairs. Same thing with Lake Washington, which runs north-south and intersects both tiles: we do the same thing there. The 520 bridge is only in the top tile, so we emit just one copy of that, keyed to tile 0, and similarly for I-90, which intersects only tile 1. Then there's this big shuffle phase, where we take all the keys the map function has output and combine things by key: we cluster together everything that has key 0, and we have all the data we need to render tile 0, and then we cluster everything that has key 1, and we have everything we need for tile 1. Then we invoke the user's reduce function, which for every tile gets the list of all the geographic features that intersect that tile, and the actual reduce function in this case renders them into a JPEG image and generates the JPEG as its output. So it's a pretty simple computation when you express it in MapReduce, and the underlying implementation takes care of a lot of the details.
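The same example expressed as the two user functions. The Feature type and helpers like OverlappingTiles and RenderToJpeg are hypothetical stand-ins for whatever the real Maps pipeline uses, and the Emit hooks stand in for the framework; the structure (map emits the feature once per tile it touches, reduce renders one tile) is the point.

```cpp
// Map-tile rendering as a MapReduce: key = numeric tile id, value = feature.
#include <cstdint>
#include <string>
#include <vector>

struct Feature {           // a road, lake, bridge, ... (assumed shape)
  std::string name;
  std::string geometry;    // encoded polyline/polygon; details don't matter here
};

// Hypothetical helpers standing in for real geometry / rendering code.
std::vector<int64_t> OverlappingTiles(const Feature& f);  // e.g. I-5 -> {0, 1}
std::string RenderToJpeg(int64_t tile, const std::vector<Feature>& features);

// Hypothetical emit hooks provided by the framework.
void EmitIntermediate(int64_t tile_id, const Feature& f);
void EmitOutput(int64_t tile_id, const std::string& jpeg_bytes);

// Map: called once per geographic feature; emit it under every tile it intersects.
void Map(const Feature& f) {
  for (int64_t tile : OverlappingTiles(f)) EmitIntermediate(tile, f);
}

// Reduce: after the shuffle, called once per tile with every feature that
// intersects it; render the tile and emit the image.
void Reduce(int64_t tile_id, const std::vector<Feature>& features) {
  EmitOutput(tile_id, RenderToJpeg(tile_id, features));
}
```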
In particular, in the implementation there's a master that coordinates the process of splitting the input into a bunch of different chunks so it can be parallelized across a whole bunch of machines. I didn't mention that the user can also specify a partitioning function for the reduce phase that groups different intermediate keys into different partitions, so they control the level of parallelism you get on the reduce side. Essentially the master finds out who the free workers are and assigns them map tasks; a worker reads all the records in its map task, invokes the user's map function, and generates the intermediate data into files. The master then assigns reduce tasks to free workers, and those workers read the intermediate data generated by the workers that performed the map tasks, sort it, and apply the user's reduce function. In graphical form it looks essentially like this: you have a whole bunch of input data sitting in a distributed file system, the master breaks it into chunks and tells different machines to process different chunks, the map tasks generate output onto their local disks (or in memory if there's not very much output), the framework shuffles it, and then the user's reduce function is invoked and produces the output. One thing to know is that for really large computations it's really more about network bandwidth, and about making efficient use of your disks and of the network transfers you're able to do, than about CPU and DRAM. For small computations or CPU-intensive ones that's not necessarily the case, but a lot of the performance in large MapReduces is dictated by the shuffle, and the framework takes care of that for you. Another thing worth pointing out is that you have pretty fine-granularity tasks. We don't want it to be the case that, in the course of a single MapReduce computation, each worker does just one map task; we'd rather have them do ten or twenty, so that when a machine fails you can recover very fast by giving the twenty tasks that machine had done, one each, to twenty other machines — recovery is very fast that way. It also allows you to pipeline the network transfer. Here's a picture with three map tasks and two reduce tasks: the first two workers are assigned the map tasks; this one is first assigned map task 1, finishes it because it was pretty fast, and then does map task 3; this one works on map task 2, which takes a little longer. As soon as map task 1 finishes, the two workers assigned to do the reduce work can start reading the output of map task 1 across the network, so you're actually shuffling in parallel with running the other map task computations. You end up with a lot of pipelining in this system, and you get better dynamic load balancing by having finer-granularity tasks. It's also pretty fault tolerant: you can handle a lot of the issues when a machine fails by just re-executing its tasks; you do a little bookkeeping to keep track of who now has the current output for map task 7 or whatever, and it's actually pretty robust. In fact, we were running experimental MapReduces early in the MapReduce development, and we'd kind of gotten our signals crossed with the operations people: it turns out they were rewiring the network in the data center we were running our large MapReduce in, and they kept unplugging rack after rack of machines and rewiring them. We'd lose 80 machines and think, huh, wonder what's going on, and the MapReduce framework would recover; then they'd unplug the next rack and we'd lose another batch of machines. Quite painful, but we didn't really know what was going on at the time; it took a little longer than we expected, but we actually handled it, and we discovered what had happened after the fact, which is kind of cool. Another refinement you can make in this environment, all handled automatically by the framework, is backup tasks. If you have workers that are slow for reasons independent of the actual input data — maybe they're just running a lot of other jobs at the same time, or their network is congested — they can lengthen the completion time a lot, not because of the input data but because of these other issues. So one of the things we do near the end of either the map or the reduce phase is spawn backup copies of the remaining tasks: we actually run multiple copies of map task 7, say, and whichever one finishes first wins. This brings down the job completion time tremendously for MapReduces. Another important thing in our case is the locality optimization: the scheduling policy is that we ask GFS for the locations of the replicas of the input file blocks, and we try to schedule map tasks so that they're local to that particular
machine or at least on the same rack the interest is oh so here's some stats about how MapReduce has been used over time within Google it's you know I won't really read it oh we're roughly processing an exabyte a month now running four million jobs a month got or interesting so no ah nine hundred and forty six thousand turbine now nearly next night excellent yes yes August so four was three terabytes I'm not going to touch on this too much I'll just briefly mention a current system I'm working on with seven or eight other people basically it's a system called spanner that is designed to be a piece of infrastructure that runs in a single instance across multiple data centers so a lot of the systems we developed at Google run one instance in one data center and then another instance in another data center and this is kind of the first system we've really tried to build that spans many data centers at a large scale so it has a single global namespace for data so that if you decide to move or change the replication of some data or move a copy of some data the name of that data doesn't change which is a pretty important operational property within Google because right now if the calendar group has some state stored in this data center and they decide to move it that's that actually changes the name when they store it in a different GFS self just kind of a pain because that means sharing but cross different groups is difficult so you always have to go ask the calendar people okay where's your cook data now and we're working on supporting a strut a mix of strong and weakly consistent operations across different days across data centers and the hope is that the system is much more automated especially when it's moving data across different data centers then right now which is kind of a fairly manual labor intensive process so this is kind of a very broad high-level view of the zone design of spanner so you have going to have a whole bunch of zones around the world in different data centers and these zones are going to store copies of data and might also have replicas of that data and different other zones and we like the zones to be semi-autonomous so they can still continue functioning and doing load balancing within the zone on their own even if they're partitioned away from the rest of the of the system and we also want them to be consistent after that like let's say those knit two network links are severed and then they come back up we'd like to recover a consistent view of the data and users are going to specify kind of high-level desires rather than very specific things we'd rather they say I want three copies of the data to in Europe and one in the US rather than saying I want it in exactly these three data centers okay all right the final portion of the talk is a set of kind of design experiences and patterns that I think you'll see derived from some of these systems I've described they're not obviously all-encompassing but they're kind of good rules of thumb that we found crop up across multiple systems so one of the first things that is pretty important when you're developing large complicated software systems is to be able to break them down into separate subsystems and especially at Google because everything is going to be distributed across many machines those separate distributed services provide good natural interface point where you can you know fairly easily specify the interface to the spelling correction system and then this group of people can work on the spelling correction 
system fairly independently of, you know, the people working on the ads system or the clients of the spelling system. They can roll out new versions at whatever pace makes sense for them and are largely decoupled from the people working on other distributed services. This is pretty important: small teams can work independently of one another by carefully defining these interfaces and services. It also makes it easier for us to have a lot of engineering offices around the world. We went through a big expansion about three or four years ago, where we went from two or three engineering offices to thirty around the world, and part of the reason we were able to do that is that we can break things down into separate services and give ownership of those services to different offices. As an example, every search you do on google.com touches 200 separate services — probably more than 200; who can keep track? Another pretty important thing when you're trying to design a system is: given some basic problem definition, how do you choose the best solution? "Best" has a lot of qualitative aspects — is it easy to understand, is the interface clean — but it also has quantitative aspects, like how it's going to perform. So a really important skill when you're designing systems is being able to estimate, with a back-of-the-envelope kind of calculation, what the performance of a system, or of several alternative designs, is going to be without actually having to build it. If you have to build four versions and measure them in order to figure out that three of them were really bad ideas, that's not going to be very good. A while ago someone asked me to give a talk, and I put together this list of numbers I think everyone should know — maybe not everyone, but everyone designing software systems — and they span a pretty broad spectrum in terms of magnitude. One of the things you notice is that data centers are far apart, so it takes you a long time to go between them. Another is that memory is really fast and disks are pretty slow. The middle one, compressing a kilobyte with a cheap compression algorithm, is actually pretty interesting, because often you can get a factor of two compression with a very lightweight compression algorithm and save yourself quite a bit of network bandwidth. And try to avoid disk seeks whenever possible — motherhood and apple pie. (In response to a comment about the California-to-Netherlands round-trip number: it is pretty impressive — that would be something like 30,000 miles if you were going at the speed of light; it's not quite the speed of light, but it's a significant fraction of it, since it's basically being sent over fiber-optic cables.) So, as an example of how you can use these: how long is it going to take to generate an image results page? Let's say I'm tasked with generating 30 thumbnails for an image search. One design is that, for each of the 30 images, I do a disk seek, read the quarter-megabyte image, and then go on to the next image; that design would take roughly half a second. For design two, it's obviously not too difficult to figure out that if you issue the reads in parallel and you have enough disks, it'll go quite a bit faster: the back of the envelope tells you it should be twenty milliseconds or so — probably a bit more in practice because of variance, since when you're talking to lots of disks some of them are going to be a little slower than others, but it'll be significantly better than the first design.
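Working the thumbnail example through with commonly quoted values from the "numbers everyone should know" list (about 10 ms per disk seek, roughly 30 MB/s sequential disk read); the exact constants here are assumptions, used only to reproduce the rough figures given in the talk.

```cpp
// Back-of-the-envelope for an image results page with 30 thumbnails.
#include <cstdio>

int main() {
  const int    images = 30;
  const double image_mb = 0.25;   // ~1/4 MB per thumbnail source image
  const double seek_ms = 10.0;    // one disk seek (assumed)
  const double disk_mb_per_ms = 0.03;              // ~30 MB/s sequential (assumed)
  const double read_ms = image_mb / disk_mb_per_ms;  // ~8.5 ms per image

  // Design 1: read the images one after another from a single disk.
  double serial_ms = images * (seek_ms + read_ms);

  // Design 2: issue all 30 reads in parallel across enough disks; latency is
  // roughly one seek plus one read (ignoring variance, which adds a bit).
  double parallel_ms = seek_ms + read_ms;

  std::printf("serial:   ~%.0f ms (about half a second)\n", serial_ms);
  std::printf("parallel: ~%.0f ms\n", parallel_ms);
  return 0;
}
```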
The back-of-the-envelope calculations also let you work out lots of different variations. If you're going to cache in this system, does it make sense to cache the thumbnails as single images? Should you cache a whole set of thumbnails in one cache entry? Does it make sense to pre-compute thumbnails? These are the kinds of calculations you should be doing over and over in your head when you're designing a system (there's a small sketch of this kind of estimate a little further down). The other thing that's pretty important is to know the back-of-the-envelope numbers for the things you're building on top of. If you don't know roughly how long it takes to do a write into your underlying storage system, the disk-based, higher-level software system you'd cache in front of, then you really can't tell whether it's a good idea to cache things or not.

Another thing I've found when you're designing software infrastructure that's going to be used by a bunch of different groups or people is that if you ask them what they want, they'll tell you many different things, and it's really important to listen for the commonalities and figure out what it is they all really want. If they tell you they want eight different things, usually six of them are in common, and you should pick those six and do them, because that's clearly going to help a lot of people. If you really stretch yourself, you can usually handle an extra one that will help a lot of people too. But if you try to do all eight, it's probably going to result in a worse system for everyone, because the system will get much more complicated; that eighth feature drove you over the edge of complexity in some way, and that compromises all the other clients, possibly even the clients you were trying to help. It's a subtle distinction, but it's also important to listen to what they're saying and sometimes translate that into "if I did this feature, then you wouldn't need this other thing you're asking for," and sometimes they'll say, "oh yeah, that's true."

Another thing I've found is that you don't want to build infrastructure just because it's fun to build infrastructure; you want to build it to address real needs. A related trap when you're designing something is to imagine hypothetical uses, "what if we had a client who wanted to do this," and design in a lot of complexity because you imagine it would be useful, even though no one is actually asking for it today. You need to try to separate likely potential uses from unlikely ones that you just think would be challenging to handle. The best approach, ideally, is to use your own infrastructure at the same time you're building something on top of it, because you get very fast iteration that way: you can put a feature in the infrastructure if it makes sense there, or put it in your application if it doesn't, and get quick feedback on how the interfaces are evolving and how hard they are to use. If you can't do that, then you should at least sit very close to the people trying to use your infrastructure and get very rapid feedback from them.
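Here is the kind of caching estimate mentioned above, as a minimal Python sketch. Every number in it, the hit rate, the cache read cost, and the cost to regenerate a thumbnail page from disk, is an assumption made up for illustration; the point is only that the comparison is a few lines of arithmetic, not a prototype.

```python
# Toy estimate: is it worth caching whole thumbnail result pages?
# All of these numbers are assumptions for illustration only.

CACHE_READ_MS = 1.0        # assumed cost to fetch a cached page of thumbnails
REGENERATE_MS = 20.0       # assumed cost to rebuild the page with parallel disk reads
HIT_RATE = 0.3             # assumed fraction of requests that repeat recently

def expected_latency_with_cache(hit_rate=HIT_RATE):
    # Hits pay the cache read; misses pay the cache check plus a full regeneration.
    return hit_rate * CACHE_READ_MS + (1 - hit_rate) * (CACHE_READ_MS + REGENERATE_MS)

def expected_latency_without_cache():
    return REGENERATE_MS

if __name__ == "__main__":
    print(f"no cache:   ~{expected_latency_without_cache():.1f} ms per page")
    print(f"with cache: ~{expected_latency_with_cache():.1f} ms per page")
```

The variations he mentions, caching single thumbnails versus whole sets or pre-computing them, are just different costs and hit rates plugged into the same couple of lines.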
Another thing is to design for growth: try to anticipate which parameters of your system are actually going to grow, and by how much, over time. Obviously you can't do this perfectly; if you could, you'd predict the future and we'd all be better off. But I've often found that it's not really useful to design a system so that it can scale infinitely. If you think about the in-memory index example, our original disk-based index evolved pretty well, but once the number of queries we were handling per day crossed a threshold that was about 100x what it handled originally, a very different design made sense, namely not having any of the index data on disk and keeping it all in memory. So when you have parameters that change by two orders of magnitude, that probably means the original design is not going to be the right one at the much larger scale.

Another common pattern that crops up when you're building a distributed system is the temptation to build completely distributed state, as opposed to having a centralized component. It turns out you can actually go quite far in a lot of distributed systems with a centralized component that holds some amount of state, and that makes a lot of things easier. Good examples of this are GFS, where we have a centralized metadata master; BigTable, which I didn't talk about, but which has a centralized master that coordinates load balancing across a bunch of machines; MapReduce, which has a master for coordinating the handing out of map tasks; and various other examples. It's important to keep the master from being a scalability bottleneck, so you don't want it involved in very frequent operations, but for oversight of a much larger system this approach works pretty well and will actually scale to thousands or ten thousand workers, though probably not a hundred thousand machines, and you do have to design carefully to keep the master out of the common-case operations. As a benefit of having a centralized component, it's much simpler to reason about the state of the system, and you also get a nice centralized place to hook on status information, a place you can go to find out what the state of the system is. You often want a hot standby of the master, because if that single machine fails, you want to be able to recover quickly.

Anyway, I'll go through this quickly. Obviously, when you're sending a request to lots of machines, you don't want the network interface or the CPU on the root node that's sending the requests to become a bottleneck. That wide fan-in tends to cause TCP packet drops and retransmits, which adds significant latency, and the CPU cost of sending out the requests or processing the responses at the root can be a big bottleneck. So introducing a tree is a good idea: each parent is responsible for sending the request to a subset of the leaves in the system. The other benefit is that the CPU cost of processing responses gets spread across a whole bunch of parents, and the network congestion is much lower, especially if a parent can take some of the data sent by the leaves and filter it in some way. In our search systems, for example, the leaves each generate their best 10 or 15 responses, and then the parent sends back only the best 20 or 30 from the 30 leaves it's responsible for. That's quite a bit of reduction in the amount of data that gets sent up to the root, compared to the root talking directly to the leaves. It also gives you an opportunity to co-locate responsibility, so that a parent on a given rack is responsible for talking to all the leaves on that rack, and that can help if your network has interesting properties where you don't have full bisection bandwidth but do have more bandwidth within a rack of machines than across racks.
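As a rough illustration of that parent/leaf filtering step, here is a minimal Python sketch. The shard data, the placeholder scoring function, and the counts (top 15 per leaf, top 30 at the parent) are assumptions for illustration, and the leaf calls are shown serially where a real system would use parallel RPCs.

```python
import heapq

# Toy model of the fan-in tree described above: each leaf returns its best
# few (score, doc_id) results, and the parent merges them and keeps only the
# overall best before passing them up toward the root.

def score(doc, query):
    # Placeholder ranking function: count query-term occurrences.
    return sum(doc.count(term) for term in query.split())

def leaf_search(shard, query, top_n=15):
    """Score every document in this leaf's shard and return its top results."""
    scored = [(score(doc, query), doc_id) for doc_id, doc in shard.items()]
    return heapq.nlargest(top_n, scored)

def parent_search(leaf_shards, query, top_n=30):
    """Fan the query out to the leaves this parent owns, then keep the best."""
    all_results = []
    for shard in leaf_shards:               # parallel RPCs in a real system
        all_results.extend(leaf_search(shard, query))
    return heapq.nlargest(top_n, all_results)

if __name__ == "__main__":
    shards = [{f"doc{i}-{j}": "cat dog " * (j % 5) for j in range(100)}
              for i in range(30)]           # 30 leaves with made-up documents
    print(parent_search(shards, "cat")[:5])
```

The filtering at each parent is what keeps the data flowing up the tree small, which is the whole point of the pattern.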
Another pattern that crops up is using backup requests to minimize latency. I described this briefly for MapReduce, where we spawn backup map tasks near the end of the computation and take the first one that finishes. It's also useful in fairly fine-grained settings like query serving systems: if you have multiple replicas, then toward the end of sending a query out to all thousand leaves, once you've heard back from 995 of them and you have replicas for the other five, why not send five extra requests to the replicas and see which ones respond first? The granularity there is milliseconds, as opposed to map tasks, where the granularity is many seconds. (A small sketch of this pattern appears a little further down.)

Another good idea is multiple smaller units of work per machine. You want to minimize recovery time when machines crash, so if a machine is responsible for a single monolithic piece of work, that's bad, because some other machine will have to load up state proportional to that entire monolithic thing, and it also makes things fairly inflexible for load-balancing purposes: if this machine is overloaded because its particular chunk of work is slightly more expensive than some other chunk, you don't have any flexibility, because it's going to be slow on some other machine too. If instead you give ten, twenty, or a hundred pieces of work to a given machine, you get both fast recovery, because if this machine dies, nine other machines can each pick up one of its chunks and recover in about a ninth of the time the monolithic approach would take, and load balancing, because you can move one of those chunks off an overloaded machine and it now runs something like 9/8 as fast. This is present in a lot of our systems.

A final pattern to think about is elastic systems. It's often difficult to figure out what your peak load is going to be; it changes throughout the day, and you can get things like denial-of-service attacks where someone suddenly floods you with queries you didn't expect. So you want to design the system to adapt. Ideally it should automatically shrink capacity at night when traffic is low and grow it back during the day when it's high, but that takes a little bit of time, so you also want to make the system resilient when you do get overloaded unexpectedly: you want to do something reasonable even when you get twice as much load as you were expecting. There are lots of ways to do this in different systems. In web search, for example, you can stop searching the full index and just search the first so many billion documents, chopping off a little bit of the tail; your quality will suffer a little, but you're overloaded, and it's better to return responses than not. You can drop the spelling-correction tips if the spelling-correction system suddenly starts being slow. Another thing that helps under overload is to employ more aggressive load balancing as the imbalance becomes more severe, whereas when things are roughly balanced you don't need to be as aggressive about it. I'm going to skip this slide in the interest of time and go on to two final thoughts.
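Going back to the backup-request idea above, here is a minimal sketch of the pattern, using Python threads to stand in for RPCs to two replicas. The simulated replica latencies and the 10 ms hedge delay are made-up values; a production system would also cancel the losing request rather than let it run to completion.

```python
import concurrent.futures as futures
import random, time

# Minimal sketch of the backup-request pattern: ask one replica, and if it
# hasn't answered within a short hedge delay, ask a second replica and take
# whichever answer comes back first.

def query_replica(replica_id, query):
    # Stand-in for an RPC: usually fast, occasionally a slow straggler.
    time.sleep(random.choice([0.002, 0.003, 0.200]))
    return f"result for {query!r} from replica {replica_id}"

def query_with_backup(query, replicas=(0, 1), hedge_delay=0.010):
    with futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        pending = {pool.submit(query_replica, replicas[0], query)}
        done, pending = futures.wait(pending, timeout=hedge_delay)
        if not done:  # primary is slow: fire the backup request
            pending.add(pool.submit(query_replica, replicas[1], query))
            done, pending = futures.wait(pending,
                                         return_when=futures.FIRST_COMPLETED)
        # Note: the losing request is simply abandoned here; a real system
        # would usually cancel it to avoid wasted work.
        return next(iter(done)).result()

if __name__ == "__main__":
    print(query_with_backup("cheap flights"))
```

The same structure covers both cases in the talk: backup map tasks at a granularity of seconds, and hedged leaf queries at a granularity of milliseconds.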
I think one of the really exciting things about the period we're in now is that we have really large collections of computational power in single locations and very interesting datasets available: large fractions of the world's knowledge are now online, there are all kinds of interesting new datasets, and there's a proliferation of more powerful client devices that can interact with these data-center-based services in interesting ways. That brings about lots of interesting opportunities for new kinds of research. Thank you. I put together a list of papers that cover some of this material, but if you go to labs.google.com and look at the papers section there are a bunch of other ones there too. I'll take questions; right in front, right here.

When applications are deployed to data centers, do they express their demands in terms of a small number of basic resources, like "I need so many CPUs, so many disks," or is it more complicated than that? So the question is how application resource requirements are actually specified in our scheduling system. You ask for a certain number of tasks, and each task has a set of properties, like "I need two CPU cores" and "I need this many megabytes or gigabytes of memory," and the scheduling system then tries to fit all of these onto machines; it's basically a big bin-packing problem. That's the level at which people schedule things, and on top of that people build higher-level things where you don't have to specify quite as much. And yes, it is just a handful of resources: memory, CPU, network, and disk are the main ones.
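As a toy illustration of that bin-packing view of scheduling, here is a minimal first-fit sketch in Python. The task names and machine shapes are invented, and first-fit is just the simplest packing heuristic, not a claim about how the real scheduler works; real schedulers also deal with priorities, constraints, and failures.

```python
# Toy first-fit packing of tasks onto machines, in the spirit of the
# "big bin packing problem" described above. All shapes are made up.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Machine:
    free_cpu: float                 # cores still available
    free_ram_gb: float              # memory still available
    tasks: List[str] = field(default_factory=list)

def first_fit(tasks, machines):
    """Place each task on the first machine with enough spare CPU and RAM."""
    unplaced = []
    for name, cpu, ram_gb in tasks:
        for m in machines:
            if m.free_cpu >= cpu and m.free_ram_gb >= ram_gb:
                m.free_cpu -= cpu
                m.free_ram_gb -= ram_gb
                m.tasks.append(name)
                break
        else:
            unplaced.append(name)   # no machine could fit this task
    return unplaced

if __name__ == "__main__":
    machines = [Machine(free_cpu=8, free_ram_gb=16) for _ in range(3)]
    tasks = [("search-leaf", 2, 8), ("ads-mixer", 4, 4), ("log-saver", 1, 2)]
    leftovers = first_fit(tasks, machines)
    print([m.tasks for m in machines], "unplaced:", leftovers)
```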
Is the resource management very node-centric? Node-centric; sorry, I was hoping you said nerd-centric. But yes, it's interesting. I didn't talk a lot about the networking, because I tend to work on system-level software, but there are lots of interesting networking issues, and over time the networks we've been building in our data centers have grown in capabilities and bandwidth and come down in latency. Do you try to coordinate with the networking side, or does that go on in parallel? It goes on in parallel, but we keep each other informed about what we want to see and expect to see in the network a year from now, and how we can take advantage of that.

What are the big infrastructure problems, the things that keep you awake at night going forward? One of them, which Spanner is partly a response to, is that it's very hard to operate a service that needs to run in multiple data centers, because a lot of the infrastructure we've built to this point has been very data-center-centric. We've cobbled together a bunch of tools on the side that solve some aspects of cross-data-center service deployment: there's one tool that helps you monitor your jobs in multiple data centers, another that helps you transfer files around, and I think that's caused a fair amount of complexity, because there are lots of different tools that each solve one aspect of the larger problem, and that means more complexity for the people trying to do those deployments. So I think simplification in this area is a big one.

The volume of data on the web is growing, and the amount of information people can store is also growing; how do those trends compare? Are we reaching a point where you can put the whole web in a room, or is it the other way, where it's impossible to buy enough machines to keep up? So the question is how the growth of the web compares with what our friends in the disk drive and flash hardware industries can deliver. I think it depends on how you define the web. The textual web is actually not growing as fast as device capabilities are, so you can use a few hundred disk drives now and keep a large fraction of the web in one cabinet or something. If you include video, though, there's a proliferation of high-quality video devices with a tremendous propensity to generate very high bandwidth requirements and very large amounts of data, so I think that will continue to be a big issue for the foreseeable future, because video will generate a lot more data than the devices are really ready to deal with. You will not have all the world's video in your pocket anytime soon, let me put it that way.

All right, over there: how will we pay the electricity bills in 20 years? Probably through ads. But I think power usage within the data centers is a big issue, and there's a lot of both software and hardware work that can go on there: trying to generate idleness, trying to make systems more energy-proportional, so that if you're using half of a machine's CPU it uses half the power; right now it's more like 70 or 80 percent. A bunch of factors like that can help, and looking at lower-power CPUs is another interesting area. All of these things together, I think, will make it so that we can pay the electricity bills, but it's not an easy thing.

What do you do when computing PageRank with MapReduce? There are a lot of optimizations, but many of them aren't automatic, and I haven't seen much from Google in particular about what you actually do. Well, there are a lot of things we don't necessarily publish very detailed papers on. It is possible to compute PageRank with MapReduce; it's just an iterative process, so in each iteration you essentially read through the link graph, flow the weights for that iteration, and then generate the outputs. It's not necessarily the best tool for that particular task, so you can build a more specialized system, and if computing PageRank very fast is very important to you, then you might not use MapReduce. I think MapReduce is pretty effective where the model fits really well, or where you're trying to get something quick and dirty running and you don't want to worry about optimizing the heck out of it. Once you have a system where you know that, say, PageRank needs a thousand nodes to compute, then it makes sense to focus a little more optimization effort and maybe build something more customized.

Do you have patterns like a map-reduce-reduce-reduce chain? So the question is whether we have map-reduce-reduce chains. We would tend to implement that as a sequence of MapReduces, where the subsequent map tasks essentially just pass data through to the reduce tasks, and a pretty common pattern is to have a whole chained sequence of MapReduces that each do some part of a larger algorithm in a sequence of steps. What about the I/O between the writes and the reads? That's okay; sometimes you might write the output to a cheaper storage system, for example if it's intermediate data and we know we could restart the whole computation from scratch.
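To make the "read the link graph, flow the weights, generate the outputs" description concrete, here is a minimal in-memory sketch of one PageRank iteration written as a map phase and a reduce phase, the kind of step you would chain in a sequence of MapReduces. The tiny graph, the 0.85 damping factor, and the helper names are illustrative assumptions, not Google's implementation.

```python
from collections import defaultdict

# One PageRank iteration in map/reduce style, as sketched in the answer above.
# Everything here (the graph, the damping factor, the helper names) is an
# illustrative assumption.

DAMPING = 0.85

def map_phase(links, ranks):
    """For each page, flow its current rank out along its outgoing links."""
    contributions = []
    for page, outlinks in links.items():
        for target in outlinks:
            contributions.append((target, ranks[page] / len(outlinks)))
    return contributions

def reduce_phase(contributions, num_pages):
    """Sum the contributions arriving at each page and apply damping."""
    totals = defaultdict(float)
    for page, weight in contributions:
        totals[page] += weight
    base = (1 - DAMPING) / num_pages
    return {page: base + DAMPING * total for page, total in totals.items()}

if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(10):   # in practice the chain of MapReduces runs until convergence
        ranks = reduce_phase(map_phase(links, ranks), len(links))
    print(ranks)
```

In a real MapReduce chain, each iteration's output would be written out and read back as the next iteration's input, which is exactly the extra I/O the questioner was asking about and part of why a specialized system can be faster.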
What about Google Instant, and what did you have to do in the back-end infrastructure for it? Google Instant is basically a system that, in the background, predicts what query you're actually trying to issue when you've typed a few characters of it, and then prefetches the results. From a web-search standpoint it's essentially just certain kinds of extra requests. One of the things we do, and I mentioned making your system resilient to overload, is that those requests come tagged with a bit that says they are predictive prefetches, not something the user actually pressed Enter on, so if we get overloaded we can drop those requests. Other than that it's basically just capacity: you need more machines to do it. It turns out the predictive traffic does cache reasonably well, because you're essentially predicting queries that users have issued in the past, but it's mostly just a resource issue, not a fundamental change in the underlying infrastructure. Okay, one last one.

What is your experience with using distributed transactions, and can you do them within a single data center? So the question is what our experience has been with distributed transactions. We don't have a huge amount of experience with that; a lot of the infrastructure we've built in the past has avoided implementing distributed transactions. BigTable, for example, has single-row transactions but not distributed transactions. In retrospect I think that was a mistake; we probably should have added them, because what ended up happening is that a lot of people did want distributed transactions, so they hand-rolled their own protocols, sometimes incorrectly, and it would have been better to build it into the infrastructure. In Spanner, for example, we do have distributed transactions. We don't have a lot of experience with them yet, but there are actually places in the BigTable implementation where they would have been useful, because we also hand-rolled our own protocols for a couple of the subtle features in there, and it would have been better to just have that available to everyone. Okay. For more, please visit us at stanford.edu
Info
Channel: Stanford
Views: 180,105
Rating: 4.9463363 out of 5
Keywords: software engineering, computer science, mathematics, technology, optimization, efficiency, algorithm, google, web search, documents, index, search, query, speed, machine, information, latency
Id: modXC5IWTJI
Length: 82min 45sec (4965 seconds)
Published: Thu Jun 02 2011