"Keeping Time in Real Systems" by Kavya Joshi


Captions
Good morning, how's everybody doing today? Cool. My name is Kavya, and I'm here today to talk to you about time in our systems, specifically the timekeeping mechanisms they use. Now, first things first: what even is time, man? I kid, we're not going to go down that path today, but I'm happy to talk physics and philosophy some other day, preferably over red wine. For our purposes today, our intuitive notions of time will suffice: namely, that it moves forwards, and that we use it to sequence events and to make measurements.

Now, we use time extensively in our systems. You don't need me to stand here and tell you this, but rarely do we think about the underlying timekeeping mechanisms at play. What do I mean? Well, here's an example. Say we have a simple distributed key-value store; everybody here familiar with the idea of a distributed database? Cool. So we have three nodes, and we're going to assume that there are no failures and all our operations succeed. Now say I come along and I put key equal to V1. A few seconds later I change my mind and I put key equal to V2. Then you come along and you perform a get for key. What's the value you get here? It's V2, right? Any takers for V1? Yes. So the final value you get depends on the consistency model of the data store: are we using MySQL, or are we using Cassandra? And I haven't told you anything about the consistency model.

But wait a minute, this talk was supposed to be about time and timekeeping; why are we talking about consistency models? Well, a consistency model can be thought of as the set of guarantees the system provides about what events are visible, and when. So it's about time, in that the consistency model encapsulates the set of valid timelines for events in our system. And these guarantees are informed by, and enforced by, the timekeeping mechanisms the system uses. So this is what we're interested in today: what these mechanisms are, and how they affect the properties of the system. The rest of the talk is structured as follows: I will first talk about computer clocks and physical time; we'll then talk about other timekeeping mechanisms in the context of Spanner and Riak, two practical systems; and finally we'll take a step back and wrap up.

All right then, first things first: computer clocks. Let's first concretize the model we're working with. It's a distributed data store, which means it's a data store spread out over several nodes. The features we care about are that each node operates sequentially, and that our nodes are separated: there's no shared memory, they are separated by a network, and they communicate by message passing. Now, the reason we want a distributed store is for properties like fault tolerance and scalability for performance, which means our data is partitioned across these nodes and replicated. OK, so this is the model we're interested in, and our question is: given that this is our system, how do we keep time across it?

Well, let's see. Our computers have clocks; my computer has a clock; can we just use them? The party-line answers to these questions go like this: hardware clocks drift; NTP, man, I tell you, NTP; the system clock, it keeps Unix time. But what are the real problems here? What are the properties we want from our timekeeping mechanism that these don't provide? To answer that question we're going to delve into the system clock and Unix time to see how they work. One important caveat before we begin: the details vary by pretty much everything, but the details don't matter today. We're after the higher-level concepts, and those remain pretty much the same. That said, we'll assume we're running Linux on an x86 processor.
OK, so let's start with the first one: hardware clock drift. Now, what is this hardware, and why does it matter? Starting from the top: when you run a command like date, or call something like time.now() from your favorite programming language, this translates to a system call to get the value of a particular computer clock. Now, we have different computer clocks. We have the system clock, or the wall clock; this is the clock we're familiar with, and it keeps Unix time. But there are other clocks: there's the monotonic clock, there's the monotonic raw clock, and there are others. All of these computer clocks are software based, but they are run by a hardware clock, or hardware clocks, in conjunction with the operating system kernel.

Let's look at an example of how this works in practice, using the system clock. The system clock is simply a counter, and it's run by the hardware and the OS kernel. The way this works is that at system boot it's set to a value, a reading, based on a hardware clock; that hardware clock is usually the real-time clock, which is a tiny chip on the motherboard. Subsequently this counter is maintained, incremented, by the kernel arranging for a hardware timer to deliver interrupts periodically. Conceptually, this is like the kernel telling the HPET, or a different hardware timer, "hey, interrupt me in 10 milliseconds", so that when the interrupt is delivered the kernel knows 10 milliseconds have passed and increments the counter by that much. Now, modern Linux kernels (well, not all Linux kernels, but modern ones) are tickless kernels, which means the interrupt interval is calculated dynamically. All right, pretty simple. So this is the system clock, and these are the hardware clocks, and they can drift. When they drift, what it means is that different computer clocks change at different rates. So this is the real problem: our clocks are not synchronized.

Let's move on to the second one: NTP. Everybody here familiar with NTP? OK, so it's basically a clock synchronization protocol. It tries to make it so our system clocks remain synchronized with a highly accurate clock network, and what's special about that clock network is that the hardware powering it consists of very accurate clocks, like GPS and atomic clocks, that have minimal drift. Now, the way it synchronizes our computer clocks, the system clocks, is that it has two modes of operation: it can gradually increase or decrease the interrupt rate, which is called skewing the clock, and this is good; or it can simply jump the clock, step it, set it to a new value, and this is bad, because it means the time our clocks keep can be discontinuous. There are other problems with NTP too. For one, you need trusted and reachable NTP servers, and for another, NTP is slow on the public internet; it can take up to hundreds of milliseconds, and this matters when you're trying to keep things consistent in a data store.

And lastly, let's talk about the system clock, which keeps Unix time. We sort of know what Unix time is: it's the number of seconds since the epoch, which is some point in time defined with respect to the UTC time standard. The other property of Unix time is that it counts a day as having exactly 86,400 seconds.
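To make the difference between these clocks concrete, here's a minimal Go sketch (my own illustration, not from the talk): the wall clock is the one NTP may skew or step, while the monotonic clock only moves forward, which is why durations should be measured with the latter. In Go, time.Now() happens to carry both readings.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// time.Now() returns a wall-clock reading (Unix time) that, on most
	// platforms, also carries a monotonic-clock reading.
	start := time.Now()

	time.Sleep(2 * time.Second)

	// Unix() exposes the wall clock: seconds since the epoch. NTP may
	// skew or even step this value while the program runs.
	fmt.Println("wall clock (seconds since epoch):", time.Now().Unix())

	// Since() subtracts using the monotonic reading, so the measured
	// duration stays correct even if the wall clock jumped in between.
	fmt.Println("elapsed (monotonic):", time.Since(start))
}
```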
So if we wanted to know the Unix time a thousand days after the epoch, we could simply do 1000 times 86,400, and that would give us the Unix time. Seems pretty reasonable, except UTC days are not a constant 86,400 seconds. Wait, what? Well, we're going to have to take a brief interlude to talk about UTC; I promise I'll keep it short. The idea behind UTC is somewhat bewildering: it's this messy compromise of a time standard between atomic time, which is kept by atomic clocks, and astronomical time, which was the old-school way of calculating time based on the Earth's rotation. Why do we want a time standard in between the two? Well, they both have nice properties. Atomic clocks are very regular, very stable, but astronomical time has the nice property of matching up with the Earth's rotation, and I'm told this is nice for certain things like navigation (I'm not an astronomer, so unclear). Anyway, UTC decides to be based on atomic time and to periodically adjust itself to sync with the Earth's rotation. So far it doesn't sound too bad either, until you learn that the Earth's rotation gradually slows down with time. So over time an Earth day takes slightly longer than 86,400 seconds, and to compensate for this drift UTC does this remarkable thing of adding a second every now and then. Typically at midnight, UTC inserts what is called a leap second, which you may have heard of; it caused a recent Cloudflare outage that was quite large. The way the leap second works is that instead of going from 23:59:59 straight to midnight, the time goes to 23:59:60 and then to midnight.

OK, interlude over, back to Unix time. So a UTC day can be either of those lengths, but Unix time, kept by the system clock, doesn't know how to represent this extra second. In the long run, though, we want our computer's idea of the current time to keep up and remain aligned with UTC; we don't want it to have its own sense of what the time is. So what does Unix time do? Well, in the case of a leap second, Unix time simply repeats a second, which means the time kept by our system clock is not monotonic.

So this is the real set of problems with using computer clocks: computer clocks across different computers do not give us synchronized, monotonic time. Why does this matter? Well, let's see with an example. Say we have two nodes. All my representations in this talk are going to be space-time diagrams, where time moves from left to right, a node's timeline is represented by a horizontal line, and events against the system are upwards and downwards arrows; pretty standard. OK, so, two nodes. Now say I come along and I put key equal to V1, but node 1 happens to be unsynchronized, it's a fast node, so it timestamps the write with 150. A few seconds later I perform another put, which ends up at node 2: I put key equal to V2, but node 2, which maybe has the right time, or maybe is slow, timestamps it with a lower value. So although I have a newer version of the data, it's stamped with an older timestamp. So not only is our system inconsistent, we're pretty much done for, game over: we can never reconcile this inconsistency. Which is why, and this brings us to the need for other timekeeping mechanisms, our systems need to provide abstractions on top of physical time as we know it in order to function correctly.
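To see how that plays out in code, here's a small hedged Go sketch (my own illustration with a made-up lastWriteWins helper, not anything from the talk): reconciling by wall-clock timestamp lets the node with the fast clock win with the older value.

```go
package main

import "fmt"

// write is a versioned value stamped with the local wall clock of
// whichever node handled it.
type write struct {
	value string
	ts    int64 // wall-clock timestamp, not synchronized across nodes
}

// lastWriteWins keeps whichever write carries the larger timestamp.
func lastWriteWins(a, b write) write {
	if b.ts > a.ts {
		return b
	}
	return a
}

func main() {
	// Node 1's clock runs fast: the older write gets timestamp 150.
	v1 := write{value: "v1", ts: 150}
	// Node 2's clock is behind: the newer write gets timestamp 100.
	v2 := write{value: "v2", ts: 100}

	// Reconciling by timestamp resurrects the stale value v1, even
	// though v2 was written later in real time.
	fmt.Println(lastWriteWins(v1, v2).value) // prints "v1"
}
```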
So what are these other timekeeping mechanisms? Well, it depends. It depends, as we alluded to in the beginning, on the consistency model the system chooses to provide: what guarantees it wants to make about how events will be ordered and when they will be visible. But this decision does not have a single dimension, because consistency comes with costs. For one, you have availability costs. Availability is, very informally, the notion of how responsive the system is, and typically higher consistency means lower availability; this is the CAP theorem at play, sort of. Consistency also has a performance cost, where performance specifically means read and write latency, and therefore the throughput of the system. So while building these systems you typically have to make trade-offs: you have to turn these knobs of consistency, availability, and performance up and down. The timekeeping mechanism is therefore also dependent on these three factors. This relation between the timekeeping mechanism and these properties of the system is something we'll come back to throughout the talk. For now, though, enough theoretical preamble; let's get to real systems.

The first system we're going to talk about today is a strongly consistent data store built by Google called Spanner. How many people are familiar with, or have heard of, Spanner? Cool. So, as you know then, Spanner is a distributed relational database: it allows for complex schemas and distributed transactions, which is pretty neat. Spanner is horizontally scalable, so your data, say the savings table in a banking system, may live on one partition and the checking table on another partition. It's geo-replicated for fault tolerance. And Spanner really cares about being performant; this is an important design consideration to note. Finally, the consistency model Spanner provides is a very strong form of consistency called external consistency. The definition is a mouthful, "a globally consistent ordering of transactions that matches the observed commit order", so to see what that means let's just work through an example.

Say we have a Spanner setup: we have our savings and our checking tables on two partitions, each replicated, and the application has a constraint that the minimum total balance across a user's checking and savings accounts must be 200. At the beginning of the world my account meets that criterion, and then I perform a deposit into my savings account; this is the first transaction, T1. A few seconds, or maybe a few hours, later I make a debit from my checking account, and this is T2. Now, what do we want from this system? Well, we want it to never be the case that T2, the debit, is visible but not T1, the deposit. This is the essence captured by a globally consistent transaction order: T1 and T2 are transactions operating on different partitions of the key space, and yet despite that we want them to be ordered in the order observed by me, external to the system, that is, the commit order that I observed. Now, why is this complicated, why is this tricky? Well, imagine this: one partition is in the US,
so T1 goes to the US partition, and the other partition is in Europe, so T2 ends up in the Europe data center. And the auditor for the bank, say the bank's account balance police, is sitting in Europe, checking that balances meet that criterion. Now, if they read from the Europe data center, it's entirely possible that they might see T2 before T1, and we don't want that to be the case, because then I would be in trouble and the system would be incorrect as per the spec. So that is the idea of external consistency.

OK, the other thing we care about is performance. You can imagine one way to provide external consistency, which is just putting a giant lock over the entire database, but that would sort of defeat the purpose of a distributed database. So what does this have to do with time again? Well, taking those lofty goals and translating them into a set of requirements, what we're saying we want is: a consistent timeline across replicas; the ability to order transactions across partitions as well; and an order that corresponds to the observed commit order. If we're able to satisfy these three requirements with whatever timekeeping mechanism we choose, then we obviously get the desired consistency guarantees, but we also get the desired performance. We're not going to go into the details in this talk, but what it allows us to do is implement things like reads from replicas and consistent snapshot reads, and that's what gives us the performance we want.

All right, so let's start at the beginning, with the first requirement, a consistent timeline across replicas. We're not going to dwell on this, because there are several protocols we're all, I take it, familiar with: consensus. We can use consensus protocols, Paxos, Raft, three-phase commit, two-phase commit, what have you, to synchronously replicate, and that gives us a consistent timeline across replicas.

Let's talk about the second and third properties, which are more interesting. What we want is for T1 to be ordered before T2 even if T1 and T2 are across the globe. So let's think of one way to do it, namely using commit timestamps. The idea of a commit timestamp is that when your transaction commits, you assign a timestamp to that version of the data. Spanner uses MVCC, so it keeps around multiple versions of the data, and when the transaction commits you timestamp that version of the data with what's called the commit timestamp. Would that work? Well, let's see. If we had perfectly synchronized clocks it would: T1 would be timestamped with 50, T2 would be timestamped with 100, and then we could order them based on their timestamps and we'd be all set. But we don't have synchronized clocks. So what else could we do? Well, if we knew exactly what the uncertainty, the error, of our clocks was, we could take that into account and still use commit timestamps.

And this is what Spanner's TrueTime does. TrueTime is both the API and the infrastructure that provides what is effectively a globally synchronized clock with a bounded but non-zero error. Conceptually, TrueTime both tracks and exposes the uncertainty about perceived time across the system's clocks. Here's the idea: say we have an event; it occurs at an absolute time, a true time, that our system clocks cannot know. They cannot know it because of drift, because of their error margins. But if we know the error margin, we can come up with an interval, a bound, that is guaranteed to contain true time. So in this case small t is true time, and the bracket around it represents our interval. Does this make sense so far? Cool. So this is what TrueTime does: it represents time as an interval, not a point. For example, if you ask TrueTime "give me now", it returns an interval that is guaranteed to contain true now: earliest is the earliest time that could be true now, and latest is the latest time that could be true now. So here we're still talking about physical time, but it's physical time augmented with uncertainty information.
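As a rough sketch of that interval idea (my own illustration in Go; Interval and ttNow are made-up names, not the real TrueTime API), "now" can be modeled as the local clock reading padded on both sides by the currently known error bound:

```go
package main

import (
	"fmt"
	"time"
)

// Interval is a TrueTime-style answer to "what time is it?": a window
// guaranteed to contain the true, absolute time.
type Interval struct {
	Earliest time.Time
	Latest   time.Time
}

// ttNow pads the local clock reading with the uncertainty bound epsilon.
// Real infrastructure would track epsilon based on the time elapsed since
// the last synchronization with GPS or atomic reference clocks.
func ttNow(epsilon time.Duration) Interval {
	local := time.Now()
	return Interval{
		Earliest: local.Add(-epsilon),
		Latest:   local.Add(epsilon),
	}
}

func main() {
	// Suppose our clock is known to be within 7ms of true time.
	now := ttNow(7 * time.Millisecond)
	fmt.Println("true now lies somewhere between", now.Earliest, "and", now.Latest)
}
```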
OK, so now that we have a way to know the uncertainty of our clocks, let's account for it. What we want is for T1's commit timestamp to be guaranteed to be less than T2's commit timestamp, and the way this works in Spanner is as follows. When T1 arrives at its partition, the node sets the commit timestamp to the latest time of the interval that represents now, and then it waits: it waits a full uncertainty window, and only after it has waited out that window does it actually go through with the commit. At that point the changes it made become visible to the rest of the nodes, so at that point it commits and replies to the client. Why does it wait? What the commit wait does is guarantee that the next transaction will have a higher timestamp: it forces the client to wait, and therefore forces the next transaction to be serialized at least one uncertainty window ahead. What this looks like in practice is that when T2 arrives at the second partition, that node assigns its commit timestamp in the same way, but note that now has by then moved along one whole uncertainty window, and so T2's timestamp is guaranteed to be greater than T1's. And why do we have to wait that full uncertainty window? Because for up to that long, say our uncertainty window is 10 milliseconds, there can be up to 10 milliseconds of error between my clock and another node's clock; going back to our original example with the two nodes, a later transaction could otherwise arrive at a node with a slower clock and still be assigned an earlier timestamp. (Time's confusing, I know.) So what Spanner essentially does is wait out the uncertainty.
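Here's a minimal sketch of that commit-wait idea, continuing the made-up ttNow interval from the sketch above (again my own hedged illustration, not Spanner's actual code): pick the latest time that could be "now" as the commit timestamp, then block until even the earliest time that could be "now" has passed it.

```go
package main

import (
	"fmt"
	"time"
)

// interval is the TrueTime-style [earliest, latest] window from the
// previous sketch, guaranteed to contain true time.
type interval struct{ earliest, latest time.Time }

// ttNow models a local clock whose error is bounded by epsilon.
func ttNow(epsilon time.Duration) interval {
	local := time.Now()
	return interval{earliest: local.Add(-epsilon), latest: local.Add(epsilon)}
}

// commitWait assigns a commit timestamp and then waits out the
// uncertainty window. Once ttNow().earliest has passed the chosen
// timestamp, true time has definitely passed it too, so any transaction
// that starts afterwards must be assigned a strictly larger timestamp.
func commitWait(epsilon time.Duration) time.Time {
	ts := ttNow(epsilon).latest // commit timestamp = latest possible "now"
	for ttNow(epsilon).earliest.Before(ts) {
		time.Sleep(time.Millisecond)
	}
	// Only now are the writes applied and the reply sent to the client.
	return ts
}

func main() {
	fmt.Println("committed at:", commitWait(7*time.Millisecond))
}
```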
Spanner applies the same concept to reads as well. We're not going to talk about reads today in the interest of time, but it uses the same idea: it knows the uncertainty window and takes it into account to provide strong reads. OK, so this is really, really neat: TrueTime enables externally consistent commit timestamps, which in turn enable external consistency without coordination. But TrueTime is not perfect. The uncertainty window affects the commit wait time, and so it affects the write latency and therefore the throughput of the system, which is why you need to go to extraordinary lengths to keep that uncertainty window very small. Google uses a lot of very expensive and very impressive infrastructure, and a special clock synchronization protocol, to keep the system clocks very tightly synchronized. So whereas NTP error can go up to 100 milliseconds, TrueTime bounds it to seven milliseconds, which is pretty impressive, and this is also why, no, you can't have it. Cool, so that's about TrueTime.

Now let's hop over to the other side of the CAP theorem fence and talk about a weakly consistent store, Riak. Have folks here used Riak? A few, yes? Cool. So Riak is a distributed key-value store. It does not support complex schemas, it just maps a key to a blob, and it doesn't have a notion of multi-key transactions. Riak chooses first and foremost to be a highly available data store; that is, it chooses to be able to perform reads and writes, and it prioritizes that over everything else, which again is an important design consideration to note. To that end, the consistency guarantees Riak makes are very weak. It's a specific flavor of weak consistency called eventual consistency, and the idea is that replicas are allowed to diverge for a period, and they will eventually converge. With an example: say we have three replicas. A user comes along and puts cart equal to {A}; that ends up at node 1. The user subsequently updates the cart to contain D, and that ends up at node 2. Now, for a period, a subsequent get for the cart can return the empty cart if the request is served by node 3, or it can get the cart with value A, or it can get the cart with D, and this is totally fine; in fact, this is exactly what the consistency guarantee the system makes to you specifies. Eventually, though, all the replicas must converge, and they must converge to the last updated value, which is the cart containing D.

And this is where timekeeping comes into play. Our timekeeping mechanism specifically needs to be able to determine causal updates, so the system can converge to the last updated value: it needs to know that the update to D was aware of cart {A} and overwrote it, so the two are causally related. Now, the other thing we said we cared about was availability, so consensus is not an option here; we want not just one replica but all replicas, or many replicas, to be able to accept reads and writes and perform them. And this in turn means that the second thing our timekeeping mechanism needs to be able to do is determine conflicting updates. What do I mean by conflicting updates? Well, here's an example: say, after the first user takes the empty cart and makes it the cart containing A, a different user comes along and takes the empty cart and makes it the cart containing B. These two updates are effectively concurrent: they didn't know about each other, and therefore they conflict. So we need the timekeeping mechanism to be able to identify updates that are causal versus conflicting.

Well, let's see, can we use timestamps here, physical timestamps? Sure, why not? Only one problem: we don't have synchronized clocks, and we don't have TrueTime, because not everybody is Google. So we need to come up with a different timekeeping mechanism, which brings us to logical time, in the form of vector clocks. Vector clocks are logical clocks, in that they don't track physical time; they track logical time, which is entirely divorced from the idea of physical time. You can think of it as keeping versions and event counts, and timestamping versions of the data with vector clocks. So this is the example we're going to be working with: we have a conflicting update, A and B are conflicting, and we have a causal update, A and D are causal; this is the example we just saw.

So let's see how vector clocks work. At the beginning of the world, a vector clock is simply an array of N counters, one counter per node, and each node in the system holds on to its own full vector clock. At the beginning of the world they're all set to zero. Each time a node observes or performs an event, it increments its own count in its vector clock by one. So there's a put on node 1, and node 1's count goes from zero to one; node 1 has another event, so its vector clock goes up again. And what happens here, what Riak does, is that the get not only returns the object, the cart; it also passes along the vector clock, the timestamp, in a causal context, which is really just a header of the response. So it returns both the clock and the cart. Now the user is going to update the cart to D and put it, and that put ends up at node 2; and, Riak does this as well, the user sends along the timestamp that was returned to it. What this enables is causality tracking through the system. So what does node 2 do at this point? Well, it just observed an event, so it increments its own entry in its vector clock, but it also does something special to establish that it knows of the event that happened on node 1: it sets its vector clock to the pairwise max of node 1's vector clock and its own. And this captures that it knows of the other events too. All right, so this is the final state of our system, and it's now time for the replicas to converge. (We won't talk about how replicas converge in Riak, but you can ask me later.)
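Before we get to convergence, here's a rough sketch of that bookkeeping (my own hedged Go illustration with made-up names VClock, Tick, and Merge; Riak itself is written in Erlang and its internals differ): a counter per node, incremented on each local event, and merged by pairwise max when a causal context comes back in with a put.

```go
package main

import "fmt"

// VClock maps a node ID to that node's event counter.
type VClock map[string]int

// Tick records that a node observed or performed an event.
func (v VClock) Tick(node string) {
	v[node]++
}

// Merge takes the pairwise max with another clock, capturing that this
// node now knows about every event the other clock has seen.
func (v VClock) Merge(other VClock) {
	for node, n := range other {
		if n > v[node] {
			v[node] = n
		}
	}
}

func main() {
	// Node 1 handles two events for cart {A}: its clock becomes {n1: 2}.
	cartA := VClock{}
	cartA.Tick("n1")
	cartA.Tick("n1")

	// The client sends that clock back as the causal context with the
	// put of cart {D}, which lands on node 2: tick, then merge.
	cartD := VClock{}
	cartD.Tick("n2")
	cartD.Merge(cartA)

	fmt.Println("cart A:", cartA) // map[n1:2]
	fmt.Println("cart D:", cartD) // map[n1:2 n2:1]
}
```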
OK, so at replica convergence time, our question is: how do we know which versions of the data are causal, and therefore should overwrite older versions, and which versions are conflicting, and therefore should be preserved? And we're going to use the vector clocks to do this. How? Well, the vector clock rule is: if X's vector clock is less than Y's vector clock, then X happened before Y. With an example, let's look at A and D and their associated vector clocks. If we do a pairwise comparison of their corresponding elements, we see that (2, 0, 0) is less than (2, 1, 0), so cart A precedes cart D, and cart D should override cart A. On the other hand, if we look at D and B, we see that (2, 1, 0) is not pairwise less than (0, 0, 1), and the inverse doesn't hold either, so we basically can't order these vector clocks one way or the other. And because the vector clocks can't be ordered, it means that D conflicts with B, which matches our intuition. So this is also what's pretty neat about logical clocks: they give us a means to abstract away physical time and focus on the ordering of events; they're a clever proxy for physical time.
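Here's that comparison rule as a hedged sketch, continuing the made-up VClock illustration above (again my own code, not Riak's): pairwise less-than-or-equal in one direction only means happened-before, and incomparable in both directions means the versions conflict and both should be kept.

```go
package main

import "fmt"

// VClock maps a node ID to that node's event counter, as in the
// earlier sketch.
type VClock map[string]int

// lessOrEqual reports whether every counter in a is <= the matching
// counter in b, i.e. b has seen everything a has seen.
func lessOrEqual(a, b VClock) bool {
	for node, n := range a {
		if n > b[node] {
			return false
		}
	}
	return true
}

// happenedBefore: a strictly precedes b, so b may overwrite a.
func happenedBefore(a, b VClock) bool {
	return lessOrEqual(a, b) && !lessOrEqual(b, a)
}

// concurrent: neither precedes the other, so the versions conflict and
// both must be preserved as siblings.
func concurrent(a, b VClock) bool {
	return !lessOrEqual(a, b) && !lessOrEqual(b, a)
}

func main() {
	cartA := VClock{"n1": 2}          // (2, 0, 0)
	cartD := VClock{"n1": 2, "n2": 1} // (2, 1, 0)
	cartB := VClock{"n3": 1}          // (0, 0, 1)

	fmt.Println(happenedBefore(cartA, cartD)) // true: D overrides A
	fmt.Println(concurrent(cartD, cartB))     // true: D and B conflict
}
```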
But logical clocks also have their pitfalls. For one, they need to be passed around to preserve the causality chain. And they're divorced from physical time, so you can't query Riak for data based on a physical timestamp that you care about, the way you can with Spanner.

All right. We could stand here and delve into the intricacies of Spanner and Riak all day, but rather than doing that, I'd like to take a step back to wrap up and talk about what we learned today. We first started with a discussion of why physical time doesn't quite cut it for distributed systems. We then looked at two very different approaches used by two very different systems to get around this problem. We talked about TrueTime, and how it augments physical clocks, physical time, with uncertainty information to give us timestamps that correspond to wall-clock time and yet give us external consistency. We then talked about a radically different approach to the problem of physical time: doing away with physical time entirely and coming up with the notion of logical time, tracked by vector clocks, which gives us a means to track causal relations and not worry about physical time at all. But these approaches are not perfect. TrueTime requires a globally synchronized clock in order to work, and vector clocks don't have anything to do with physical time. Which brings us to an idea that I will have to leave you with, or am happy to talk about later, which is hybrid logical clocks, used by CockroachDB. It's fairly new, and what it does is bridge the gap between physical time and logical time. It's pretty cool and I encourage you to look into it, but we are about out of time. I hope you leave today with an understanding of the timekeeping mechanisms used by these two systems specifically, but more importantly, perhaps, an appreciation for the nuances of the problem of timekeeping in our systems. Thank you. [Applause]
Info
Channel: Strange Loop Conference
Views: 7,396
Rating: 4.9378238 out of 5
Id: BRvj8PykSc4
Length: 39min 5sec (2345 seconds)
Published: Mon Oct 02 2017