Lecture 8: Zookeeper

Captions

All right. Last time I started talking about linearizability, and I want to finish up this time. The reason we're talking about it again is that it's our standard definition of what strong consistency means in storage-style systems. For example, your lab 3 needs to be linearizable. Sometimes this will come up because we're talking about a strongly consistent system and wondering whether a particular behavior is acceptable. Other times linearizability will come up because we're talking about a system that isn't linearizable, and we'll be wondering in what ways it falls short of or deviates from linearizability.

So one thing you need to be able to do is look at a particular sequence of operations, a particular execution of some system that executes reads and writes, like your lab 3, and answer the question: was that sequence of operations I just saw linearizable or not? We're going to continue practicing that a little bit now, and I'll also try to establish some interesting facts about the consequences for the systems we build and look at.

Linearizability is defined on a particular operation history. The thing we're always talking about is: we observed a sequence of requests by clients, they got responses at different times, they asked to read different data and got various answers back. Is that history we saw linearizable?

Okay, so here's an example of a history that might or might not be linearizable. Time is going to move to the right. Suppose at some point in time some client sent a request; this vertical bar marks the time at which the client sent it. I'm going to use this notation to mean that the request is a write that asks to set variable (or key, or whatever) x to value 0, so it's a key and a value. This would correspond to a Put of key x with value 0
in lab 3. We're watching what the client sends: the client sent this request to our service, and at some point the service responded and said yes, your write has completed. We're assuming the service is of a nature that actually tells you when the write completes; otherwise the definition isn't very useful.

So we have this request by somebody to write. Then I'm imagining in this example there's another request, and because I'm putting this mark here, the second request started after the first request finished. The reason that's important is the rule that a linearizable history must match real time. What that really means is that if a request is known in real time to have started after some other request finished, then the second request has to occur after the first in whatever order we work out as the proof that the history is linearizable.

So in this example I'm imagining there's another request that asks to write x to have value 1, and then a concurrent request, maybe started a little bit later, that asks to set x to 2. So now we have maybe two different clients that issued requests at about the same time to set x to two different values, and of course we're wondering which one is going to be the real value.

Then we also have some reads. If all you have is writes, it's hard to say much about linearizability, because you don't have any proof that the system actually did anything or revealed any values. So we really need reads. Let's imagine we have some reads; you'll see an R in the history. A client sent a read at this time, and at this second time it got an answer: it read key x and got value 2, so presumably it actually saw this value. Then there was another request, by maybe the same client or a different client, but known to have started in time after this request finished, and this read of x got
value 1. So the question in front of us is: is this history linearizable? There are two strategies we can take. We can cook up a sequence: if we can come up with a total order of these five operations that obeys real time, and in which each read sees the value written by the most recently preceding write in the order, then that order is a proof that the history is linearizable. The other strategy is to observe that each of these rules may imply certain "this comes before that" edges in a graph. If we can find a cycle in that "this operation must come before that operation" graph, that's proof that the history isn't linearizable. And for small histories we may actually be able to enumerate every single order and use that to show the history isn't linearizable.

Anyway, any thoughts about whether this might or might not be linearizable? Yes?

Okay, so the observation is that it's a little bit troubling that we saw a read with value 2 and then a read with value 1, and maybe that's a contradiction. There were two writes, one with value 1 and one with value 2. Certainly if we had a read with value 3, something would obviously have gone terribly wrong. But there was a write of 1 and of 2, and a read of 1 and of 2, so the question is whether this order of reads could possibly be reconciled with the way these two writes show up in the history.

Okay, so the game we're playing is that we have maybe two or three clients, and they're talking to some service, maybe a Raft-based service or something, and what we are seeing is requests and responses. What this means is that we saw a request from a client to write x, a Put request for x and 1, and we saw the response here. So what we know is that somewhere during this interval of time, presumably, the service internally changed the value of x to 1. And what this
means is that somewhere in this interval of time the service presumably changed its internal idea of the value of x to 2. But it's just somewhere in this interval; it doesn't mean it happened here or here. Does that answer your question? Yes?

Okay, so the observation is that it is linearizable, and it's been accompanied by an actual proof of the linearizability, namely a demonstration of an order that shows it is linearizable. The order is: first, the write of x with value 0. The server got both of these writes at roughly the same time, and it still had to choose the order itself, so let's say it executed the write of x with value 2 first. Then it executed the first read of x, which at that point would yield 2. Then the next operation it executed was the write of x to 1, and the last operation in the history is the read of x that returns 1.

This is proof that the history is linearizable, because here's a total order of the operations and it matches real time. Let's just go through it. The write of x to 0 comes first, and that's totally intuitive, since it actually finished before any other operation started. The write of x to 2 comes second. I'm going to mark the real time at which we imagine each of these operations happened, to demonstrate that the order does match real time; I'll write a big X here to mark the time when we imagine this operation happened. All right, so that's the second operation. Then the next operation is the read of x that returns 2. There's no real-time problem, because that read was issued concurrently with the write of x to 2. It's not as if the read of x finished and only then did the
write of x with 2 start; they really are concurrent. We'll just imagine that the point in time at which this operation happened is right there. So there's the first operation, the second, and the third. Now we have the write of x with 1. Let's say it happens here in real time; it just has to happen after the operations that precede it in the order. So that's the fourth operation. And now we have the read of x that returns 1. It can pretty much happen at any time, but let's say it happens here.

Okay, so we have the order, and this is the demonstration that the order is consistent with real time: we can pick a time for each operation, within its start and end time, that causes this total order to match the real-time order. The final question is whether each read saw the value written by the most closely preceding write of the same variable. There are two reads. This read is preceded by a write with the correct value, so that's good, and this read is also most closely preceded by a write of the same value.

So this is a demonstration that this history is linearizable. It depends on what you thought when you first saw the history, but it's not always immediately clear that a setup this complicated is linearizable. It's easy to be tricked when looking at these histories. You might think, oh, the write of x with 1 started first, so the first value written must be 1. But that's actually not required here. Any questions about this?

You mean if these two were moved like this? Okay, so if the write with value 2 was only issued by the client after the read of x with value 2 returned, that wouldn't be linearizable, because any order we come up with has to obey the real-time ordering. So any order we come up with would have to have the read of x
with 2 precede the write of x with 2, and since there's no other write of x with 2 in sight, a read at this point could only see 0 or 1, because those are the only other two writes that could possibly come before this read. So shifting these that much would make the example not linearizable.

Yes? I'm saying that the first vertical line is the moment the client sends the request, and the second vertical line is the moment the client receives the response. Yes. So this is a very client-centric kind of definition. It says clients should see the following behavior, and what happens after a client sends a request in (maybe there are a lot of replicas, maybe a complicated network, who knows what) is almost none of our business. The definition is only about what clients see. There are some gray areas, which we'll come to in a moment: if the client needs to retransmit a request, that's something we have to think about.

Okay, so this one is linearizable. Here's another example. I'm going to start out with it being identical to the first example. So again we have a write of x with 0, we have these two concurrent writes, and we have the same two reads. So far that's identical to the previous example, so we know this alone must be linearizable. But I'm going to add something. Let's imagine that client 1 issued these two reads. The definition doesn't really care about clients, but for our own sanity we'll assume client 1 read x and saw 2, and then later read x and saw 1. That's okay so far. Now say there's another client, and the other client does a read of x and sees 1, and then does a second read of x and sees 2.

So is this linearizable? We either have to come up with an order, or with a "this comes before that" graph that has a cycle in it. The puzzle this is getting at: if one client saw...
there are only two writes here, so in any order either one of the writes comes first or the other write comes first. Intuitively, client 1 observed that the write with value 2 came first and then the write with value 1. These two reads mean that in any legal order the write of 2 has to come before the write of 1, in order for client 1 to have seen this. That's the same order we saw in the first example. But symmetrically, client 2's experience clearly shows the opposite: client 2 saw the write of 1 first and then the write with value 2.

One of the rules here is that there's just one total order of operations. We're not allowed to have different clients see different histories, or different progressions of the values stored in the system. There can only be one total order, and all clients have to experience operations in a way that is consistent with that one order. This one client's reads clearly imply that the order is write 2 and then write 1, so we should not be able to have any other client who observes proof that the order was anything else, which is what we have here. So that's a bit of an intuitive explanation for what's going wrong here.

By the way, the reason this could come up in the systems we build and look at is that we're building replicated systems, either Raft replicas or maybe systems with caching in them. We're building systems that have many copies of the data, so there may be many servers with copies of x in them, possibly with different values at different times. If they haven't gotten the commits yet or something, some replicas may have one value and some may have the other. In spite of that, if our system is linearizable, or strongly consistent, it must behave as if there were only one copy of the data and one linear sequence of operations applied to the data. That's why this is an
interesting example: it could come up in a buggy system that had two copies of the data, where one replica executed these writes in one order and the other replica executed them in the other order. Then we could see this, and linearizability says no, we're not allowed to see that in a correct system.

The cycle in the "this comes before that" graph, which would be a slightly more proof-like proof that this is not linearizable, goes as follows. The write of 2 has to come before client 1's read of 2, so there's one arrow like this. Client 1's read of 2 has to come before the write of x with value 1; otherwise client 1's later read wouldn't be able to see 1. You can imagine this write might happen very early in the order, but in that case this later read of x wouldn't see 1, it would see 2, since we know this first read saw 2. So the read of x with 2 must come before the write of x with 1. The write of x with 1 must come before any read of x with value 1, including client 2's read of x with value 1. But in order to get value 1 here and for this next read to see 2, the write of x with value 2 must come between these two operations in the order. So we know that client 2's read of x with 1 must come before the write of x with 2, and that's a cycle. So there's no linear order that can obey all of these time and value rules, and there isn't one because there's a cycle in the graph.

Yes? That's a good question. This definition is a definition about histories, not necessarily about systems. What it's not saying is that a system design is linearizable because of something about the design; it's really only history by history. If we don't get to know how the system operates internally, and the only thing we get to do is watch it while it executes, then before we've seen anything we just don't know. We'll assume it's
linearizable, and then as we see more and more sequences of operations that are all consistent with linearizability, that all follow these rules, we believe it's probably linearizable. If we ever see one that isn't, then we realize it's not linearizable. So yes, it's not a definition on the system design; it's a definition on what we observe the system to do. In that sense it's maybe a little unsatisfying if you're trying to design something. There's no recipe for how you design, except in the trivial sense that in a very simple system (one server, one copy of the data, not threaded or multi-core or anything) it's a little hard to build something that violates this. But it's super easy to violate it in any kind of distributed system.

Okay, so the lesson from this is that there can only be one order in which the system is observed to execute the writes. All clients have to see values that are consistent with the system executing the writes in the same order.

Here's a very simple history, another example. Suppose we write x with value 1, and then definitely subsequently in time another client launches a write of x with value 2 and sees a response back from the service saying yes, I did the write. Then a third client does a read of x and gets value 1. This is a very easy example. It's clearly not linearizable, because the time rule means that the only possible order is the write of x with 1, the write of x with 2, the read of x with 1. That has to be the order, and that order clearly violates the second rule, about values: the value written by the most recent write in the only order that's possible is not 1, it's 2. So this is clearly not linearizable. The reason I'm bringing it up is that this is the argument that a linearizable system, a strongly
consistent system, cannot serve up stale data. The reason this might come up is, again, maybe you have lots of replicas, and maybe each hasn't seen all the writes, or all the committed writes. So maybe all the replicas have seen this first write, but only some replicas have seen this second write. If you ask a replica that's lagging behind a little, it's still going to have value 1 for x. Nevertheless, clients should never be able to see this old value in a linearizable system. No stale data allowed, no stale reads.

Yeah? If there's overlap in the intervals, then the system could legally execute each of them at any real time within its interval, so that's the sense in which the system could execute them in either order. If it weren't for these two reads, the system would have total freedom to execute the writes in either order. But because we saw the two reads, we know that the only legal order is 2 and then 1.

Yeah, so if the two reads were overlapping, then the reads could have seen either order. In fact, before we saw the 2 and the 1 from the reads, until it committed to the values for the reads, the system still had freedom to return them in either order.

I'm using them as synonyms, yeah. For most people, although possibly not today's paper, linearizability is well defined, and people's definitions don't really deviate very much from this. For strong consistency, though, I think there's less consensus about exactly what the definition might be. If you say strong consistency, it's usually meant in ways that are quite close to this, for example that the system behaves the same way a system with only one copy of the data would behave, which is quite close to what we're getting at with this definition.
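The "enumerate every order" strategy for small histories can be sketched in a few lines of Python. This is only an illustration, not something from the labs: the `Op` type, the function name `find_linearization`, and all of the timestamps are assumptions chosen so that the intervals overlap the way the three histories discussed so far were drawn. It tries every total order and keeps one that obeys both rules: real-time order, and each read returning the most recently written value.

```python
from itertools import permutations
from typing import List, NamedTuple, Optional


class Op(NamedTuple):
    kind: str      # "w" for a write of x, "r" for a read of x
    value: int     # value written, or value the read returned
    start: float   # time the client sent the request (assumed)
    end: float     # time the client got the response (assumed)


def find_linearization(history: List[Op]) -> Optional[List[Op]]:
    """Brute force: return one legal total order, or None if there is none."""
    n = len(history)
    # Rule 1 (real time): if a finished before b started, a must precede b.
    before = [(a, b) for a in range(n) for b in range(n)
              if history[a].end < history[b].start]
    for order in permutations(range(n)):
        pos = {op: i for i, op in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in before):
            continue
        # Rule 2 (values): each read sees the most recent preceding write.
        current = None  # no value until the first write
        ok = True
        for i in order:
            op = history[i]
            if op.kind == "w":
                current = op.value
            elif op.value != current:
                ok = False
                break
        if ok:
            return [history[i] for i in order]
    return None


# First history: write 0, concurrent writes of 1 and 2, then reads of 2 and 1.
first = [Op("w", 0, 0, 1), Op("w", 1, 2, 8), Op("w", 2, 3, 9),
         Op("r", 2, 4, 6), Op("r", 1, 7, 10)]
# Second history: same, plus a second client that reads 1 and then 2.
second = first + [Op("r", 1, 4.5, 6.5), Op("r", 2, 7.5, 10.5)]
# Third history: write 1, then write 2, then a read that returns stale 1.
stale = [Op("w", 1, 0, 1), Op("w", 2, 2, 3), Op("r", 1, 4, 5)]

for name, hist in [("first", first), ("second", second), ("stale", stale)]:
    order = find_linearization(hist)
    print(name, "linearizable" if order else "not linearizable",
          [(o.kind, o.value) for o in order] if order else "")
```

Trying all permutations is factorial in the number of operations, so this only works for classroom-sized histories; it is meant to make the two rules concrete, not to be a practical checker.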
But yeah, it's reasonable to assume that strong consistency is the same as linearizable.

Okay, so this is not linearizable, and the lesson is that reads are not allowed to return stale data, only fresh data. You can only return the result of the most recently completed write.

Okay, I have a final little example. We have two clients. One of them submits a write of x with value 3 and then a write of x with value 4. And we have another client, and at this point in time that client issues a read of x, but (and this is the question you asked) the client doesn't get a response. Who knows why. In the actual implementation, maybe the leader crashed at some point. Maybe client 2 sent in the read request and the leader didn't get it because the request was dropped. Or maybe the leader got the request and executed it, but the network dropped the response. Or maybe the leader got it, started to process it, and crashed before it finished processing, or did process it and crashed before sending the response. From the client's point of view, it sent a request and never got a response.

So in the interior machinery of the client, for most of the systems we're talking about, the client is going to resend the request, maybe to a different leader, maybe to the same leader. So it sent the first request here, and at this point in time it times out with no response, sends the second request, and then finally gets a response.

It turns out, and you're going to implement this in lab 3, that a reasonable way for servers to deal with repeated requests is to keep tables, indexed by some kind of unique request number from the clients, in which the servers remember: I already saw that request and executed it, and this was the response that I sent back. Because you don't want to
execute a request twice. For example, if it's a write request, you don't want to execute it twice. So the servers have to be able to filter out duplicate requests, and they have to be able to repeat the reply that they originally sent to that request, which perhaps was dropped by the network. So the servers remember the original reply and repeat it in response to the resend.

If you do that, which you will in lab 3, then since the leader could have seen value 3 when it executed the original read request from client 2, it could return value 3 to the repeated request that was sent at this time and completed at this time. So we have to make a call on whether that is legal. You could argue that, gosh, the client resent the request here, and this was after the write of x to 4 completed, so you really should return 4 at this point instead of 3.

This is a little bit up to the designer. But suppose you view the retransmissions as a low-level concern that's part of the RPC machinery, or hidden in some library, so that from the client application's point of view all that happened was that it sent a request at this time and got a response at this time. Then a value of 3 is totally legal here, because this request took a long time. It's completely concurrent with the write, not ordered in real time with respect to it, and therefore either 3 or 4 is valid, as if the read request really executed here in real time, or here in real time.

So the larger lesson is: if you have client retransmissions, and you're defining linearizability from the application's point of view, then even with retransmissions the real-time extent of a request like this is from the very first transmission of the request to
the final time at which the application actually got the response, maybe after many resends.

Yes? You might rather have fresh data than stale data. Suppose the request is "what time is it?" to a time server. I send a request saying, what time is it, and it sends me a response. If I send a request now and I don't get the response until 2 minutes from now, due to some network issue, it may be that the application would prefer, when it gets the response, to see a time close to the time at which it actually got the response, rather than a time deep in the past when it originally sent the request. Now, the fact is that if you're using a linearizable system, these are the rules, and you have to write applications that are tolerant of them. Correct applications must tolerate the fact that if they send a request and get a response a while later, they are not allowed to reason as if, oh gosh, I got a response, so the value at the time I got the response was equal to 3. That is not okay for applications to think. How that plays out for a given application depends on what the application is doing.

The reason I bring this up is that it's a common question in 6.824. You will implement the machinery by which servers detect duplicates and resend the previous answer that the server originally sent, and the question will come up: if you originally saw the request here, is it okay to return, at this point in time, the response you would have sent back here if the network hadn't dropped it? It's handy to have a way of reasoning about this. One reason to have definitions like linearizability is to be able to reason about questions like that. Using this scheme, we can say that it actually is okay by those rules.
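The duplicate-detection table can be sketched as follows. This is a toy, single-copy illustration in Python rather than the lab's Go code, and everything named here (`DedupKVServer`, `handle`, the client ids and sequence numbers) is hypothetical. It also assumes each client has at most one request outstanding at a time and numbers its requests with increasing sequence numbers, so remembering only the latest request per client is enough.

```python
from typing import Dict, Optional, Tuple


class DedupKVServer:
    """Toy key/value server that filters duplicate requests (assumed design)."""

    def __init__(self) -> None:
        self.data: Dict[str, Optional[int]] = {}
        # client id -> (sequence number of last executed request, its reply)
        self.last: Dict[str, Tuple[int, Optional[int]]] = {}

    def handle(self, client: str, seq: int, op: str, key: str,
               value: Optional[int] = None) -> Optional[int]:
        seen = self.last.get(client)
        if seen is not None and seen[0] == seq:
            # Duplicate: repeat the original reply, do not execute again.
            return seen[1]
        if op == "put":
            self.data[key] = value
            reply = None
        else:  # "get"
            reply = self.data.get(key)
        self.last[client] = (seq, reply)
        return reply


# Replay of the final example: the read executes while x is 3, its reply is
# lost, x becomes 4, and the resend of the same read still returns 3.
server = DedupKVServer()
server.handle("c1", 1, "put", "x", 3)
lost_reply = server.handle("c2", 1, "get", "x")   # executed; reply dropped
server.handle("c1", 2, "put", "x", 4)
retry_reply = server.handle("c2", 1, "get", "x")  # same request, resent
fresh_reply = server.handle("c2", 2, "get", "x")  # a genuinely new read
print(lost_reply, retry_reply, fresh_reply)
```

From the application's point of view the read spans the interval from the first send to the final reply, which overlaps the write of 4, so the repeated reply of 3 is consistent with the rules above; only the later, separate read is obliged to return 4.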
All right, that's all I want to say about linearizability. Any lingering questions? Yeah?

Well, maybe I'm taking liberties here, but what's going on is that in real time we have a read of 2 and a read of 1, and the read of 1 really came after the read of 2 in real time, so they must be in this order in the final order. That means there must have been a write with value 1 somewhere in here, after the read of 2 and before the read of 1 in the final order. There's only one write with value 1 available. If there were more than one we maybe could play games, but there's only one, so this write must slip in here in the final order, and therefore I felt able to draw this arrow. These arrows just capture the one-by-one implications of the rules for what the order must look like.

Yeah? Sorry, which one? His own Rx1? Okay. Well, we're not really able to say which of these two reads came first, so we can't quite draw this arrow, if we mean this arrow to constrain the ultimate order. These two reads could come in either order, so we're not allowed to say this one came before that one. It could be that there's a simpler cycle than the one I've drawn. Certainly the damage is in these four items, I agree with that; these four items are the main evidence that something is wrong. Whether there's a cycle that involves just those, I'm not sure. There could be. Okay, this is worth thinking about, because if I can't think of anything better I'll certainly ask you a question about linearizable histories on the midterm.

Okay, so today's paper is ZooKeeper. Part of the reason we're reading the ZooKeeper paper is
that it's a successful real-world system. It's an open-source service that a lot of people run, and it has been incorporated into a lot of real-world software, so there's a certain kind of reality and success to it. That makes it attractive from the point of view of supporting the idea that ZooKeeper's design might actually be a reasonable design. But the reason I'm interested in it is for two somewhat more precise technical points.

So why are we looking at this paper? One reason is a contrast with Raft. The Raft you've written, and Raft as it's defined, is really a library. You can use a Raft library as part of some larger replicated system, but Raft isn't a standalone service that you can talk to; you really have to design your application to interact with the Raft library explicitly. So it's an interesting question whether some useful standalone general-purpose system could be defined that would be helpful for people building separate distributed systems. Is there some service that can bite off a significant portion of why it's painful to build distributed systems and package it up in a standalone service that anybody can use? So this is really the question of what an API would look like for a general-purpose (I'm not sure what the right name for things like ZooKeeper is) coordination service.

The other interesting aspect of ZooKeeper is this. ZooKeeper is a replicated system, because among other things it's a fault-tolerant general-purpose coordination service, and it gets fault tolerance, like most systems, by replication. That is, there's a bunch of ZooKeeper servers, maybe three or five or seven. It takes money to buy those servers; a 7-server ZooKeeper setup is 7 times
as expensive as a simple single server. So it's very tempting to ask: if you buy 7 servers to run your replicated service, can you get 7 times the performance out of them? How could we possibly do that? So the question is: if we have N times as many servers, can that yield N times the performance?

I'm going to talk about the second question first. For this discussion about performance, I'm going to view ZooKeeper as just some service (we don't really care what the service is) replicated with a Raft-like replication system. ZooKeeper actually runs on top of a thing called Zab, which for our purposes we'll treat as being almost identical to Raft. I'm just worried about the performance of the replication, not about what ZooKeeper specifically is up to.

So the general picture is that we have a bunch of clients, maybe hundreds, and just as in the labs we have a leader. The leader has a ZooKeeper layer that clients talk to, and under the ZooKeeper layer is the Zab system that manages replication. Just like Raft, a lot of what Zab is doing is maintaining a log that contains the sequence of operations that clients have sent in. We may have a bunch of these replicas, and each of them has a log to which it's appending new requests. That's a familiar setup. So a client sends in a request, and the Zab layer sends a copy of that request to each of the replicas, and the replicas append it to their in-memory log, probably persisted onto a disk so they can get it back if they crash and restart.

So the question is: as we add more servers (we could have four servers, or five, or seven, or whatever) does the system get faster as we add more CPUs, more horsepower? Do you think your labs will get faster as you add more replicas, assuming each replica is its own computer, so that you really do
get more CPU cycles as you add more replicas? Yeah. Yeah, there's nothing about this that makes it faster as you add more servers. That's absolutely true. As we add more servers, the leader is almost certainly a bottleneck, because the leader has to process every request and it sends a copy of every request to every other server. As you add more servers, it just adds more work to this bottleneck node. You're not getting any performance benefit out of the added servers, because they're not really doing anything; they're all just happily doing whatever the leader tells them to do. They're not subtracting from the leader's work, and every single operation goes through the leader. So here the performance is, you know, inversely proportional to the number of servers: as you add more servers, performance almost certainly gets lower, because the leader just has more work. So in this system we have the problem that more servers make the system slower. That's too bad. These servers cost a couple thousand bucks each, and you would hope that you could use them to get better performance.

Yeah? OK, so the question is: what if the requests, maybe from different clients, or successive requests from the same client, apply to totally different parts of the state? So in a key/value store, maybe one of them is a put on X and the other is a put on Y, nothing to do with each other. Can we take advantage of that? And the answer to that is absolutely, though not in this framework, or at least the extent to which we can take advantage of it is very limited in this framework. At a high level, the requests all still go through the leader, and the leader still has to send each one out to all the replicas, and the more replicas there are, the more messages the leader has to send. So at a high level, this sort of commutativity of requests is not likely to
help in this situation. It's a fantastic thought to keep in mind, though, because it'll absolutely come up in other systems, and people are able to take advantage of it in other systems.

OK, so it's a little bit disappointing that more server hardware wasn't helping performance. Maybe the simplest way that you might be able to harness these other servers is to build a system in which, yeah, write requests all have to go through the leader, but in the real world a huge number of workloads are read-heavy. That is, there are many more reads than writes. When you look at web pages, it's all about reading data to produce the web page, and generally there are relatively few writes, and that's true of a lot of systems. So maybe we'll send writes to the leader but send reads just to one of the replicas. Just pick one of the replicas, and if you have a read-only request, like a Get in lab 3, send it to one of the replicas and not to the leader. Now, if we do that, we haven't helped writes much, although we've gotten a lot of the read workload off the leader, so maybe that helps. But we absolutely have made tremendous progress with reads, because now the more servers we add, the more clients we can support, since we're just splitting the client read work across the different replicas.

So the question is: if we have clients send reads directly to the replicas, are we going to be happy? Yeah, so "up to date" is the right phrase. In a Raft-like system, which Zookeeper is, if a client sends a request to a random replica, sure, the replica has a copy of the log in it, it's been executing along with the leader, and for lab 3 it's got this key/value table. You do a Get for key X, and it's going to have something for key X in its table, and it can reply to you. So functionally the replica has all the pieces it needs to respond to read requests from clients. The difficulty is that there's no reason to believe that any one
replica other than the leader is up to date. Well, there's a bunch of reasons why replicas may not be up to date. One of them is that they may not be in the majority that the leader was waiting for. If you think about what Raft is doing, the leader is only obliged to wait for responses to its AppendEntries from a majority of the followers, and then it can commit the operation and go on to the next operation. So if this replica wasn't in the majority, it may never have seen a write; maybe the network dropped it and the replica never got it. So the leader and a majority of the servers have seen the first three requests, but this server only saw the first two; it's missing the third. So a read of, you know, what should be there will just get a stale value from this one.

Even if this replica actually saw this new log entry, it might be missing the commit command. Zookeeper's Zab is much the same as Raft: it first sends out a log entry, and then when the leader gets a majority of positive replies, the leader sends out a notification saying, yeah, I'm committing that log entry. This replica may not have gotten the commit. And the sort of worst-case version of this, although it's equivalent to what I already said, is that for all the client knows, this replica may be partitioned from the leader, or may just not be in contact with the leader at all, and the follower doesn't really have a way of knowing that it was actually cut off a moment ago from the leader and is just not getting anything.

So without some further cleverness, if we want to build a linearizable system, we can't play this game, attractive as it is for performance, of sending read requests to the replicas. And you shouldn't do it for lab 3 either, because lab 3 is also supposed to be linearizable. Any questions about why linearizability forbids us from having replicas serve
clients? OK. The proof is, I've lost it now, but the proof was that simple write-1, write-2, read-1 example I put on the board earlier: you're just not allowed to serve stale data in a linearizable system.

OK, so how does Zookeeper deal with this? You can tell from Table 2: if you look in Table 2, Zookeeper's read performance goes up dramatically as you add more servers. So clearly Zookeeper is playing some game here which must be allowing it to serve read-only requests from the additional servers, the replicas. So how does Zookeeper make this safe? That's right. I mean, in fact, it's not required to return the latest written data. The way Zookeeper skins this cat is that it's not linearizable. They just define away this problem and say, well, we're not going to provide linearizable reads, and therefore Zookeeper is not obliged to provide fresh data to reads. It's allowed by its rules of consistency, which are not linearizable, to produce stale data for reads. So it's solved this technical problem with a kind of definitional wave of the wand, by saying, well, we never owed you linearizability in the first place, so it's not a bug if we don't provide it. And that's actually a pretty classic way to approach the tension between performance and strong consistency: just don't provide strong consistency.

Nevertheless, we have to keep in the back of our minds the question: if the system doesn't provide linearizability, is it still going to be useful? You do a read and you just don't get the current answer, the latest data; why do we believe that's going to produce a useful system? So let me talk about that.

So first of all, any questions about the basic problem? Zookeeper really does
allow clients to send read-only requests to any replica, and the replica responds out of its current state. That replica may be lagging: its log may not have the very latest log entries, and so it may return stale data even though there's a more recent committed value.

OK, so what are we left with? Zookeeper does have a set of consistency guarantees, to help people who write Zookeeper-based applications reason about what's actually going to happen when they run them. These guarantees have to do with ordering, as indeed linearizability does. Zookeeper has two main guarantees that they state, and this is in section 2.3.

One of them is that writes are linearizable. Now, their notion of linearizable is not quite the same as mine, maybe because they're talking about writes and not reads. What they really mean here is that even though clients might submit writes concurrently, the system behaves as if it executes the writes one at a time in some order, and indeed obeys the real-time ordering of writes. So if one write is seen to have completed before another write is issued, then Zookeeper will indeed act as if it executed the second write after the first write. So it's writes, but not reads, that are linearizable. And Zookeeper isn't a strict read/write system; there are actually writes that imply reads also, and for those mixed operations, any operation that modifies the state is linearizable with respect to all other operations that modify the state.

The other guarantee it gives is that any given client's operations execute in the order specified by the client. They call that FIFO client order. What this means is that if a particular client issues a write and then a read and then a read and a write, or whatever, then first of all the writes from that sequence fit, in the client-specified order, into the overall order of all clients' writes.
So if a client says do this write, then that write, and then a third write, in the final order of writes we'll see the client's writes occur in the order the client specified. So for writes, this is the client-specified order. And this is particularly an issue with this system because clients are allowed to launch asynchronous write requests. That is, a client can fire off a whole sequence of writes to the Zookeeper leader without waiting for any of them to complete. The paper doesn't exactly say this, but presumably, in order for the leader to actually be able to execute the client's writes in the client-specified order, I'm imagining that the client stamps its write requests with numbers, saying do this one first, this one second, this one third, and the Zookeeper leader obeys that ordering. So this is particularly interesting due to these asynchronous write requests.

For reads, this is a little more complicated. The reads, as I said before, don't go through the leader. The writes all go through the leader, but the reads just go to some replica, and so all they see is the stuff that happens to have made it into that replica's log. The way we're supposed to think about the FIFO client order for reads is that if the client issues a sequence of reads in some order, the client reads one thing and then another thing and then a third thing, then relative to the log on the replica it's talking to, that client's reads each have to occur at some particular point in the log; that is, each needs to observe the state as it existed at a particular point in the log. And furthermore, successive reads have to observe points that don't go backwards. That is, if a client issues one read and then another read, and the first read executes at this point in the log, the second read is allowed to execute at the same or a later point in the log, but it's not allowed to see a previous state. If I issue one read and
then another read, the second read has to see a state that's at least as up to date as the first. That's a significant fact that we're going to harness when we're reasoning about how to write correct Zookeeper applications.

Where this is especially exciting is this: if the client is talking to one replica for a while, and it issues some reads, a read here and then a read there, and then this replica fails and the client needs to start sending its reads to another replica, the FIFO client order guarantee still holds when the client switches to the new replica. That means that if, before the crash, the client did a read that saw state as of this point in the log, then when the client switches to the new replica and issues another read, that read has to execute at this point or later, even though it's switched replicas.

The way this works is that the leader tags each of these log entries with a zxid, which is basically just an entry number. Whenever a replica responds to a client read request, it has executed the request at a particular point in the log, and the replica responds with the zxid of the immediately preceding log entry. The client remembers the zxid of the most recent data it has seen, that is, the highest zxid it has ever seen. And when the client sends a request to the same or a different replica, it accompanies the request with that highest zxid it has ever seen. That tells this other replica: aha, I need to respond to that request with data that's at least as recent as this point in the log. And that's interesting if this second replica is even less up to date, say it hasn't yet received any of these entries, but it receives a request from a client, and the client says, oh gosh, the last read I did executed at this spot in the log on some other replica. This replica needs to wait
until it's gotten the entire log up to this point before it's allowed to respond to the client. I'm not sure exactly how that works, but either the replica just delays responding to the read, or maybe it rejects the read and says, look, I just don't know the information, talk to somebody else or talk to me later. Eventually this replica will catch up, if it's connected to the leader, and then it will be able to respond.

OK, so reads are ordered: they only go forward in time, or only go forward in log order. A further thing, which I believe is true about reads and writes, is that the FIFO client order applies to all of a single client's requests, reads and writes together. So if I do a write from a client, I send the write to the leader, and it takes time before that write is sent out, committed, whatever. So I may send a write to the leader, and the leader hasn't processed it or committed it yet, and then I send a read to a replica. The read may have to stall, in order to guarantee FIFO client order, until this replica has actually seen and executed the client's previous write operation. So a consequence of this FIFO client order is that reads and writes are in the same order.

The most obvious way to see this is: if a client writes a particular piece of data, sends a write to the leader, and then immediately does a read of the same piece of data and sends that read to a replica, boy, it had better see its own written value. If I write something to have value 17, and then I do a read and it doesn't have value 17, then that's just bizarre, and it's evidence that the system was not executing my requests in order, because otherwise it would have executed the write before the read. So there must be some funny business with the replica stalling. The client must, when it sends a read, say: look, the last write request I sent to the leader had zxid such-and-such, and
this replica has to wait until it sees that from the leader. Yes? Oh, absolutely. So I think what you're observing is that a read from a replica may not see the latest data. The leader may have sent out C to a majority of replicas and committed it, and the majority may have executed it, but if the replica that we're talking to wasn't in that majority, maybe this replica doesn't have the latest data. That's just the way Zookeeper works; it does not guarantee that we'll see the latest data. There is a guarantee about read/write ordering, but it's only per client. So if I send a write in and then I read that data, the system guarantees that my read observes my write. If you send a write in and then I read the data that you wrote, the system does not guarantee that I see your write. And that's the foundation of how they get speed-up for reads proportional to the number of replicas.

So I would say the system isn't linearizable, but it's not that it has no properties. The writes certainly are: all writes from all clients form some one-at-a-time sequence, so that's a sense in which all writes are linearizable. And this probably means that each individual client's operations are linearizable as well, though I'm not quite sure.

I'm actually not sure how it works, but a reasonable supposition is that when I send in an asynchronous write, the system doesn't execute it yet, but it does reply to me saying, yeah, I got your write, and here's the zxid that it will have if it's committed. So that's a reasonable theory; I don't actually know how it does it. And then the client, if it does a read, needs to tell the replica about that write it did.

OK, so if you send a read to a replica, the replica is going to return you, you know,
really it's a read from this table, that's notionally what the client thinks it's doing. So the client says, oh, I want to read this row from this table, and the server, this replica, sends back its current value for that row, plus the zxid of the last operation that updated that table. Yeah? So actually, I'm not prepared to say exactly. The two things that would make sense, and I think either of them would be OK, are: the server could track, for every table row, the zxid of the last write operation that touched it; or it could, for all read requests, just return the zxid of the last committed operation in its log, regardless of whether that was the last operation to touch that row. All we need to do is make sure that client requests move forward in the order, so we just need to return something that's greater than or equal to the zxid of the write that last touched the data that the client read.

All right, so these are the guarantees. We're still left with the question of whether it's possible to do reasonable programming with this set of guarantees. And the answer is, well, at a high level this is not quite as good as linearizability. It's a little bit harder to reason about, and there are more gotchas: reads can return stale data, which just can't happen in a linearizable system. But it's nevertheless good enough to make it pretty straightforward to reason about a lot of things you might want to do with Zookeeper. So I'm going to try to construct an argument, maybe by example, for why this is not such a bad programming model.

One reason, by the way, is that there's an out: there's an operation called sync, which is essentially a write operation. Supposing I know that you recently wrote something, you being a different client, and I want to read what you wrote, so I actually want fresh data. I can send in one of these sync operations. Effectively, the sync operation makes its way through the
system as if it were a write, finally showing up in the logs of the replicas, or at least the replica that I'm talking to. Then I can come back and do a read, and I can tell the replica, basically, don't serve this read until you've seen my last sync. And that actually falls out naturally from FIFO client order: if we count sync as a write, then FIFO client order says reads are required to see state that's at least as up to date as the last write from that client. So if I send in a sync and then I do a read, the system is obliged to give me data that's at least as up to date as where my sync fell in the log order. Anyway, if I need to read up-to-date data, I send in a sync and then do a read, and the read is guaranteed to see data as of the time the sync was entered into the log, so reasonably fresh. So that's one out, but it's an expensive one, because now we've converted a cheap read into a sync operation that burns up time on the leader. So it's a no-no if you don't have to do it.

But here are a couple of examples of scenarios the paper talks about, where the reasoning is reasonably simple given the rules that are here. First I want to talk about the trick in section 2.3 with the ready file. We assume there's some master, and the master is maintaining a configuration in Zookeeper, which is a bunch of files in Zookeeper that describe something about our distributed system, like the IP addresses of all the workers, or who the master is, or something. So we have the master, who's updating this configuration, and maybe a bunch of readers that need to read the current configuration and need to see it every time it changes. So the question is: even though the configuration is split across many files in Zookeeper, can we construct something that has the effect of an atomic update, so that workers that look at the configuration
don't see a partially updated configuration, but only a completely updated one? That's a classic kind of thing, this configuration management, that people use Zookeeper for.

So, copying what section 2.3 describes, we'll say the master is doing a bunch of writes to update the configuration, and here's the order in which the master for our distributed system does the writes. First, we're assuming there's some ready file, a file named "ready". If the ready file exists, then we're allowed to read the configuration. If the ready file is missing, that means the configuration is being updated and we shouldn't look at it. So if the master is going to update the configuration, the very first thing it does is delete the ready file. Then it writes the various Zookeeper files that hold the data for the configuration; it might be a lot of files, who knows. And then, when it has completely updated all the files that make up the configuration, it creates that ready file again.

All right, so far the semantics are extremely straightforward. There are only writes here, no reads, and writes are guaranteed to execute in a linear order. And I guess now we have to appeal to FIFO client order: if the master tags these as, oh, I want my writes to occur in this order, then the leader is obliged to enter them into the replicated log in that order. And so the replicas will all dutifully execute these one at a time: they'll all delete the ready file, then apply this write and that write, and then create the ready file again. So these are writes, and the order is straightforward.

For the reads, though, maybe a little more thinking is required. Supposing we have some worker that needs to read the current configuration. We're going to assume that this worker first checks to see whether the ready file exists. If it doesn't exist, it's going to sleep and try again. So let's assume it
does exist. Let's assume that the worker checks to see if the ready file exists after it's been recreated. Now, these are all write requests sent to the leader, and this is a read request that's just sent to whatever replica the client is talking to. And then, if the file exists, the worker is going to read f1 and read f2.

The interesting thing that FIFO client order guarantees here is that if this exists call returned true, that is, if the replica the client was talking to said yes, that file exists, then, at least with this setup, that means that replica had actually seen the re-create of the ready file, in order for this exists to see that the ready file exists. And because successive read operations are required to march only forwards in the log and never backwards, that means that if the log of the replica the client was talking to actually contained, and the replica had executed, this create of the ready file, then subsequent client reads must move only forward in the sequence of writes that the leader put into the log. So if we saw this ready file, that means the replica executes the reads down here somewhere, after the write that created the ready file, and that means the reads are guaranteed to observe the effects of these writes. So we do actually get some reasoning benefit here from the fact that, even though it's not fully linearizable, the writes are linearizable and the reads have to monotonically move forward in time through the log.

Yes? Yeah, so that's a great question. Your question is about what this client actually knows. If this is the real scenario, that the create is entered in the log and then the read arrives at the replica after that replica executed this create of ready, then everything's straightforward. But there are other possibilities for how this stuff was interleaved. So let's look at a much more troubling
scenario. The scenario you brought up, which I happen to be prepared to talk about, is this. Way back in time, some previous master, or this master, created the ready file after it finished updating the state. Say the ready file existed for a while. Then some new master, or this master, needs to change the configuration, so it deletes the ready file and does its writes. And what's really troubling is that the client that needs to read this configuration might have called exists, to see whether the ready file exists, at this earlier time. At that point in time, sure, the ready file exists. Then time passes, and the client issues its reads. Maybe the client reads the first file that makes up the configuration, and then it reads the second file, but maybe this second read comes entirely after the master has been changing the configuration. So now this reader has read this damaged mix of f1 from the old configuration and f2 from the new configuration. There's no reason to believe that's going to contain anything other than broken information. So the first scenario was great; this scenario is a disaster.

So now we're starting to get into the kind of serious challenges that a carefully designed API for coordination between machines in a distributed system might actually help us solve. For lab 3, you're going to build a put/get system, and a simple lab 3 style put/get system would run into this problem too, and just does not have any tools to deal with it. But the Zookeeper API is more clever than this, and it can cope with it. The way you would actually use Zookeeper is that when the client sends in this exists request to ask whether this file exists, it would say not only tell me whether this file exists, but also set a watch on that file, which
means: if the file is ever deleted, or, if it doesn't exist, if it's ever created, but in this case if it is ever deleted, please send me a notification. And furthermore, about the notifications that Zookeeper sends: the reader here is only talking to some replica, and it's the replica that's doing all these things for it. The replica guarantees to send a notification for a change to this ready file at the correct point relative to the responses to the client's reads.

The implication of that, in this scenario in which these writes fit in here in real time, is this. The guarantee is that if you ask for a watch on something and then you issue some reads, and the replica you're talking to executes something that should trigger the watch during your sequence of reads, then the replica guarantees to deliver the notification about the watch before it responds to any read that saw the log after the point where the operation that triggered the watch notification executed.

So this is the log on the replica. The FIFO client ordering says each client request must fit somewhere into the log, and apparently these fit in here in the log. What we're worried about is that this read occurs here in the log. But we set up this watch, and the guarantee is that if somebody deletes this file and we're to be notified, then that notification will appear at the client before the result of any read that saw something in the log after the operation that produced the notification. So what this means is that, since we have a watch on the ready file, the delete of ready is going to generate a notification, and that notification is guaranteed to be delivered before the read result of f2, if f2 was going to see this
second write. And that means that before the reading client has finished the sequence in which it looks at the configuration, it's guaranteed to see the watch notification before it sees the results of any write that happened after the delete that triggered the notification.

Who generates the watch? Well, the replica. Let's say the client is talking to this replica, and it sends in the exists request. The exists request is a read-only request, so it's sent to this replica. The replica is maintaining, on the side, a table of watches, saying, oh, such-and-such a client asked for a watch on this file. And furthermore, the watch was established at a particular zxid: the client did a read, the replica executed the read at this point in the log and returned results relative to this point in the log, and it remembers, oh, that watch is relative to that point in the log. And then if a delete comes in, for every operation that it executes, it looks in this little table and says, aha, there was a watch on that file. Maybe it's indexed by a hash of the file name or something.

OK, so the question is: this replica has to have a watch table, so if the replica crashes and the client switches to a different replica, what about the watch table? It's already established these watches. And the answer is that, no, if your replica crashes, the new replica you switch to won't have the watch table. But the client gets a notification, at the appropriate point in the stream of responses it gets back, saying, oops, the replica you were talking to crashed. And so the client then knows it has to completely reset everything up. And so tucked away in the examples are missing event handlers to say, oh gosh, we need to go back and re-establish everything if we get a notification that our replica crashed. All right, I'll continue this.
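
The two mechanisms described in the lecture, reads that never go backwards in the log and a watch notification that arrives before any read that saw later log entries, can be sketched in a few lines of Python. This is a toy, single-process model and not Zookeeper's implementation or API: the names `Client`, `Replica`, `apply_next`, and `read_config` are hypothetical, the zxid is assumed to be just the log index, and a lagging replica is assumed to reject a read (the lecture notes it might instead delay it).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality/hash so clients can live in a set
class Client:
    last_zxid: int = 0                           # highest zxid this client has seen
    events: list = field(default_factory=list)   # watch notifications, in arrival order


class Replica:
    def __init__(self, log):
        self.log = log        # shared list of (op, path, value); zxid = index + 1
        self.applied = 0      # zxid of the last log entry this replica executed
        self.state = {}       # path -> value
        self.watches = {}     # path -> set of clients to notify once

    def apply_next(self):
        """Execute the next log entry and fire any watch on the touched path."""
        op, path, value = self.log[self.applied]
        self.applied += 1
        if op == "set":
            self.state[path] = value
        else:  # "delete"
            self.state.pop(path, None)
        # The notification is queued here, so it precedes every later read reply.
        for client in self.watches.pop(path, set()):
            client.events.append((self.applied, path))

    def read(self, client, path, watch=False):
        """Serve a read only if this replica is at least as current as the client."""
        if self.applied < client.last_zxid:
            # Assumption: reject rather than stall until caught up.
            raise RuntimeError("replica is behind this client; retry later")
        if watch:
            self.watches.setdefault(path, set()).add(client)
        client.last_zxid = self.applied   # client remembers the zxid it was served at
        return self.state.get(path)       # None means the path does not exist


def read_config(client, replica, between_reads=lambda: None):
    """Return (f1, f2) only if 'ready' existed and no watch fired while reading."""
    client.events.clear()
    if replica.read(client, "ready", watch=True) is None:
        return None                       # update in progress; caller should retry
    f1 = replica.read(client, "f1")
    between_reads()                       # demo hook: the replica's log may advance here
    f2 = replica.read(client, "f2")
    if client.events:                     # 'ready' changed mid-read: discard the mix
        return None
    return f1, f2


log = [
    ("set", "f1", "old-1"), ("set", "f2", "old-2"), ("set", "ready", "yes"),  # zxid 1-3
    ("delete", "ready", None),                                                # zxid 4
    ("set", "f1", "new-1"), ("set", "f2", "new-2"), ("set", "ready", "yes"),  # zxid 5-7
]

a = Replica(log)
for _ in range(3):
    a.apply_next()            # replica a has executed the first configuration

worker = Client()
clean = read_config(worker, a)   # nothing changes during the reads

# The troubling case: the master's delete and rewrites land between the two reads.
torn = read_config(worker, a, between_reads=lambda: [a.apply_next() for _ in range(3)])

b = Replica(log)
for _ in range(2):
    b.apply_next()            # replica b lags behind what the worker has already seen
```

In the first call the worker gets a consistent old configuration. In the second, the worker would have read `f1` from the old configuration and `f2` from the new one, but the queued notification for `ready` tells it to throw the result away and start over. A read sent to the lagging replica `b` is refused because `b` has not reached the worker's `last_zxid`.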

Info

Channel: MIT 6.824: Distributed Systems

Views: 25,347

Id: pbmyrNjzdDk

Length: 80min 31sec (4831 seconds)

Published: Thu Mar 05 2020