Everything You Thought You Already Knew About Orchestration

Captions
This talk is about orchestration, and specifically all the stuff that you think you know but probably kind of don't. I have been using Docker for a long time, long before orchestration was really part of the story; my first Docker version was 0.7.0 in late 2013. I've been working with Docker for so long that they've given me a fancy hat and the title of Docker Captain. So if you're brand new to Docker, which you're probably not if you're in this room, please seek out me or someone else who's a Docker Captain; we'd be happy to chat with you. As Jerome mentioned, I'm the Director of Engineering at Codeship, which is a CI/CD company. I'm not going to talk about CI/CD, but we do have a booth here if you're interested in Docker, CI/CD, not being paged, all those things; we can talk to you about that at our booth. I think we're at booth five in the expo hall, so come say hi. We have hot sauce; it's good, and it's TSA-approved.

So today, as Jerome mentioned, we're talking about orchestration. We'll talk about the Raft algorithm: quorum, leader election, log replication. I'll talk a bit about service scheduling and strategies for scheduling tasks, how they get dispatched onto your cluster. And then for my last trick we will go over some pretty gnarly failure-recovery scenarios. I have a lot of live demos, I think six of them, and this is a Black Belt talk, so some of this is not necessarily recommended use of Docker; bear with me. I'll do everything live, because otherwise it's cheating. Throughout the talk I'll have some debugging tips that you might find useful if you're trying to figure out what's going on in your cluster.

The whole reason that orchestration is so complex, and why we're all sitting in this room, is that what we're trying to do is get this big collection of nodes to behave like a single node. To the outside user it should seem like you're talking to only one computer. The way that happens boils down to two main categories. One is maintaining state in the system: in order to behave like one single thing, there has to be a single state, and how does that happen? The other is how work gets scheduled: if you say "docker service create" something, how do the orchestrator and the scheduler decide where to dispatch those tasks so that they run on a node that can handle the workload? State and scheduling are the two big problems, and everything we'll talk about today falls into one of those two categories.

Before we get too deep into it: this is a beautiful cluster, and we're all familiar with what this looks like. For the next 35 minutes or so I'm mostly going to talk about the managers, in orange. I'm going to talk about Raft consensus and the consensus group; we'll talk about leaders and followers in the management group. I'm not really going to talk much about the workers themselves or the interface between managers and workers. Most of the heavy lifting in a cluster happens at the manager level, and I think that's the most important stuff, so we'll focus there.

The number one most important thing you need to understand, in order for your highly available services to stay highly available and for things not to turn upside down, is quorum. So what even is quorum? I took a sample poll at the customer dinner last night and got a lot of vague "oh yeah, I understand it," which leads me to believe that maybe some of you who are administering clusters in this room might not totally understand what quorum is. But it is essential and imperative that you do.
Quorum, just like in the legislative, political sense, is the minimum number of votes that a group needs in order to perform an operation. In distributed systems that operation is something like scheduling a task or promoting a node to become the new leader, i.e. electing that node. If you do not have quorum, your system cannot do work, so you need to plan to maintain quorum in your consensus group at all times.

This can be done with math, which is fortunate for all of us. The formula is a little weird, because of course you can't have half of a manager; we have to deal in whole numbers. If you have a consensus group of five, it's 5 divided by 2, take the floor of that, plus 1, so quorum is 3. If you have one manager, quorum is one; if you have two managers, quorum is two; and so on. In simpler terms, quorum means majority, and of course we're talking about nodes in your manager group that are online and able to vote.

What's maybe more important than understanding quorum is understanding how much failure your system can tolerate. With one manager, quorum is one and your fault tolerance is zero. With two managers, quorum is two, so if you lose one of them you're below quorum and your fault tolerance is still zero. In fact, even numbers of managers are highly inefficient; please don't use them. Use odd numbers only. The phrase "I literally can't even" is kind of annoying, but it's a good rule for your management cluster: only odd numbers. Having two managers instead of one is counterintuitive; when I say it like this you'll think, oh my god, Laura, you just blew my mind, but having two managers instead of one actually doubles your chances of losing quorum. It makes your management cluster more unstable: you have two points of failure instead of one. Please don't do it. Have one, have three, have five. Odd numbers only; don't use even numbers.

If you're using multiple regions, or doing something really distributed across the world, pay attention to data center topology when you're placing manager nodes. Plan for one of your data centers, say us-east-1, to catch fire without you even knowing about it.
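To put that quorum math somewhere you can poke at it, here's a tiny shell sketch; nothing swarm-specific, just the floor(N/2)+1 formula and the fault tolerance that falls out of it:

    # quorum = floor(N/2) + 1, tolerance = N - quorum
    for n in 1 2 3 4 5 6 7; do
      quorum=$(( n / 2 + 1 ))        # integer division already floors
      tolerance=$(( n - quorum ))
      echo "managers=$n quorum=$quorum can_lose=$tolerance"
    done

Notice that going from three managers to four doesn't buy you any extra fault tolerance (both can lose exactly one); that's the "odd numbers only" rule in numeric form.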
(By the way, all of my demos are in us-east-1, so I probably should have rubbed that plastic skull or something before I came up here.) So if you're distributing, for example, three manager nodes across three regions, have one node in each region; pretty intuitive. Plan to have one region totally blown away and still maintain quorum. This, again, can be done with math. In Docker for AWS this is done automatically for you via auto scaling groups; auto scaling groups in AWS kind of distribute that placement for you, so use them if you can.

Quorum is just one part of the story. Raft is the thing that takes this idea of quorum and makes your system run. Raft is really complex; we talked about it in the keynote. No sensible person wants to write their own distributed consensus algorithm; that's why Raft is used. You can write your own implementation, but the algorithm itself is not worth writing from scratch; just use the one that's out there. Raft is responsible for a couple of things: log replication in your cluster, leader election, and safety, meaning making sure that nothing too terrible can happen, that your cluster doesn't get into a terrible state. I'm not going to talk much about safety; it's sort of built into the log and leader parts of what the protocol is responsible for. Being easy to understand is kind of an unwritten responsibility of Raft, too. The whole reason we're talking about Raft instead of something like Paxos or Multi-Paxos is that no one understands Multi-Paxos; I could not stand up here and give you a talk on how Multi-Paxos works, because I couldn't explain it very eloquently. And that's a problem for people administering distributed systems: if you don't understand the algorithm that's backing your entire management cluster, that's a big red flag. Raft was designed to be easier to understand, easier to teach, easier for students to learn, which is great for you all, because then it's easier for you to understand as well.

Chances are, even if you don't use SwarmKit, even if today is the first time you've ever heard of Raft, you've probably used it before. It's used everywhere that etcd is used, because etcd is backed by Raft. Orchestration systems need some kind of consensus algorithm backing them, and in a lot of cases that algorithm is Raft. So if you're using Kubernetes, if you're using etcd, you are using Raft already; maybe you just don't know it. The difference with SwarmKit is that SwarmKit implements Raft directly; SwarmKit does not use etcd, it just has Raft built in. It's there on GitHub, you can check it out and read it, there are lots of comments; I highly recommend that you do.

In most cases Raft is running on your manager nodes, and Raft takes up a lot of resources to do the work it needs to do to keep your cluster up and running. I would highly recommend not running work on your manager nodes. This maybe seems like cluster 101, but I think a lot of people forget. You can drain your manager nodes by using docker node update --availability drain; that will make sure that no work gets scheduled on the manager nodes. It's also good to have the manager nodes be a bit bigger than your worker nodes, simply because Raft does take up a lot of resources, and the nodes running the algorithm can be a bit more sensitive to resource starvation than a normal worker node.
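The drain itself is one command per manager; something like this, where the node names are whatever docker node ls shows in your cluster:

    # Keep managers dedicated to Raft: mark each one as drained so the
    # scheduler never places service tasks on it.
    docker node update --availability drain manager1
    docker node update --availability drain manager2
    docker node update --availability drain manager3

    # Verify: the AVAILABILITY column should now read "Drain" for the managers.
    docker node ls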
I will, of course, run work on manager nodes today, for educational purposes only. If you're a hobbyist, maybe it's fine to run work on manager nodes; this is just a word of warning not to do it in general. You can prevent really bad things from happening if you keep your manager nodes for Raft only and not for other tasks.

Let's talk a bit about what Raft is doing under the covers: leader election and log replication, and we'll talk about leader election first. We know that we have a leader in our manager cluster. A manager can also be in a candidate state, meaning it's trying to be elected leader, or in a follower state, meaning it's just voting and taking instructions from the leader. There's sort of a fourth, unwritten state, which is offline; that's what we're trying to protect ourselves against, having a manager offline and then everything blows up in our faces.

I have a handy demo. I have a domain-name-hoarding addiction, so I did buy the domain consensus.group; if you go to demo.consensus.group we can look at the demo, and we can just talk through what's happening. Let me make this a little bigger and we'll talk about the leader election process. I'm going to speed this up just so I can save more room for live demos at the end. What just happened there is that we started a new election cycle; in this case s2 had just been elected leader and has the solid line around it. Around the other nodes you'll notice their timeouts, and what happens is this: if none of the nodes hear from the leader before they time out, that timeout prompts a new election cycle. In this case s3 was the first to time out, so it started a new election phase. We have one node offline; that node can't vote, but that's okay, because we have five nodes, which means quorum is three, and we have four votes available. The candidate votes for itself, and then s3 is allowed to be elected leader.

Raft does some pretty cool stuff, and one of the really cool things it handles, and why you shouldn't write your own consensus algorithm, is weird cases like this: an actual, true split vote. In this case s1 and s5 timed out at exactly the same time. This is really, really unlikely, because these timeouts are randomized, but it can happen, somewhere in the universe of possibilities. Keeping in mind we have five manager nodes and quorum is three, we could end up in a bad state, because each of these candidates is trying to get three votes and we only have four nodes online. And that's exactly what happened: neither of them got the three votes necessary. Instead of trying to repair that election cycle, Raft just says: whichever node times out next without hearing that a new leader has been elected becomes the new candidate. And so we ended up in the fifth election term.
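Incidentally, if you want to see these roles on a real cluster rather than in the animation, docker node ls shows them in the MANAGER STATUS column; the output looks something like this (the IDs and hostnames here are made up for illustration):

    $ docker node ls
    ID             HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
    abc123def *    node1      Ready    Active         Leader
    ghi456jkl      node2      Ready    Active         Reachable
    mno789pqr      node3      Ready    Active         Reachable

The current leader shows up as "Leader", the other healthy managers as "Reachable", and an unreachable manager as "Unreachable".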
So that's leader election; it's actually pretty simple to understand, relatively speaking. It's the logging part that's a bit more challenging. Before we go through the logging demo, I want to make something really, really clear. Number one: the log is the source of truth for your application. If you imagine a series of scribes distributed across many ancient cities, each scribe has stone tablets with the truth written on them; that's what the log is in the case of your distributed system. I am NOT talking about things that you read in Papertrail; I am NOT talking about stack traces or application output. I'm talking about an append-only, time-based record of data. In this case, imagine we have a log of the value of x: we start on the left-hand side with the first entry setting x, then the next entries update it, and so on as we append. It's time-based, it's append-only, and it is the truth; it's the state of your application. Distributed, replicated logs are how your distributed system knows about its state. This log is for computers; it is not for humans. Again, this is not what you'd read in Papertrail or get from log aggregation of the services running in your distributed system. This is the record that is the truth: any new node that has this log should end up in the same state. It's for computers, not humans, but I am a human with a computer, so we will do some fun demos anyway.

The thing about logs in distributed systems is that they're kind of hard. In a very simple system, with one client and one server, the client can say "hey server, x gets 12," the server says "cool," it gets appended to the log, and it's not a big deal. As you can maybe imagine, the idea of quorum becomes important when trying to replicate logs across a distributed group: all of the managers have to vote and agree that the next state transition is possible and valid, and vote to accept the change so it can be appended.

We can go back to my fancy little demo here. This is five manager nodes, the same ones we had before, and we'll just go through log replication. You can see that s1, the current leader, has some uncommitted log entries, which is why there are dashes around them; it's trying to write them into stone, so to speak, trying to commit them. But we have three offline nodes, which means we do not have quorum, which means that cannot happen; we can't perform that operation without quorum, so the system will just do what it can. Once a new node comes online, you can see (maybe it's a little hard to see in the back) that those entries in the leader are now being committed: they're being voted on and accepted as truth, and then we can proceed until they're all committed.

What happens, though, if we have uncommitted log entries and then the leader times out and we get an election, and now we're in a new election term? The long story short is that logs from a more recent election term take precedence. We're in term three now, and when s1 comes back online it will just listen to whatever is there from the more recent term. Of course these are some pretty simple, convenient examples that are nice to visualize; there's a lot more complexity that's worth its own two-hour talk, or its own advanced orchestration workshop, perhaps at Velocity in June; if you're there, you can learn even more about all of this. But please understand log replication. I could go on for two hours about this. There's a really good blog post from the LinkedIn engineering team (I made a convenient bit.ly link to it) that talks about what logs are, where they come from, and what they're used for. Please do yourself a favor, if you're administering a cluster, and read that post; you can thank me later on Twitter.

Great. So, one cool debugging tip. We talked about these logs, and I said they're not for humans, but I'm a human with a computer, so let's see what's actually going on, because I think it's really useful to look at what's happening in the file system. There are two ways you can see what's going on in your cluster: one is monitoring the log files with something like inotifywait, and the other is reading them directly. Let's start with the monitoring.
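A rough sketch of that watch setup; in the demo I use a little convenience image I built, but any image with inotify-tools in it would do, for example:

    # Watch swarm's Raft state directory for changes. Run this on a manager.
    # The bind mount is read-only; we only want to observe, not touch.
    docker run --rm -it \
      -v /var/lib/docker/swarm:/swarm:ro \
      alpine \
      sh -c "apk add --no-cache inotify-tools && inotifywait -m -r /swarm"

The -m flag keeps inotifywait running in monitor mode, and -r watches the directory recursively, so every write to the Raft WAL and the other state files shows up as an event.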
So I have a cluster here; docker node ls shows three nodes: node1, which is the leader, and two other nodes just chilling. I'm going to do my work on node1, and on node2 I'm just going to monitor the swarm state in the file system; it lives in /var/lib/docker/swarm. I made a little image for this, because a docker run command is easier for me to remember than the inotifywait incantation, so let's bind-mount /var/lib/docker/swarm into this container and have it watch that directory. Cool, so we have a watch set up. I'll go back onto node1 in my cluster and run that fancy example voting app we're all very familiar with from the various keynotes. I'm going to run it with docker stack deploy, because there's a stack file in it, and we'll call the stack "vote". So it creates a bunch of networks, creates a bunch of services, and my plan worked: we can actually see action going on in /var/lib/docker/swarm. raft/wal, that's the write-ahead log, and we can see it being modified, we can see task state being modified; all this good stuff is happening, which is pretty sweet.

Cool, but what if you just want to look at what's actually in the log? I mentioned this is not for humans, and it's not, but there is a tool in SwarmKit that helps you dump the log out and look at what's going on, in case you need it for debugging purposes. I made a little backup directory, and then I ran a Docker container with the golang image and mounted /go/bin out of the container to get the utilities that live in docker/swarmkit's cmd directory; I loaded those in there already for demo purposes. For safety reasons, let's take a copy of the swarm directory rather than working on the live one: sudo cp -a /var/lib/docker/swarm and call it swarm-copy. Now I have swarm-copy, which is a copy of the state we were just monitoring. We have to do a bit of fancy work with sudo, which is always good during a live demo, so that I can access the copy. Then we can use this swarm-rafttool, point it at swarm-copy, and use the utility called dump-wal. And this is our Raft log: a human-readable version of what's happening in our cluster, and you can see there's lots of stuff going on in there. If we look a bit, we can see all the events, the record from our voting app; they're all there. We can see the entry index, we can see the election terms, we can see a lot of stuff. I'm not going to claim this is a very practical thing to do on a day-to-day basis, but it can be done, and the utility lives in SwarmKit itself, so it's a nice way to take a deeper look into what's actually going on in your cluster.
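Pieced together, that dump looks roughly like this; it's a sketch, and the exact swarm-rafttool flags may differ from what I show here, so check its --help:

    # Build the SwarmKit utilities into a local ./bin directory.
    docker run --rm -v "$PWD/bin":/go/bin golang:1.8 \
      go get github.com/docker/swarmkit/cmd/swarm-rafttool

    # Always work on a copy of the state dir, never the live one.
    sudo cp -a /var/lib/docker/swarm ./swarm-copy
    sudo chown -R "$USER" ./swarm-copy

    # Dump the Raft write-ahead log in (more or less) human-readable form.
    # Flag name is from memory; verify with ./bin/swarm-rafttool --help.
    ./bin/swarm-rafttool dump-wal --state-dir ./swarm-copy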
Cool, so that's the swarm log. Let's talk a bit about scheduling. Scheduling is a bit tricky, because the scheduling algorithms you're taught in CS classes are kind of for packing boxes in Amazon warehouses, not really for running highly available applications. There's just a subset of problems where those two things, HA applications and scheduling algorithms, overlap, and those are the problems orchestrators have to deal with.

As I think Diogo mentioned this morning in the opening session, there are scheduling constraints. He used the example of scheduling work that really needs secure nodes onto the subset of nodes that has a specific label. You can do that with --constraint; you can use labels and a bunch of other node attributes, and it's really well documented. A constraint means that work can only run on a node that satisfies it. Brand new in Docker 17.04 is topology-aware scheduling. This is not a constraint but a preference: it implements a spread strategy, trying to spread the work out evenly over the nodes that belong to each value of some category. Unlike a constraint, this is a soft preference: if it can't be satisfied, the scheduler will still try to run your work somewhere else. It's not as hard as a constraint, and you won't get into a state where work that should be allowed to run sits unscheduled just because the preference can't be met. It's a soft preference, and you have to know the difference when you're using it.

We can take a look at that really quickly; the commands are pieced together below. I'm going to docker stack rm the vote stack, and on consensus.group I have the Docker Swarm visualizer, in case you want a more UI-focused picture of what's happening. I'm going to add some labels to my nodes using docker node update --label-add, which is really kind of counterintuitive word order from an English-speaking perspective. I'm going to add the label "datacenter": node1 and node2 go in the same data center, with the value a, and node3 goes in another data center, with the value z. So now we have two nodes in one data center and another node in another data center. When I do a docker service create, let's just run some Redis, say 12 replicas to start with, and I'm going to start them with a placement preference, and that preference is going to be spread over node.labels.datacenter. That says: for every value of that label (I have two values), spread the work evenly. So we can expect six tasks to be scheduled in a and six in z; we have one node in z and two in a, so we'll see what the scheduler does. Let's take a peek over here: for the value a it took those tasks and split them up over the two nodes, and for the value z, since there's only one node, it scheduled all six there. This is a really good way to prevent a single data center from being a single point of failure; if one goes down, you can make sure your highly available apps stay highly available.
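For reference, the commands from that demo look roughly like this; the node names, label values, and the Redis tag are just my example choices:

    # Label two nodes as datacenter "a" and one as datacenter "z".
    docker node update --label-add datacenter=a node1
    docker node update --label-add datacenter=a node2
    docker node update --label-add datacenter=z node3

    # Spread 12 replicas evenly across the values of that label:
    # 6 tasks land on datacenter=a (split over node1/node2), 6 on datacenter=z.
    docker service create --name redis --replicas 12 \
      --placement-pref 'spread=node.labels.datacenter' \
      redis:3.2

    # A hard constraint, by contrast, pins work to matching nodes only:
    # docker service create --name secure-thing \
    #   --constraint 'node.labels.datacenter==a' redis:3.2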
One super cool point I want to make: if you're running stuff and one of your nodes goes offline, the tasks on it are of course going to be rescheduled. I have a pretty fancy chaos monkey script, so let's disconnect one of the nodes; maybe node2 can be disconnected. We'll drop all network traffic to node2, because what could go wrong? This might take a little bit for the visualizer to pick up the change. Okay, cool, so now we can see it split the work back up: I had a placement preference on data center a, one of the nodes in data center a went down, so the scheduler rescheduled that work onto the other node that's available in data center a.

One thing that is maybe a bit counterintuitive is what happens when that node comes back online. There's really no point in killing healthy tasks just to move them, so swarm doesn't do it. Right now you might expect swarm to rebalance onto that node, but it's not going to; everything just stays where it is unless containers go down, and then they might be rescheduled onto node2. This is counterintuitive; you might expect some rebalancing, but it will not happen automatically. What you can do is a manual rebalance: you can say docker service scale redis=20 to add some tasks and then take it back down to 12. It's a manual workaround, not ideal, but if rebalancing is really important to you, that's what you have to do; it won't happen automatically for you.

A cool debugging tip for scheduling: add a manager node with availability drain and run that engine in debug mode. That will help if you're confused about some of the scheduling decisions being made. Yes, adding another manager to your cluster will change your quorum math, but chances are, if you have to run a manager in debug mode with availability drain, you probably have a bigger problem, so I think it's fine for this one case.

Cool, so we've covered a really easy kind of failure recovery: what happens when a single node goes down, and the tasks just get rescheduled onto the other nodes that can accommodate them. But what happens when more disastrous things happen? Let's talk about losing quorum. Again, you simply can't do work on your cluster if you've lost quorum. The most obvious fix is to bring back the nodes that are down; duh, right? But that's probably not possible, or you'd have done it already. If you've lost quorum and you have access to a healthy manager, you can run docker swarm init --force-new-cluster. That recreates the cluster with one manager: the healthy manager you're on. You then need to promote other managers to get back to the group size you want, say five. So you've forced a new cluster with only one manager, but it's a pretty okay, pretty soft failure recovery; nothing too disastrous has happened.
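In command form, that recovery is roughly this, run on a surviving healthy manager (the node names and address are placeholders):

    # On the last healthy manager: rebuild a single-manager cluster that
    # keeps all existing services, networks, and tasks.
    docker swarm init --force-new-cluster --advertise-addr <this-node-ip>

    # Then grow the consensus group back to the size you want,
    # one promotion at a time, keeping the count odd.
    docker node promote node2
    docker node promote node3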
But what happens if the data center is just totally on fire? You can restore your whole cluster from a backup. I can't say this is recommended, and it really shouldn't ever be necessary, but this is Black Belt, so I'm going to do things that are not recommended and that really shouldn't happen. This is in the docs, and I want to talk through a couple of pieces of complexity around it.

Basically the process is this. You have a backup: /var/lib/docker/swarm backed up somewhere. You bring a new node online and stop Docker on it. It is possible to do a hot backup and a hot restore, but for your own sake, just don't; don't have Docker changing state while you're trying to recover from a backup. The same goes for when you take the backup: just stop Docker. Make sure /var/lib/docker/swarm is totally cleaned out; you can do a sudo rm -rf, which is my favorite way of ensuring nothing is there. Copy your backup into /var/lib/docker/swarm, start Docker, and then do a docker swarm init, just like when you first bootstrapped the swarm, and you can use --force-new-cluster. I have that in parentheses because just yesterday, or maybe the day before, I found what is arguably a bug, but arguably also a feature, in this recovery process. If you have a backup, the metadata for your nodes includes the IP address of the leader. In general you should not be monkeying around with the IP addresses of your nodes; in fact it's expressly forbidden in the context of swarm. When you restore from a backup, you're giving your cluster that old state and that old IP address, which means that when you try to generate a join token and join the swarm, it's not going to work, because the swarm leader is advertising itself on a different IP than the one in the restored state. This is arguably a feature to protect you from disasters, but in this one particular case of restoring from a backup it's less than ideal. One workaround is to use an elastic IP and just reassign that IP to the new node that you bring up, but you have to know that beforehand, which is why I'm here to tell you. It's not ideal, and I just haven't gotten around to opening the issue on Docker yet, because I want to do a bit more documentation first (I'm a good issue opener), and I'm not sure how it's going to shake out. So I'll show you the workaround, and I'll show you what might happen if you try to restore from a backup right now exactly as it's written in the docs.

I'm going to do that now; this is my last trick. Let's say we want 20 of these Redis containers. I'm going to use S3 for my backup, because it's all just ones and zeros; you could have this in a volume, you could do any number of things. I'm going to copy it up there; bear with me as I type, take a drink of water, do something else, don't check your email, I promise it won't take too long.
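While that copy runs: the restore recipe from the docs boils down to something like this on the replacement node (a sketch; the backup path is illustrative, and remember the IP address caveat above):

    # 1. Make sure the engine isn't mutating swarm state while we work.
    sudo systemctl stop docker

    # 2. Clear out any existing swarm state, then drop the backup in place.
    sudo rm -rf /var/lib/docker/swarm
    sudo cp -a /path/to/backup/swarm /var/lib/docker/swarm   # path is illustrative

    # 3. Start the engine and re-form the cluster around this node.
    sudo systemctl start docker
    docker swarm init --force-new-cluster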
I'm going to dump this into a swarm backups bucket; I think that should work. Cool, so pretty much everything we saw being monitored earlier, I'm just dumping somewhere else. This is not a talk on how to use S3, or how to use AWS properly, or even how to make a proper backup; the important thing is that I have the state somewhere else. Then I'm going to close out this connection, because I have to reassociate my floating IP with the new cluster. Again, this is not a talk on how to use the AWS CLI, so for those of you who maybe aren't familiar with AWS, I'm just going to use the UI and disassociate my floating IP address. When I started this cluster I was advertising on this elastic IP, and that's the most important detail I maybe left out: you have to not only assign the elastic IPs to your managers, you have to advertise using that elastic IP, because that's the piece of information that gets carried over to your new cluster. Hope this works. Okay, cool, I'm going to reassociate this IP to node1 in my new cluster and try to SSH into it. Of course I get a terrible host-key warning, which is fun; I'll fix that. Awesome, so now I can SSH into this other instance. I have nothing running on it, and in fact, if you go to consensus.group, you should see nothing running on the visualizer either; there's nothing happening.

As I said before, the most important thing is to stop the Docker service, and I'm just going to do that via systemctl. If I do docker info I should see no swarm state; great. Now I need to somehow get my offline backup onto this machine. First I'm going to rm -rf /var/lib/docker/swarm to make sure it's empty; let's check what's in there; okay, totally nothing's in there. Cool, and then sudo aws s3 cp... I promise this will all be worth it for the magical reveal at the end. "No such bucket," okay, cool, let me fix that. So I'm just copying bits from one place to the other, and now if we look here, we should have the swarm directory in place. Now we have to start the cluster, or sorry, we have to start Docker first. Again, it's so important not to have Docker running during this whole time, because you don't want two things trying to change the same state at the same time. Start the Docker service, give it a minute. docker service ls: nothing should be running right now.
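For reference, the S3 shuffle is roughly this; the bucket name is made up, and it assumes the AWS CLI is configured on both nodes:

    # On the old manager (Docker already stopped): push the swarm state to S3.
    sudo aws s3 cp /var/lib/docker/swarm s3://my-swarm-backups/swarm --recursive

    # On the replacement node (elastic IP moved over, Docker stopped,
    # /var/lib/docker/swarm cleaned out): pull the state back down.
    sudo aws s3 cp s3://my-swarm-backups/swarm /var/lib/docker/swarm --recursive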
And I should be able to say docker swarm init --force-new-cluster. I hope. In the interest of full transparency, I have done this demo probably ten or twelve times, and it only works maybe fifty percent of the time, because, again, it's not recommended. Okay, great, so now I have an error that I've lost quorum, which is not ideal; yeah, of course the node thinks it's not a manager, but I'm trying to force the cluster. Let's try this again, and if not, I'll resort to showing you the recording, which is very upsetting. Okay. Womp womp, this is real life. Let's fast-forward to the end of this and pretend. Cool, wow, amazingly, I just downloaded everything from S3, and look at it go. Let's do this, and then I'll check to see what I did wrong, because I probably did something wrong, which is fine. Sorry, maybe this is a little hard to see, but at this point I've restored the backup and I'm just going to start the Docker service again. And this works: Docker should be running, and in this case, because I have exactly the same IP, I didn't even have to force a new cluster; the swarm just came back up happily. If I go to the visualizer, everything I had running before is there.

But note the big caveat, and I want to make this really clear: number one, this is not recommended, and you should never be in a state where you have to do it; and number two, there is an issue right now with the IP address and being able to seamlessly transfer a file-system backup onto a new cluster. This is not a situation you should find yourself in in production; you should have something in place, like monitoring, so that someone gets paged before this ever happens. I just wanted to illustrate that it is technically possible: all the stuff running underneath your cluster is just a bunch of ones and zeros, and you can move data around and restore from a backup. Sometimes it works; sometimes, obviously, it doesn't work that great.

There are so many more interesting, very advanced orchestration topics that I didn't cover. I didn't talk about security; I didn't talk about running swarm at scale. I highly suggest that you seek out those talks; they're super, super interesting. And if this was way over your head, I highly encourage you to attend one of the advanced orchestration workshops that Jerome usually runs; they're quite good.

And that is all. Thank you all so much. Because we're out of time we won't do a formal group Q&A, but please come up if you have a question, or stop by the Codeship booth, and I'll be happy to answer anything you have. [MC] Since there's a break right after this, do come up and hang out here with questions. Thanks again! [Applause]
Info
Channel: Docker
Views: 14,308
Rating: 4.9840636 out of 5
Keywords: docker, containers, Black Belt, Algorithms, orchestration, platforms
Id: Qsv-q8WbIZY
Length: 41min 11sec (2471 seconds)
Published: Mon May 08 2017