Highly Available Kubernetes Clusters - Best Practices - Meaghan Kjelland & Karan Goel, Google

Captions
So for both of us it's our first KubeCon, and naturally our first KubeCon talk, so we're excited: big room, lots of people. Hopefully you'll take back all of the things we tell you and actually implement them in your organization.

I want to start off with a project that's not quite relevant to Kubernetes or KubeCon. This was a GitHub repo of Brad Fitzpatrick, another Googler, who was writing about his home lab and home internet setup; some of you might have seen this on Hacker News. His goal was: I want highly available internet access for my home lab, and I don't want any single points of failure. If you've built any labs or systems at home or in your organization, you might expect the redundant power supplies, redundant Wi-Fi switches, network access points, and so on. But I also saw he had two ISPs. A lot of us don't even have two ISPs to choose from, but when I thought about it, it makes sense: if there is to be no single point of failure, it makes sense to hedge yourself against a rat chewing through your fiber by going with broadband as a secondary.

So what does this have to do with Kubernetes? We're at KubeCon, not LabCon, after all. Kubernetes, as some of you might know, is a distributed container orchestration system. It's great, and it's based on Google's experience of running billions of containers every single week at planet scale. But really, planet scale doesn't mean anything if you don't have redundancy and high availability. If one of the components of your system goes down and that causes all of the other systems to be unavailable, it doesn't matter how much you scale; it's not good for you.

Let me get a feel for the room here: how many people use Kubernetes right now? One or two people, OK. And how many people have actually deployed a Kubernetes cluster, not using a managed service? Quite a few ops people. And how many have done HA cluster setups? Wow, a lot of people, great. And how many of you have been asked to raise your hand at a conference before? Cool.

So hi, my name is Karan Goel. I'm a software engineer at Google, here in Seattle, and I work on GKE On-Prem; some of you might know about that.

And I'm Meaghan Kjelland. I'm also a software engineer on the GKE On-Prem team at Google. Before I joined this team I was on a different team at Google that worked on the Cloud Foundry Container Runtime, so that's another Kubernetes project. I've been working in the Kubernetes space for about two years.

OK, so let's come back to this question, because I don't think we've quite answered it yet: what is high availability anyway? Basically, it means that you want to have a high SLO, or service level objective: there is a certain amount of availability that you promise. For example, GKE has an HA product called GKE regional clusters, and the control plane of those clusters is available 99.95% of the time. Great, now we know high availability means you want to be available a lot. That's not super helpful on its own, so what we want to talk about today is how we're going to achieve that: how do we make sure that our system has no failure points that could take down the whole system because they weren't redundant? If we bring it back to the home internet setup, that's exactly what Brad Fitzpatrick did for his own setup; we're going to do that for our clusters.

Before we get too far into that, I want to talk about multi-master, or multi-controller, nodes.
This means that you're running multiple copies of your control plane, so that if any of the copies fail you have other ones to take up the work. That's great, and it's a really important part of high availability, but in a lot of cases it's not enough. We want to look at every layer of our stack, including the control plane, networking, applications, and persistence, and make sure that we don't have single points of failure anywhere in the system.

What kind of failures are we talking about? There are so many things that can fail in our system, and we want to plan for all of those failures. We could have failures in our application: if we're running an application, we want to make sure we're running multiple copies of it, so that if one of them fails, we don't necessarily care why, as long as there are other copies available to take up the work. But if we were running all of those copies on the same VM, then we would care if that VM failed, so we need to make sure we're spreading our workloads across VMs as well. Similarly, if all of those VMs were running on the same physical machine, then we can't handle a failure of that physical machine either, so ideally we want to run across physical machines. And you can imagine there are other things that could fail too: if we're running three physical machines but they're all connected to the same power source, or they're all using the same cooling system, we can't handle failures in those either. We also want to make sure we handle network partitions: if a network switch fails in our lab and the nodes can't communicate with each other, we want to know what's going to happen in the system. And storage is a really critical one: if we lose storage, we need to make sure we don't lose the data, because that can be irreversible.

On GKE we're a little bit lucky, because we have this concept of regions and zones, and it encapsulates a lot of the things I just talked about. Regions are geographical areas, so they're very isolated from each other; if you're running across regions you can even handle natural disasters in one area. Zones are deployment areas within a region that are independent from each other in a lot of ways: if you're running across zones, you know you have different power sources, you know you have different cooling systems, and usually they're running in different areas of the datacenter as well. The reason I'm telling you this is because through most of this talk we're going to use zones as our failure domain, and I want you to remember that a zone already encapsulates a lot of these failures. But you can also replicate this in your own setup if you're running on-prem.

OK, so we've divided our talk into three sections. First we have applications: how do we run applications on Kubernetes in a highly available way? Then we'll talk about Kubernetes itself, the control plane: how do we run the controller manager, the API server, and the scheduler in a highly available way? And we split out etcd as a special part, because it's our data layer and we have a couple of extra considerations there.

We want to talk about application high availability first because it intuitively makes a little bit more sense: a lot of the features and functionality we'll talk about is packaged in Kubernetes and comes out of the box, so it'll be a little bit easier to understand.
We're going to assume that we're running this cluster with three replicas of our workload. The green boxes are VMs and the orange-yellow boxes are workloads: containers, pods. You can see that we're running our VMs across three different zones of a region, and like Meaghan said, these happen to be in different data centers, across different physical locations, but in the same region. So let's assume we're running these workloads and they're scheduled on three different VMs across three different zones. This concept of scheduling across different zones is not a native primitive in Kubernetes; I'll talk in a little bit about how to go about achieving it.

Let's say one of our zones, the one on the far right, goes down. We're lucky, because our control plane is not running there. In this case the scheduler will see: hey, I was supposed to run three replicas, but there are only two running, so let me schedule one more and bring the running count up to three, and it happens to land in zone D.

This is probably what you're most familiar with as a developer, as someone deploying a workload on Kubernetes: the concept of a Deployment or a ReplicaSet. You would run your pod as a Deployment and you would set the replicas, in our case, to three. The second, more important thing here is rolling updates, or update strategy in general. If you try to update your Deployment and your update strategy is not RollingUpdate, all of the replicas of your Deployment can be taken down at once, and you might see a disruption there. In the manifest we have on screen (sketched below), we're setting maxUnavailable to one, so this will be a rolling upgrade where one replica is taken down at a time and a new one is brought up. This is a great use of out-of-the-box functionality, because you don't need to write your own machinery: there is a replication controller watching for changes to your Deployment manifest, or updates to your pods themselves, and reconciling any errors that happen. Similar concepts translate to StatefulSets for stateful applications, where you get stable persistent storage and stable network identities; we'll talk about StatefulSets in a little more detail later.

The next thing is spreading across zones, and you might remember I said it is not intrinsic to Kubernetes. If you use a Kubernetes cloud provider, you get that feature out of the box, but if you don't use a cloud provider and you want to do it by hand, you need to do two things. The first is setting pod anti-affinity rules on your pods, like the one on screen (also sketched below), where we set the specific failure-domain key. The second is adding labels to each of your nodes, where the key is that failure-domain key and the value can be anything; in our case it would be us-central1-a, -b, -c, -d, and so on. Once you do that, the scheduler will spread your workloads across those zones. But again, if you deploy with a cloud provider, that will just happen for you.

The next thing I want to touch on is node upgrades, and other types of voluntary disruptions in general. If you want to do kernel upgrades on a machine, you will need to drain all of the pods there, do the maintenance, and then join the machine back into the cluster. But if you decide to drain the machine, all of the replicas and all of the pods running there will be drained at the same time, and it's possible that all of your pods were running there and all of those are gone, so your service is unavailable. To get over that, you can use pod disruption budgets, where you set your tolerance for downtime, and if that tolerance is not met by a drain command, the pod eviction API will just reject the call. This is useful when you have three pods running across three different machines and you start draining one machine at a time: you can set the disruption budget so that at most one pod is disrupted and at least two are available at a time (see the sketch below).
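The manifests themselves aren't in the captions, but a minimal sketch of the kind of Deployment described above might look like this; the name, labels, and image are hypothetical stand-ins, not from the talk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical name, for illustration
spec:
  replicas: 3             # three copies, so one failure leaves two serving
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate   # replace pods gradually instead of all at once
    rollingUpdate:
      maxUnavailable: 1   # take down at most one replica at a time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.15 # stand-in image
```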
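For the zone spreading, a sketch of the anti-affinity rule and the hand-applied node label, assuming the failure-domain.beta.kubernetes.io/zone key that was current around the time of this talk (newer clusters use topology.kubernetes.io/zone):

```yaml
# Added to the pod template of the Deployment above:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: failure-domain.beta.kubernetes.io/zone  # the failure-domain key
          labelSelector:
            matchLabels:
              app: my-app   # keep pods of this app out of the same zone
# Without a cloud provider, each node is labeled by hand, e.g.:
#   kubectl label node worker-0 failure-domain.beta.kubernetes.io/zone=us-central1-a
```

Using preferred rather than required matches what the demo later shows: the scheduler will double up pods in a zone when it has no other choice instead of leaving them pending.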
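And a disruption budget for that three-replica example could be sketched like this, again with the hypothetical my-app labels (policy/v1beta1 was the API group at the time; newer clusters use policy/v1):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # a drain that would leave fewer than 2 pods is rejected
  selector:
    matchLabels:
      app: my-app
```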
OK, let's talk about the control plane itself. Earlier, on this slide, Karan showed you what would happen if we had a failure in us-central1-b. But what if our failure had been in us-central1-f, which is where our control plane is running? In that case we don't have a control plane running anymore. You can see that the containers we were running in that zone are no longer rescheduled, because nothing is there to reschedule them. Anything running in the healthy zones is going to continue running, and if you had a load balancer in front of those services, it will notice that one of the nodes is not there anymore and stop sending traffic to it, so that should be OK. But we might see some intermittent issues in the cluster while this is happening.

The solution, of course, is to also run our control plane nodes across zones: we're going to run three VMs, one in each zone, and then run a replica of all the control plane components on each of those VMs. Let's look at what's actually happening on those VMs. We have the API server, which talks to etcd (again, that's the next section), and then we have the scheduler and the controller manager. The API server is really easy to run multiple replicas of, because it is completely stateless. We can run it active-active, so all of them can take the same requests; all we have to do is put a load balancer in front of the API servers and we should be fine. The scheduler and controller manager are a little bit more complicated, because they have to read data, act on it, and write data, so we want to make sure we're only running one of them actively at a time. What we do is have a locking system: they'll all attempt to acquire the lock, one of them will get it and become the active component, and the other two will just wait for that one to fail, so that if it does, they can take over by acquiring the lock. To configure this, there are five flags on these components. For leader-elect, all you have to do is set it to true and all of this will work; the other flags let you configure how long it will take for one of the passive components to take over if the leader fails (sketched below).
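As a sketch of those settings, here is roughly how the leader election flags might appear on a kube-scheduler static pod; the same flags exist on kube-controller-manager, and the image tag and paths here are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: k8s.gcr.io/kube-scheduler:v1.13.0   # illustrative version
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true                 # turn on the locking behavior
    - --leader-elect-lease-duration=15s   # how long a held lock stays valid
    - --leader-elect-renew-deadline=10s   # the leader must renew before this expires
    - --leader-elect-retry-period=2s      # how often standbys retry for the lock
    # (the fifth flag, --leader-elect-resource-lock, picks the lock object type)
```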
OK, so we've talked about the multi-master setup now, but we haven't really talked about a lot of other things that are important in HA. If one of our control plane VMs fails, like in a zone outage, nothing is health checking to make sure we always have three available, and we don't have anything that will handle failure recovery in that case. We also haven't talked about how to upgrade without downtime. There are actually a lot of solutions here, and a lot of options you can explore depending on what you need. Of course there are hosted solutions: GKE regional clusters will do this for you already, or GKE On-Prem, which is what we work on. You can use managed instance groups to make sure you have the correct number of machines available at all times; that's a GCE feature. You can also build your own monitoring server, but that would be really hard (we would know). And then there's Kubernetes itself.

There are two different ways you can use Kubernetes to do this for you, or two schools of thought, I guess, and like we just saw in the application section of the talk, Kubernetes helps us in a lot of ways to manage these things for us. The first way is called self-hosting: you run Kubernetes and it manages itself as pods in the same cluster. This is going to have some interesting bootstrapping issues, and certain failures can be hard to recover from. Has anyone tried using self-hosted clusters before? OK, really, two people this time. Did anyone like it? No hands were raised, OK. Well, if you want to try this out, kubeadm has it as an experimental feature, so you can try deploying a cluster this way.

The idea that has gained a little bit more traction recently, though, is the management cluster. You have two clusters now: cluster A is unmanaged, and it manages a bunch of clusters, in this case just one, cluster B. Cluster A's control plane manages cluster B's control plane, which runs as pods, so cluster A's worker node is the same as cluster B's controller node, and cluster B can just manage its own workloads. But like I said, cluster A is unmanaged itself; if that fails, the control plane of cluster B will continue running, but you'll need a good recovery story for it.

GKE's solution is regional clusters, like I've said before. Regional clusters will automatically run the master nodes across three zones, run three replicas of the control plane, and put a global load balancer in front of the API servers for your cluster. One of the benefits of GKE regional clusters is that Google's SREs are managing your compute, networking, and storage resources for you. There's also the Cluster API, which does machine and cluster management with Kubernetes-style API objects. We work on this, and some of our team at Google works on it as well; if you're interested in learning more, there's a Cluster API working group in SIG Cluster Lifecycle. Thank you.

So next we want to talk about our data plane, which in Kubernetes' case is etcd. etcd is a little bit of a snowflake here, because it is backed by a database, and for writes to that database to succeed in a distributed environment, there needs to be a strict majority to elect a leader. What I want to talk about is the reason that, at least in GKE, we decided to go with three masters instead of, say, two or four, or even higher than that. etcd's basis is a distributed consensus algorithm called Raft; I won't go into details, there are papers about that. The equation that governs how etcd operates, or Raft in general, is the simple ⌊n/2⌋ + 1 equation. What this tells us is that for any action to take place, there needs to be a very strict majority of at least 51% of the members. Very democratic. In etcd's case, that means at least 51% of the members of an etcd cluster need to reach a quorum on the leader.
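Written out (in notation the speakers don't put on screen), the rule and the fault tolerance it implies are:

$$
\text{quorum}(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
\text{tolerance}(n) = n - \text{quorum}(n)
$$

For n = 2 that gives a quorum of 2 and a tolerance of 0, which is exactly the failure mode walked through next; for n = 3 it gives a quorum of 2 and a tolerance of 1.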
So let's see what happens if our etcd cluster were only two replicas instead of three, and let's assume we're doing an upgrade of our cluster and the rightmost master is down for that maintenance. Well, in that case, one of our masters is down, which means only one out of two members, 50%, is able to reach a quorum, but we need a strict 51% majority, so writes to our database will not succeed. So we can understand why one is not good: if one goes down, you have nothing. And we can see why two is not good: one of them goes down, and we have no tolerance after that. But why three? Why not four? Because, you know, if four people were attending this talk, that's better than three people.

I want to go back to that equation again. Here's a table of potential cluster sizes, the number of members we need to reach a majority, and the failure tolerance, or fault tolerance, that provides us (reconstructed below). Rows 1 and 2 we've talked about, so those are intuitive, and I've highlighted rows 3 and 4. What you can see is that for a cluster size of three, two out of the three members need to be in consensus; that's 66% of the members. But when we jump to four, three out of those four are needed to reach a quorum; that's 75%. So by increasing our cluster size from three to four, we've actually decreased our probability of reaching a consensus, and yet, in the last column, you can see that our fault tolerance is exactly the same: we can still only tolerate one machine failure. From the same table you can also notice that odd-numbered cluster sizes, beyond a cluster size of two, are strictly better than even-numbered ones, both in (a) the odds of reaching quorum and (b) the failure tolerance they provide. So you could say odd numbers give us better odds of having a highly available cluster. Thank you.

So why didn't we choose five, then, if five is better than three? Well, it comes down to decisions like: how expensive is it going to be to run five machines versus three, and is that cost justified by, say, the revenue or the business value? And second, even if you scale up to five masters and theoretically etcd can give you, say, five nines of availability, your downstream systems still might not be able to provide that much availability, so your upper limit is still the lowest common denominator of your downstream systems. Three seemed to us like the best trade-off between all of these considerations, but it's really an exercise for you to evaluate what your business needs are, what your business goals are, and how much availability you need.

So let's see what happens in the case of three replicas, now that we've established that it's a good first step. Say one of our nodes goes down; again, one of our masters is down for upgrades. In that case we still have two available replicas, and those two can still reach a quorum, because we have a strict majority, so writes to our database will succeed. Do note that if, during this upgrade, one more of the replicas goes down, so that only one of our masters is up, our writes will not succeed, since we no longer have a strict majority. So if you are upgrading a three-master cluster, you do not get a highly available cluster during the upgrade itself. If you want that, go to the next odd replica size, five: now you can tolerate two machine failures and still be highly available. This is what etcd recommends for a lot of production cluster sizes, but again, it depends on your business needs.

So how would we go about configuring this? There is documentation out there, so I won't spend too much time, but there are two steps. One is you tell etcd where all the peers are; this can be IP addresses or DNS service names. And the second is you start the kube-apiserver with those etcd instances. There is a linked example manifest on screen, so you can go back and look at that (a rough sketch also follows below).
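The table itself isn't in the captions, but it follows directly from ⌊n/2⌋ + 1; reconstructed:

```
cluster size   majority needed   fault tolerance
     1             1  (100%)           0
     2             2  (100%)           0
     3             2   (66%)           1
     4             3   (75%)           1
     5             3   (60%)           2
```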
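And for those two configuration steps, a minimal sketch of the relevant flags, assuming three hypothetical masters at 10.0.1.10 through 10.0.1.12 (the real manifest is the one linked from the slides):

```yaml
# Step 1: each etcd member is told about all the peers.
# Command args for the first member, as they might appear in its static pod:
- etcd
- --name=etcd-0
- --initial-advertise-peer-urls=https://10.0.1.10:2380
- --listen-peer-urls=https://10.0.1.10:2380
- --listen-client-urls=https://10.0.1.10:2379
- --advertise-client-urls=https://10.0.1.10:2379
- --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380
- --initial-cluster-state=new

# Step 2: every kube-apiserver is pointed at all of the etcd members:
- kube-apiserver
- --etcd-servers=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379
```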
And because etcd is backed by a data store, we do recommend preparing for any downtime that you might have because of disk failures or data corruption, and the biggest one is backup and restore. We recommend having periodic backups, monitoring that those backups are happening, and alerting when they are not, especially for production systems. And if you run production systems where you are taking backups, you definitely have to randomly test restoring from your backups: if you lose your data due to some mishap and you are not able to restore, or people have not been trained on how to restore, your backup is useless. There is a linked documentation page on how to do backup and restore, and also how to recover when your cluster cannot reach quorum.

Some tool recommendations: Heptio has Ark, which some people might have used before; it's great. It's a server-client setup where the server lives on your cluster, and the client makes API calls to iterate through all of the objects and save them in a data store; you can recover them by just posting them back to the API server. Do note that this requires the API server, so if your API server itself is down because something happened during an upgrade, restoring might not be that easy. There is also the etcd operator, which, again, is worth looking into. And etcd's own CLI, etcdctl, has a snapshot function which you can run periodically in something like a cron job.
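A sketch of that cron job idea, assuming etcd is reachable on the master's localhost and that client certs live under /etc/etcd; the schedule, image, and paths are illustrative assumptions, and a real setup would also need a node selector or tolerations to land on a master:

```yaml
apiVersion: batch/v1beta1            # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on the master's localhost
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: k8s.gcr.io/etcd:3.2.24   # match your etcd version
            command:
            - /bin/sh
            - -c
            - >
              ETCDCTL_API=3 etcdctl
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/etcd/ca.pem
              --cert=/etc/etcd/etcd.pem
              --key=/etc/etcd/etcd-key.pem
              snapshot save /backup/etcd-$(date +%Y%m%d%H%M).db
            volumeMounts:
            - {name: backup, mountPath: /backup}
            - {name: certs, mountPath: /etc/etcd, readOnly: true}
          volumes:
          - name: backup
            hostPath: {path: /var/etcd-backup}
          - name: certs
            hostPath: {path: /etc/etcd}
```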
So, for the demo, what we're going to do is deploy an application across zones on a cross-zone, multi-master cluster, and then simulate a zone failure. Our cluster looks like this: we have three VMs as our controller nodes and three VMs as our worker nodes. We deployed this cluster using Kubernetes the Hard Way, but we made it cross-zone, which is one of the only differences. Then we're going to deploy our application across it and see what happens when we delete the VMs in one of the zones, which simulates our zone failure.

OK, so first, on the left we're going to show what nodes we have running in our cluster: the left is kubectl get nodes, and the right is the Google Cloud console. We can see that the zone labels have been applied to all of these nodes. We had to apply them ourselves, because the way we deployed didn't use a cloud provider, but like we said earlier, if you use a cloud provider they are applied automatically. You can see worker-0 is in us-west1-a, and that matches over here. Then, in the cloud console, we'll look at the load balancer sitting in front of our controller nodes, and what we should see is three healthy controller nodes. So yes, we have three.

Now we're going to deploy our application. We're watching what pods we have running in the cluster; currently we don't have any. Then we'll create this simple nginx deployment and look at what's in it. First, we can see that the pods are all scheduled across the three different worker nodes, because they're in three different zones. And down here you can see we have this preferredDuringScheduling pod anti-affinity rule, which uses the failure-domain zone topology key. Since it's preferred, it will schedule two pods into the same zone if it has to; if we made it required, it wouldn't do that.

OK, now we can delete our zone. We can't actually delete a zone, because we would probably get fired, but... OK, on the bottom we're looking at the pods running in the cluster, and you can see which worker node each is running on. On the top we're going to look at our nodes, make sure they're all in a good Ready state, and see which zone they're running in again. You can see right here they're all Ready right now.

OK, now the scary part, although this is a video, so it's a little bit less scary. We're going to delete controller-0 and worker-0, which are both running in us-west1-a. Once we delete them, the first thing we'll see over here is a slight downtime, but we should see it automatically recover. Let's see, it should happen pretty fast. Yeah, so we can see our calls through kubectl are failing, but then they come back pretty quickly. The reason is that the load balancer sitting in front of the three API servers has a health check on it, and it takes about two seconds to recognize that one of them is failing, which I configured myself; you can configure the amount of time it takes to recover.

So while they're deleting, what we'll see first, I believe, is that worker-0 will become NotReady. We also changed the pod eviction timeout when we deployed the cluster, to make it evict pods quicker when it realizes one of the nodes is not ready. So now our nodes are deleted, and we can see this one is NotReady, and down here we'll see a new pod scheduled on worker-2 really quickly, because worker-0 is no longer ready. Now we can look at the load balancer again, to make sure we're not actually sending traffic to controller-0 anymore, and we can see what that looks like: yeah, controller-0 is now a yellow exclamation point, because it's not healthy.

Cool, that's all we have. We've linked some additional resources here: one is for regional clusters, which we talked about already; we also linked the Kubernetes high availability docs, Kubernetes the Hard Way if you want to try the demo yourself, and the Cluster API SIG. And we'll be at the Google Cloud booth downstairs in the exhibition hall, so if you have any questions, I hope you come find us there. Thank you.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 17,578
Rating: 4.8721461 out of 5
Id: NpT9RraqKdY
Length: 29min 8sec (1748 seconds)
Published: Sat Dec 15 2018