Keynote: How Spotify Accidentally Deleted All its Kube Clusters with No User Impact - David Xia

Captions
Hello, good morning, and welcome to day two of KubeCon. I'm really excited to be here in beautiful Barcelona, and I'm grateful to the CNCF for giving me this chance to tell you how we at Spotify, like Brian said, accidentally deleted our production Kubernetes clusters, twice, but with little to no end user impact. Actually, I'm curious: raise your hand if you have accidentally deleted a production cluster. I just want to get a sense. Cool, look at that elite club out there. Yes, I see you.

Briefly, about myself and Spotify: I'm an infrastructure engineer. Spotify is a music streaming company with over 100 million subscribers and over 200 million monthly active users, and inside the company we have over a thousand developers continuously deploying code to over 10,000 virtual machines.

Here is a little more context on Spotify's compute environment, which you'll need to understand the stories I'm going to tell you. Spotify uses Google Cloud Platform, and for the purposes of this talk we have two types of engineering teams at Spotify: teams that build infrastructure, and teams that use this infrastructure to build the features you see when you open the app. We'll call the former infrastructure teams and the latter feature teams. My team builds infrastructure related to Google Kubernetes Engine, so again for the purposes of this talk we'll call my team the cluster operators and the feature teams the cluster users. Currently at Spotify we're helping all stateless backend services migrate over to GKE, but at the time of these incidents we were only running a subset of services, and only partially each, on GKE; I'll tell you more about what that means later. We had three production clusters at the time of these stories, one in each of three regions: US, Europe, and Asia. And we were backing up each of them once every hour.

Story time. I love stories, and I'm going to tell you two of them. The first is about how I accidentally deleted one of these clusters, and the other is about how my co-workers, while trying to prevent that from ever happening again, accidentally deleted another one. And it gets better, because in the process of trying to recreate that cluster in Asia, yet another cluster went away. So how exactly did this happen, and how did we prevent any end user impact, so that people could keep listening to the music they loved?

It was November 2018, and I wanted to test a GKE feature, so I created a new GKE test cluster in a test project. I wanted it to have the same configuration as our production clusters, so I opened two browser tabs, one for the production cluster and one for the test cluster. I finished my test, I went to delete the test cluster, and I had the wrong tab open. I was freaking out, and on Slack I remember asking my colleague, how do you stop this, is there any way to stop this? I had deleted a 50-node production cluster running dozens of workloads. With GKE it's actually really easy to create a cluster, which is what's awesome about it: you click one button and you get everything, masters, nodes, networking all set up, authorization. Unfortunately, with GKE it's also really easy to delete an entire cluster: with one click of a button everything goes away, masters, nodes, authorization, networking. Is there a way to stop it? I learned the hard way that no, there's just no way to stop it. You wait for it to clean up all of the compute instances, you hold your breath and hope nothing too bad happened, and you try to recreate it.

Let's talk about how we recreated this cluster. It took us three hours and fifteen minutes to restore it and fix all of our integrations with it. The restoration took this long because we found bugs in our cluster creation scripts that we hadn't exercised much, our documentation was incomplete and sometimes just had mistakes in it, and our cluster creation process wasn't resumable, so sometimes it would fail halfway and we had to restart from the very top instead of from the middle. That wasted a lot of time.

So we thought about how, given that human mistakes happen, we were going to prevent this from happening again, and we decided to use an open-source tool for putting your infrastructure in code. This was Terraform, just one of the many tools out there. It basically allows you to codify your infrastructure: you can write down your clusters declaratively, you can version it, and you can review changes.
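As a rough illustration of what writing down a cluster declaratively can look like (this is my own minimal sketch with placeholder names, not Spotify's actual configuration), a GKE cluster in Terraform's Google provider is just another resource block, and the prevent_destroy lifecycle flag is one built-in guardrail that stops Terraform itself from ever planning to destroy that resource:

```hcl
# Minimal sketch of a declaratively defined GKE cluster.
# Cluster name and region are placeholders, not Spotify's real values.
resource "google_container_cluster" "asia" {
  name     = "prod-asia"     # hypothetical cluster name
  location = "asia-east1"    # hypothetical region

  initial_node_count = 3

  lifecycle {
    # Terraform refuses to produce a plan that would destroy this resource.
    prevent_destroy = true
  }
}
```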
Unfortunately, while trying to adopt Terraform, we accidentally deleted not one but two clusters. How did this happen? This is the second story, and I'll give you some context about how we configured our clusters and how Terraform works so you can follow the details.

We have a shared git repository where we define all of the cluster-level configuration and resources for every cluster, and this repository is used by both the cluster users and by my team, the cluster operators; it's all mixed together. When we started using Terraform, we realized there was a state file that Terraform writes to keep track of the state of your infrastructure, but we weren't very familiar with it. Terraform actually uses the state file to decide what it's going to do. First, my team, while trying to get Terraform to manage these clusters, created a pull request to import the Asia cluster into the state file. This state file affects what happens in production, but we ran review builds that actually modified the state file during the review build. Think about that: we modified a global state file that affects what happens in production, during a review build. We hadn't merged this PR yet.
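As an aside, and purely as my own sketch of the general pattern rather than what Spotify's repository actually contained: the state file usually lives in a shared remote backend, for example a GCS bucket, and the safer split is that review builds run only read-only commands like terraform plan, while state-mutating commands such as terraform apply or terraform import run only after merge.

```hcl
# Hypothetical remote state configuration; bucket and prefix are placeholders.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"  # shared state bucket (placeholder)
    prefix = "gke-clusters"             # state path for these clusters (placeholder)
  }
}
```

terraform plan refreshes its view of the world but does not persist changes to the state file, which is what makes it the natural command to run in a review build.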
Then, a few minutes later, a cluster user, someone who knows nothing about Terraform and shouldn't need to know anything about it, made a pull request to add three more users to an RBAC file, an existing role binding file, and asked us to review it. It looked very innocuous: just three extra users on an existing namespace. But what was important wasn't what was in the PR or the changed file, it was what was missing, and what was missing was the definition for the Asia cluster. These two pull requests were merged out of order, so that one was merged to master first. Can you guess what happens next? That's exactly what happened: Terraform looked at what had been merged and said, in effect, "I don't see the Asia cluster defined, this shouldn't be here, I'm going to get rid of it."
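To make that failure mode concrete, here is a hypothetical sketch of the kind of change involved. I'm assuming for illustration that the role binding was expressed through Terraform's Kubernetes provider; the talk doesn't say exactly how Spotify's file was written, and every name below is made up. The added users are harmless; the damage comes from the cluster resource that is absent from the merged configuration while still being tracked in the state file.

```hcl
# Hypothetical change: new users added to an existing role binding.
resource "kubernetes_role_binding" "feature_team" {
  metadata {
    name      = "feature-team"   # placeholder binding name
    namespace = "feature-team"   # placeholder namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = "feature-team"
  }

  subject {
    api_group = "rbac.authorization.k8s.io"
    kind      = "User"
    name      = "existing-user@example.com"
  }

  subject {
    api_group = "rbac.authorization.k8s.io"
    kind      = "User"
    name      = "new-user-1@example.com"  # one of the three new users
  }
}

# What matters is what is NOT here: no `resource "google_container_cluster" "asia"`
# block exists anywhere in the merged configuration, yet the Asia cluster had
# already been imported into the shared state file, so an apply plans its destruction.
```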
At this point we thought, okay, it can't get any worse, right? We still had that unmerged PR. But when you think it can't get any worse, it can always get worse; it's one of those phrases like "I'll do the dishes later", you know it's not going to happen. We still had that unmerged PR from the previous slide that declared the Asia cluster, so we thought, okay, we'll just merge that one in and it will recreate the cluster. But our cluster creation script actually failed halfway, because it didn't have the right permissions. Before this, each of us had just been running a few Terraform commands locally on our work laptops with our personal GCP accounts, and those personal accounts were owners of the project, so we had all the permissions we needed. Now we were using a service account to run these Terraform commands, and it didn't have all the permissions, so we kind of just gave it all the permissions it needed. But these were different permissions than the ones Terraform had originally used to import the clusters, and when you call the Google APIs with different permissions, you get back different attributes. So now when Terraform got the list of clusters from GCP, they had different properties, because it could see more; the properties changed depending on the permissions. We had used one set of permissions to import the clusters and were now using another set to manage them, so the clusters looked different to Terraform, and Terraform essentially decided, "I don't know about this cluster, it must be something different." And that happened: at this point we had deleted two thirds of our production clusters from the face of the earth. This was pretty bad. Thankfully it didn't get any worse, we didn't delete all three this time, but it was still pretty bad.

Okay, let's talk about the impact, starting with the developer impact. One team had to create more non-Kubernetes VMs; they were running half and half, some capacity on Kubernetes and some on our existing plain compute, so they had to create more instances to handle the load. My team realized all the places we had hard-coded the old clusters' master IPs, so we had to go in and update all of those. And as a minor annoyance, everyone using kubectl had to refresh their cluster credentials, because the recreated clusters had new certificates. But otherwise our internal user support Slack channel was eerily quiet, and no one paged us. As for the end user impact: like the guy listening to Spotify on his jog, he kept running. We had no reports of either cluster deletion affecting our end users.

So first, a summary of what we did right that prevented any end user impact. From day one of this complex migration we planned for failure, because things happen, things are not always going to be reliable, and there are going to be lots of mistakes along the way. When we migrated large-scale, complex infrastructure, we did it gradually, and this is very important because it gave us time to build redundancy into our systems and gave us a chance to roll back in case anything went wrong during the migration. And last but not least, we have a culture of learning at Spotify: we try to figure out what we've learned and how to prevent the same type of mistake from happening in the future, so that we only make it once. I'll go into more detail on each of these now.

How did we plan for failure? While migrating, we told each team to migrate their services only partially to Kubernetes: don't put 100% of it on, do half and half, and gradually ramp up. We were still building confidence in our ability to manage these clusters, to scale them up, and to test out all of our custom integrations.

The second thing that saved us was how we registered the services running on Kubernetes. The result of these two things was that the failover to non-Kubernetes instances happened as we expected when the clusters were deleted. And let me be clear: these were not accidents, they were deliberate choices we made to make our infrastructure more reliable and our migration process reversible. I'll go into a little more detail on each of them.

For the partial migration at a per-service level: Kubernetes usage at Spotify at the time was marked as beta, which means we recommended teams migrate some, but not all, of each of their services' capacity to Kubernetes, while we kept working on integrations, reliability, and managing multiple clusters at once. And we actually registered our services on Kubernetes in a different, non-Kubernetes way, for interoperability reasons. Our internal service discovery uses pod IPs; it doesn't use the Kubernetes Service abstraction at all. It polls the Services' endpoints and updates its internal state with the pod IPs. When it noticed that it couldn't reach a cluster, our team was paged, and we would remove that cluster from the configuration so the service discovery mechanism would stop polling the non-existent cluster. Then we failed over to the non-Kubernetes instances: we would restart our internal service discovery system, the pod IPs would be removed, downstream clients would gradually refresh their lists of backends, the Kubernetes pod IPs would drop out, and traffic would go to the legacy instances that weren't on Kubernetes.

Now I want to describe some of the best practices we followed when migrating complex infrastructure, to make sure we did it gradually and always had a chance to reverse it if we needed to. The number one thing to do, and a huge part of what saved us, was that we backed up these clusters from day one. We codified our infrastructure, or started to after the first cluster was deleted. We started performing disaster recovery tests, and we made sure we practiced these drills and scenarios and actually ran through them. Our cluster backups were essential, and we had already tested restoring from them, because if you have backups but you've never tested restoring from them, I'm sorry, you don't have backups; you've just never tested them.

On codifying your infrastructure: we were introducing Terraform gradually, and it was better than what we had, even though there were some hiccups along the way. Terraform and tools like it help us standardize our workflow and the change management of infrastructure code. We added linters and validators to make sure the configs we were writing were correct. We also added things like posting the output of the dry run as a comment on the pull request, so it was very obvious what the change was going to do. We use GitHub, so we can have status checks that are required to pass before merging. One of these status checks is that the branch must be up to date: if your dry run is based on an old master commit, it's not really what's going to happen, so we turned that on, and you have to do the review build again based on the latest master. Of course we require approving pull request reviews, and we immediately fail reviews if we detect certain keywords, like "destroy"; you probably don't want to destroy something. We still do those cluster destructions manually, but not in this repository.
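Those GitHub-side guardrails can themselves be put into code. The sketch below is only an illustration of the general idea, using the community GitHub provider for Terraform with made-up repository and check names; the talk doesn't say how Spotify configured this. strict = true is the "branch must be up to date" requirement, and required_approving_review_count enforces the approving review.

```hcl
# Hypothetical branch protection for the shared infrastructure repository.
resource "github_branch_protection_v3" "infra_master" {
  repository = "cluster-config"   # placeholder repository name
  branch     = "master"

  required_status_checks {
    strict   = true                 # the branch must be up to date with master
    contexts = ["terraform-plan"]   # placeholder name of the dry-run review build
  }

  required_pull_request_reviews {
    required_approving_review_count = 1
  }
}
```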
Disaster recovery tests: disasters will happen whether you plan for them or not, so it's always good to plan for them. We schedule them in advance and announce widely to operators and users when we're going to run them. You also want to test different failure conditions: maybe a whole cluster going away, or something less catastrophic like the control plane going down, a node not being able to be added, or a pod not being able to be scheduled. As you go through these exercises, record each little issue you see and fix it accordingly, whether it's typos in the documentation or some tool that wasn't automated the right way. And practice makes perfect: it took me three and a quarter hours to restore the first cluster I deleted, along with all its integrations, with help from my team, and the second cluster deletion actually lasted from 8:00 at night until 5:00 in the morning. But now we can restore clusters that are way bigger than the ones we deleted, in one hour, which I think is a huge improvement.

I'm also very grateful and thankful that Spotify's engineering teams have a great culture of learning, not blame. I was freaking out when I deleted that cluster, but my team was very supportive. They said things like "better now than later, this would have happened anyway", and they actually said "we learned so much from doing this". They stopped short of "thank you for deleting it on a Friday", but it was still good that it happened before we had too many users on the clusters. We're all human and we all make mistakes; what's important isn't who made the mistake, but what you learned and how you prevent making the same type of mistake in the future.

To recap what prevented any end user impact: we planned for failure from the very beginning; we migrated large-scale infrastructure very gradually, which gave us time to build redundancy, reliability, and reversibility into the process; and we have a culture of learning.

As for our next steps with Kubernetes at Spotify, I'm very happy to say that as of last Monday we flipped the switch: Kubernetes is available for everyone to use and is marked as generally available at Spotify. That means we've lifted the recommendation against running your service entirely on Kubernetes; we feel confident enough that you can take your mission-critical service and run it a hundred percent on Kubernetes. We're now in the process of figuring out, or have already made good progress on, how we distribute workloads among multiple clusters (we now have five in each region) and how we create more redundancy for services, perhaps by deploying them to multiple clusters within a single region.

So that's my talk. I want to again thank the CNCF and Spotify for this opportunity, I want to thank my team for all their hard work making Kubernetes a reality in production, and I want to thank all of you for attending. Spotify is a great place to work, and if you're looking, we're hiring.
We're also using the latest cloud native technologies to make the entire music ecosystem work better for everyone. My name is David; you can find me on Twitter here. Thank you so much. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 36,021
Rating: 4.9109588 out of 5
Id: ix0Tw8uinWs
Length: 20min 22sec (1222 seconds)
Published: Wed May 22 2019