The Container Operator’s Manual - Alice Goldfuss | #LeadDevLondon 2018

Captions

I love being in London, and I've already been here a few days, which means I'm about 110% tea and 0% regrets, and I'm just so ready to vibrate at you for 40 minutes. But before I do, before we start, allow me to just say: happy Pride! Yes, I knew you were a hip crowd. And for those of you that don't know what I'm talking about, it's like the World Cup for people with fashion sense.

This talk is called The Container Operator's Manual, and if you have any comments or critiques, or want to talk about how rad my shoes are, 'cause they are, my Twitter handle is right up there. It's @alicegoldfuss, and I'm very excited to hear what you have to say. Speaking of me, who am I? I'm Alice, nice to meet you. I'm a site reliability engineer at GitHub, which means I'm a professional worrier with sudo access.

And today I want to talk to you about car ads. You know when you're watching TV, and by that I mean the five seconds before the skip-ad button comes up on YouTube, and it's this car ad, and it opens on this beautiful, picturesque mountainside with a serpentine road weaving in and out, and there's this sleek beast of a car raging down the countryside. All these stats are popping up on screen, like 0 to 60 in three seconds, 70 miles to the gallon, five-star rating, and you're watching it like, "Yeah, this is amazing. Science has proven that this is the best car ever made. I must buy it. I must have this car in my life." And if you watch really closely, you may notice at the very bottom of the screen, in really small letters, the phrase "laboratory setting." Laboratory setting, because it's an advertisement, and those stats and figures were gathered in perfectly tuned laboratory conditions. They're about one step away from theory, and if you were to purchase that same car and take it out in the field, you would not get the same results.

And that's how I feel about most container talks. You walk into a container talk, you sit down, and they go: step right up, step right up! In the next 20, no, 15
minutes I will tell you everything you need to know to completely revolutionize your infrastructure. What's that, you want to run more than one container? Well, no problem: just download this gist, don't read it, run it on your laptop, and in five seconds you will have a perfectly tuned three-node cluster running. And you want to run containers in prod? Well, go right ahead. After all, prod is just your laptop times 100, right? Great. And then they walk off, and you're left staring at this half-finished tutorial like, is this why sysadmins are so mad?

Frankly, I would like to see a lot less Docker 101 and a lot more Docker WTFs, because there are just things as an industry we need to talk about when it comes to containers. We need to examine them a little more critically than what we've been doing already, and that's what I want to do with you today. I want to discuss some of my experiences with you to help you make better decisions when it comes to containers.

And you might be saying, "All right, great, that sounds good, but why should I listen to you, [unclear] DevOps hands?" Well, this is what I've been up to. I first started running containers in prod at scale in January of 2015, and that was at a company that looked at Docker 0.6 and was like, "Yeah, let's run it in prod." And we did, and we kind of just paved the street behind us, so I have a lot of good early-adopter information from that experience. From there I went to a team that built and maintained Dockerized Cassandra clusters, and now I'm on the team that runs the Kubernetes platform behind github.com. So all told, I have about three and a half years of experience running containers at production scale, and if you consider Docker came out about five years ago, that's really not bad.

But before we begin, I want to make sure we're all on the same page, so let's just start with: what is a container? Because if you've ever been to one of those Docker 101 talks, it usually goes a little bit like this. First you have your host, and it's a box, and on top of your host
you have your operating system, which is also a box, and on top of that you have your container daemon or engine, and it's a box, and on top of that you have your containers in little boxes. And that's fine, except it's all lies. It's just completely wrong. It's just led you to deception. This is a really good way for humans to see containers, but this is how your computer sees containers.

So this is the output of a top command from a Kubernetes host, and as you can see, we have some processes running. Third down, we have the dockerd process, which is that Docker daemon or container engine. Fifth down, we have a kubelet process, which is a Kubernetes process. And then we have some Ruby and Python processes running, and those Ruby and Python processes are containers. Those are containerized applications. They're not abstracted away, they're not hidden in boxes, they're just processes. Because containers are processes born from tarballs, anchored to namespaces, and controlled by cgroups.

So what do I mean by that? There is no actual concept of a container in the Linux kernel. There's no flag that you trigger to use a container. Containers are made up of a bunch of native Linux features stapled together. So let's break that down a little bit more. When you're making a Dockerfile or some sort of container config, you are specifying a recipe or instruction set for the environment and file structure of your application. When you build that Dockerfile, you are making a tarball, and when you deploy it, it gets deployed, unpacked, and run as a process like any other process on that host.

And here's where the native features come in. The first feature is namespaces. Linux namespaces determine what that process can see, such as other processes or directories, whatever you specify. It can't see everything on the host; it can only see its own slice of the universe. And we also have cgroups, or control groups, and cgroups determine what that process can use: resources like memory or CPUs,
and together these features make containers. This is a container. And if you're going, "Oh, that's what they want me to run in prod? I thought it was, like, little whales or something," don't worry, it's okay. We're going to take a nice hard look at containers, discuss some of the pros and cons, and enable you to make better decisions about whether or not this is the solution for you. And we're going to do it using four lessons. Could have been 400, I cut a lot out of this talk, but just four. Put some tea in me and I'll shove more at you, but just four for now.

So let's start it off with the first one: containers have strengths. I don't want you leaving this talk and going on Twitter and being like, "That emo kid doesn't like containers." Containers are good. I like containers when they're used in the right way and when you play to their strengths. And containers' strengths lie in ephemeral, disposable processes, and by that I mean stateless applications.

Now, just again so we're all on the same page, what is a stateless application? A stateless application is an application that takes data in, changes it in some way, and then sends data on to another endpoint. It doesn't keep any of that data for itself, it doesn't become altered by the data, it doesn't take a count or anything. It has no recollection of having processed that data. It is stateless. And stateless applications are fantastic when you are scaling out your infrastructure, because you can run multiple instances side by side and, say, put them behind a load balancer and have them all pointing to the same back end, like a database. If one of those instances dies, your dependencies don't care: they still have things receiving data and things sending data to them. And the other instances also don't care, provided you have enough headroom, because they don't depend on each other to do their work. They are stateless, so if you lose that instance, no one cares. They are ephemeral, they're disposable, and containerizing things in your
infrastructure that are already stateless, like if you have microservices, will really play to their strengths. And just so we're clear here, containers are not magic boxes, all right? My magic boxes are okay because we know they're lies. I'm gonna use boxes to represent containers, but we know they're lies. But if you containerize stateless things, you'll be able to maximize those strengths.

So what do I mean by maximize those strengths? Well, first of all, you make your applications portable. Once you go through that journey of making sure you have every dependency you need delineated out in your application file, you've taken stock of everything you need to run them. As long as that container will run somewhere, as long as you have a kernel that'll handle it, you can run that application everywhere, and you don't have to worry about not having the dependencies that will support it. If you want to change the library versions in your config, or you want to try a different version of your code, fine: rebuild it, deploy it, it'll be fine. And this means it's extremely easy to iterate and it's extremely easy to upgrade. And you want to be able to iterate fast, because we're all about innovation these days, right? All about that velocity. Containers can help you there. You change one thing, you could do an A/B test even. Change it, deploy it, it doesn't work, change it again, redeploy it, you're all done.

Now, you can get the same information from a blog post, you can get the same information from one of those Docker 101 talks. But a strength that those blog posts and those talks don't discuss is disaster recovery. How many people here have had to deal with disaster recovery? Yep, a few people. So when you start hearing phrases like "enterprise" or "legacy" or "SOC 2," you need to deal with disaster recovery. And how many people here have done a disaster recovery exercise that looks a little bit like this? Yep. You're sitting down, you're taking stock of everything you need to know to run that
application for your disaster recovery process, and it turns out that this one uses Capistrano, that one uses Jenkins, and this one is run on a cron job off someone's laptop that deploys every day at 3:00 a.m., and if they were to quit we'd be in big trouble. Theoretically, of course. And even though you write everything down and you think you know how to run your application, on the day of the disaster recovery exercise it turns out that you actually didn't have everything that you needed, and you have to call someone up to come sit with you and show you how to deploy their apps. But if everything is containerized, everything is deployed the same way to the same place. And as someone who has been on either side, the Capistrano world and the container world, I like it a lot better over here. For a disaster recovery exercise, it goes so much slicker and so much cleaner.

And this is also great when it comes to production incidents, when you're dealing with on-call. Say you get paged, and you wake up, you look at things, and you're like, "Oh, it's actually not my service, it's someone else's service." Well, in the old world you would have to page them and have them wake up and look at it and then redeploy their app with whatever arcane magic they used to do that. But if you are using containers, the first thing you can try is just redeploying their service for them, and if that doesn't work, page them. But it's possible that a redeploy is all that's needed to kick it back into shape, and if so, you have just solved your incident in less time without having to involve more people, and that is a huge benefit.

But even if you don't have a production environment, you can benefit from containers: testing environments. Because what are testing environments if not ephemeral spaces where disposable things run around all the time that you want to clean up afterwards? You could try all different versions of your application, your code, in different spaces, and you can automate the build-up, deployment,
and teardown. And if you want to keep the results in a centralized location, you can export them to a database or send them off somewhere else for later analysis, and then everything is cleaned up afterwards. And this is a fantastic way of testing customer bugs, by the way. You know when customers write in and they're just like, "Something's broken," and you don't have anything at all to match their environment, and you're like, "Can't repro, whoops"? Well, now you can repro, not on your laptop, in prod even, if you want to, on some other server, and you can test it using containers. This is fantastic if you, for example, work on a desktop application and your production environment is someone else's computer, but you still want to be able to test things fast. You can use containers to do this. Basically, anywhere in your infrastructure that you have ephemeral, stateless applications or services, you can strengthen them using containers, and I encourage you to do so.

Now, if containers have strengths, it makes sense that they would also have weaknesses. And if containers' strengths lie in ephemeral, disposable, short-lived things, then surely their weaknesses lie with things that keep their state, keep their shape, are moldable, and by that I mean stateful applications. And usually when people talk about stateful applications with containers, they want to talk about databases. They're like, "Ah, I attended one of those container 101 talks and now I'm gonna containerize all my databases." And usually when they say that to me, I respond, "Are you Google?" And that tends to hurt some feelings, because when I say "Are you Google?" they usually hear, "Are you a success? Are you a household name? Have you had a lot of lawsuits lately?" But what I'm actually saying is, "Are you routing billions of requests to thousands of databases across dozens of data centers per second?" Usually the answer is no, in which case you really don't have to containerize your databases. But I understand why people want to. I
totally get it. There are so many benefits, at least from the outside, of containerizing your databases: faster provisioning, better stability, faster recovery. If you have a database team, if you're running databases on bare metal, and a dev team is working on a new service and they need a new database for that service, it could take the database team months to give them that database. They may need to put in a hardware order, they need to wait for that hardware to come in, get racked, get provisioned, and apply the correct schema before they hand it off to the development team. And that slows down velocity, that slows down shipping, and so I totally understand why people would want to containerize databases.

But you know what? You get all of these benefits from a cloud provider and some tooling. Then people tend to say, "Well, do you have any idea how much that costs? That's so expensive. The AWS costs alone are just unbelievable. No, we're going to build our own solution." Well, how much do you think it's gonna cost for your engineers to research, build, and maintain your own internal solution? And how are you going to ensure that they are going to properly treat your data the way you need it to be treated, secure it properly, make sure it's archived properly? You're gonna be dealing with a lot of data at the end of the day, and do you really want to build your own solution to handle that?

But people say yes, and there are two main ways they go about it, so let's just discuss those two main ways. The first way is you have some containerized database instances and they send all of their data to some network-backed storage array like Ceph. If one of those instances goes down, everyone's fine, you don't lose state, and you can redeploy or regenerate that database instance. It gets its state from something like etcd, and it keeps sending data back to the array. And this is fine, except you're sending data through the network, which means eventually you will be network bound. I mean, if you're gonna
start doing, like, dedicated fiber for your databases, you might as well go to the cloud.

Another way people do it, the second solution, is they have containerized database instances on hosts, and they store their data on the host using a mounted volume. In this case, if you lose one of those database instances, the data stays behind. Well, now what do you do? Well, you can just redeploy that containerized database back to the same host. And if you recall, I said that I ran Dockerized Cassandra clusters. This is what we did. From the outside it looked like a normal Cassandra cluster where you had one database per host. We weren't really winning on hardware footprint. Having the JVMs containerized allowed us to upgrade easier and deploy easier, but we also had to write a lot of our own tooling, and we still had to spend just as much on hardware and operating costs. So did we really gain anything? Other people may try to do this on a one-to-one basis, where it's one container instance per host, and if they lose an instance then they regenerate it elsewhere. But you still have to move that data, and once again you're gonna be network bound.

But I know, I know people want to containerize their databases. And if you're gonna containerize databases, my only advice is: keep it small. Keep your footprint small. If you want to do a test database, if you want to do some sort of assets database that maybe has five gigs on it, fine. But don't store your 16-terabyte customer account database in containers. It's not gonna work, you're gonna be sad. Or, again, use a cloud provider. I was taking a look at the features the other day, and both AWS and Google Cloud allow for automatic database failover, they have scalable databases, you have automatic read replicas, and it covers multiple regions. And Azure has the exact same set, except instead of automatic failover it's manual; they're currently working on automatic. This gives you everything you need without having to worry about if you're going to be network bound, without
having to worry about if your engineers are going to be securing your data properly. I would highly recommend this route.

Speaking of vendor solutions, that brings us to lesson number three, which is: containers need friends. People start looking at containers and they're like, "Oh yeah, we're gonna do containers, gonna slap some containers on that, we're looking into containers now, very hot, very hot right now, those containers." But the thing is, it's never just containers. Containers these days is a vast ecosystem of different vendor solutions, and you need to look at them, research them, and glue them all together to get a container platform.

For example, what are you gonna use to build those container tarballs locally? You could use Docker; it's very solidly primed for a dev setup. But you don't have to use Docker in production. You can run those containers with something else in prod. Do you want to do that? You should research that.

Orchestration: how are you going to schedule your container resources? When you are deploying those containers to your platform and it says, "I need this much memory, this much CPU," what are you gonna use to decide where to run that container, which host to put it on? What's gonna be making those decisions for you? How are you gonna automate it?

Management: let's say you want to run clusters on your container platform. Well, what are you gonna use to do health checks, or to drain traffic, or to do automatic failovers? How are you going to do that sort of management?

And last but not least, networking. You are adding a completely new abstraction layer to your infrastructure. Now that you have a bunch of containerized services running, how will you handle routing, both internally and externally? How will you handle access control? Can all the containers just talk to each other regardless of if they need to? And how will you handle service discovery? How will one container find another? You're gonna need to figure that out. And just to give you an example
of how much you're gonna be looking at, this is just a handful of software solutions out there that handle networking in Kubernetes. And they don't all do the same thing; you need to use multiple of these in order to get a complete networking solution. And for each of those, you need to ask yourself: What is the rate of adoption? Do we think it has long-term support? Do our engineers already perhaps know people that work on those projects? Do our engineers understand what these projects are doing? Can we potentially, possibly contribute upstream if it doesn't do exactly what we want? Did we already try to run it? Do we understand it? Does it work? You're gonna need to do a lot of research.

But let's say you did your research and you figured out what you need to do, and you have containers, you have a container platform. Great. Now you have to hook it into your already existing infrastructure.

For example, deployment: how are you going to deploy to your new container platform? Is it the same tooling as you're currently using to deploy? Do you have to change it in some way? And how are you going to support two parallel deployment pipelines, of your legacy applications and your container ones? And where will support go if one of those pipelines breaks? Where's your priority? How will you handle deployment?

Monitoring: what new metrics and data points do you need to be monitoring on your container platform? What counts as a healthy container? What counts as a healthy service, now that you're probably running more than one in containers? Does your current monitoring support this? Do you need to build new tooling or get a new vendor solution that'll handle that monitoring for you? And do you have to train people on how to use it?

Provisioning: how are you going to provision new container hosts? Do you need to update your base images? How will you handle inventory concerns? What gets priority: container hosts, or legacy hosts, or database hosts? How will you make sure that you have enough inventory to go
around to build out your container platform?

And last but certainly not least, debugging. Now that you have this brand new layer of abstraction, how are you going to troubleshoot it? What do you need to know? And aside from the team that is currently owning and operating this container platform, how will you empower the other developers running services on your platform to troubleshoot their services? Are you going to make some sort of centralized logging platform, or make your own tooling to make it easier for them? Do you have to offer training? How are you going to make sure that they can take care of themselves now that they're running on something new?

And between having to research the container platform and having to integrate it with your existing infrastructure, you are going to be looking at a gradual rollout, where you try some things, you fail, you try some new things, they're steady, you try to migrate apps, that fails. And because of this, you're looking at a minimum one-year span to get a container platform off the ground. And when I say off the ground, I mean you have a container platform, it's mostly stable, and you have, like, one service running on it. One year to get there.

And by the way, the end goal for containerizing is usually not "everything is containers." You're usually going to be dealing with a hybrid solution, where you have legacy apps such as databases running the way they always have, and then some of your services are containerized. This has been true of everywhere I've worked. Most places that are containerizing already existing applications are looking at something like this as a final state, and you need to understand what your end goal is, set your expectations accordingly, and know that it's gonna take you a while.

This brings us to our last lesson. And if you ignore everything else in this talk, if you're just like, "Yeah, I'm gonna containerize my databases right now, I'm gonna do it on my phone," fine, whatever, there's enough semaphores on there, it'll be good. No,
please, please pay attention to this last lesson, which is: containers need headcount. If you want to fail at containers, just ignore this slide. Because usually what happens when people want to do containers is some conversation happens behind closed doors where one person says, "We need containers," and the other person says, "Let's give it to ops." After all, ops already owns inventory and provisioning, configuration management, network, deployment tools, monitoring, incident response, possibly even your databases, so let's just put containers on top of that. Because containers are, what, that's like Jenkins, right? It's just like using a new Jenkins?

No, it's not like using a new Jenkins. Looking into rolling out a container platform is not like using just one piece of new software. It's like deciding to build a new data center in a city no one has ever heard of, with hardware no one's ever used and software no one's even touched. It is a completely new can of worms that you are going to need time and research to handle. And so you are going to need to build a new team, a completely new team. You're gonna need to hire to build this container platform, and you're gonna need a select number of skills in order to make this team successful. So here are some of the skills that you should be hiring for to empower this new team and make sure that it builds the container platform correctly.

You're gonna need someone who understands operations, perhaps even someone who understands operations at your company. And yes, this is the one time when you can move someone from ops to this new team, but then you need to backfill them. You can't just take them away permanently; they need to be backfilled on their old team if they're joining this new team. But you need someone who knows ops at your company, because they're gonna need to know where the skeletons are buried, where the snowflakes are running, because the hope is eventually they will be containerized and migrated over. So you're gonna need
someone that knows that.

You need someone who knows deployments, how deployments are handled at your company, and has a pretty good vision for how to integrate containers into that deployment pipeline.

You're gonna need someone who can write and test the tooling that you will need to glue that container platform together. Because even after you've figured out what vendor solutions you're gonna be using, more than likely they're not gonna give you everything you need, and then you also have to integrate it into your already existing infrastructure, and you need someone who can build that tooling.

You're gonna need someone who understands monitoring, who has a good eye on what kind of new metrics and data points and health checks you're gonna need for your new platform. You know that person, where someone in Slack is like, "I don't know, something might be broken but I can't tell," and five seconds later someone has dropped in a brand new dashboard they made that shows, yes, the thing is broken? You need that person. That's the person you need.

You need someone who understands the kernel, because you are going to get some interesting low-level interactions between containers, your infrastructure, and your kernel. You need someone who can, say, analyze kernel crashes, take a look at what's happening upstream, and understand those odd spaces between things. Someone who can look at the lower level.

You're gonna need someone, or someones, who understands networking and can make solid decisions for what you need to do for your new platform network. Because again, there are a lot of different choices out there, you're adding new levels of abstraction, and you need someone who understands how that's gonna integrate with what you're already using and what's gonna be best for your architecture.

You need someone who understands what security containers provide out of the box, what they don't, and what best practices in the community are. Containers have kind of a bad reputation when it comes to security, but they've made
great strides in this area, and you need someone who's passionate about that who can do the work for you. Because a lot of times, security in containers is very last on the list when it comes to rolling out a platform, and it should really be a first-class citizen.

You need someone who can help with internal adoption, who has good relations with other teams internally. Because you're gonna need someone that can walk up to them during their stand-ups and go, "Oh, that's a nice roadmap you have. You know what would be better? If you didn't do that and instead containerized your services and put them on our beta platform and got all bugged out and then told us what's broken. That would be great." You're gonna need that person that can manage those relationships, because you need those beta customers in order to build a more solid platform, and if you don't have that early adoption, your platform is not gonna succeed. So you need someone that can help you with that migration internally and get it rolling and manage those relationships.

And last but not least, you're gonna need someone who can keep the project and the team on track and get the support the project needs internally to succeed. There are a lot of obstacles that are gonna come up when you're trying to roll out a container platform. People might start yelling at each other over roadmaps, don't know why that would be, and you need a project manager that can step in and go, "No, no, it's fine, we'll give you what you need, my team needs to go over here now, how are you doing?" and then just keep them on track. And boost morale when things don't work, because again, you're gonna be going through that research phase, that development phase, where things are probably gonna fall on the floor and shatter, and you need someone that can be like, "That's fine, we padded the roadmap for this, we know what timeline we're dealing with, you're fine, go pick something else to try."

Now, each of these skills doesn't need to be a dedicated person, but
it also can't be just one person, okay? You need a team. Ideally your team should be between six and eight people, but it at least needs to be four people. And that's because eventually this team will be on call for this platform, and an on-call schedule of less than four people is, in my opinion, untenable. Don't do that to your engineers; they will burn out. So at least four people, but I would do six to eight.

And you need to empower them to succeed, and this is more than just that project manager's job. You are going to need authority. You're gonna need top-down authority in order to build this container platform, because they're gonna be trying new things. They're gonna need a budget, for example, to try out cloud instances to iterate faster. They're going to need the mandate to mess with other people's roadmaps to help with that migration. They're going to need the space to fail. And you need to make sure that they have the support they need in order to correctly build out this platform.

So today we took a look at some of the pros and cons of running containers in prod at scale, and we touched on four lessons briefly. Containers have strengths: their strengths lie in stateless, ephemeral, disposable processes, and if you already have those processes hanging out in your infrastructure, maybe you're breaking down monoliths into microservices, consider also putting those microservices into containers. Two headaches at once, do it. And you can also put them in testing environments. They really bring strengths wherever you have those ephemeral processes.

We also talked about their weaknesses, a.k.a. databases, and how people try to be clever but ultimately end up with either the same hardware footprint with a little more elbow grease, or being network bound, and really you should just try a cloud provider.

We talked about how it's never just containers, and how there's a vast ecosystem of software out there that you need to sift through and try and integrate with your already
existing infrastructure in order to build this platform. And we also talked about how you need to hire a brand new team in order to build this out, and the different skills you need to incorporate in that team to make sure it succeeds.

But at the end of the day, does this really answer your question? You may be asking yourself, "Okay, fine, fine, but should we use containers in prod? Like, come on." Well, there's no obvious yes or no answer, but there are some indications. For example: Do you have stateless services? Do you have a large heterogeneous platform that you would like to bring more in line, so everyone's deploying the same way? Do you have the time, the money, the people, the org support in order to make this happen? Then yeah, try running a container platform, it'd be good.

On the other hand, if you have a monolith and a few support services, if you have nicknames for every one of your databases, if you have a small team with no org support, if they're just like, "Oh yeah, you want to do containers? Well, it's you and her and that's it, and you already work here," or if you just want to spite me, maybe don't do containers. Unless it's the last one. Please, if you want to roll out containers just to spite me, please do that, please live-tweet at me, I would love that.

But at the end of the day, there is a simpler question you can ask yourself, and that's: do you want containers or a blog post? Are containers really the solution to the problem you're trying to solve? If you are making a list of all the reasons to do containers and reason number two is "it would be rad," don't do containers. It's not worth it. There are so many better tools out there that have long-term support, that are tested and proven and aren't constantly changing, that will probably work just fine for you. Because guess what: it's okay not to use containers. Containers might not be the best solution for you. But they may be the best solution for you, and so you need to look at them critically to make sure
they would fit in with what you're already trying to do, that they really are your best-case solution. And if you do use containers, just remember that they can be a little tricky, but as long as you have a good team and you try to have fun, you will get them in the end. Thank you very much. [Applause] [Music]
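
The talk's description of a container as "processes born from tarballs, anchored to namespaces, and controlled by cgroups" can be checked on any Linux host. The following is a minimal sketch, not something shown in the talk: the function names (`parse_cgroup_line`, `parse_ns_link`, `describe_process`) are hypothetical, and it assumes the standard procfs layout, where each line of `/proc/<pid>/cgroup` has the form `hierarchy-ID:controller-list:cgroup-path` and each entry in `/proc/<pid>/ns/` is a symlink whose target looks like `pid:[4026531836]`.

```python
import os
import re


def parse_cgroup_line(line):
    """Parse one /proc/<pid>/cgroup line: hierarchy-ID:controller-list:cgroup-path."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        # The controller list is empty on a cgroup v2 (unified) hierarchy.
        "controllers": [c for c in controllers.split(",") if c],
        "path": path,
    }


def parse_ns_link(target):
    """Parse a /proc/<pid>/ns/* symlink target such as 'pid:[4026531836]'."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def describe_process(pid="self"):
    """Return (namespaces, cgroups) for a process. Linux only."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        # Two processes share a namespace when these inode numbers match.
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        namespaces[name] = inode
    with open(f"/proc/{pid}/cgroup") as handle:
        cgroups = [parse_cgroup_line(line) for line in handle if line.strip()]
    return namespaces, cgroups


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        namespaces, cgroups = describe_process()
        for name, inode in namespaces.items():
            print(f"namespace {name}: {inode}")
        for entry in cgroups:
            print(f"cgroup {entry['controllers'] or ['unified']}: {entry['path']}")
    else:
        print("No /proc/<pid>/ns here; this sketch only applies to Linux.")
```

Running `describe_process` against the PID of a containerized Ruby or Python process and against PID 1 on the same host would typically show different namespace inode numbers and a different cgroup path, which is all that separates the "container" from any other process.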

Info

Channel: LeadDev

Views: 11,540

Rating: 4.7212543 out of 5

Keywords: White October Events, the lead developer, lead dev london, lead developer, software engineer, engineering manager, team building, leadership, management, the lead developer london 2018, docker, alice goldfuss, containers, lead dev 2018, lead developer 2018

Id: sJx_emIiABk

Length: 37min 29sec (2249 seconds)

Published: Wed Jul 11 2018