Automate Kubernetes with GitOps

Captions
So the title says we're going to talk about GitOps. I found this slide, which has a photo of me looking a bit younger. Apologies to the GitLab people, but I do most of my stuff on GitHub, so I use git a lot. I'm an engineer: I build and maintain software, I'm not a business person, so that's where I'm coming from.

So what about you? Who knows Kubernetes? I'll put my own hand up... okay, most people, good. Who runs Kubernetes in production? A little bit less, okay. Who's using continuous deployment? Just a few people. Continuous integration? More people, okay. And who already knows all about GitOps? Nobody... okay, maybe a couple; I should put my own hand up, since I'm here to talk about it. That's interesting, so the word is out there. The word was coined by my boss Alexis, the CEO of Weaveworks, and he coined it to describe something we were already doing. And here's Kelsey Hightower, who's way more cool than I am; he says "stop scripting and start shipping".

So what are we going to talk about in this session? I didn't want to take you on a mystery tour; I want to give you all the information on the first slide. This is GitOps. I'm going to walk through it more slowly and come back to this slide, but there it is: that's what I'm talking about.

If we go back, why might we be interested in this? I took a look at the title of this whole session and thought about the cloud-native transformation we're thinking about or looking for, and the big question is: how long does it take to get a new server? In the traditional data center that might be a month; in the cloud, two minutes. Does that resonate? That's the transformation a lot of people are looking for in going to the cloud. How long does it take to deliver a software change? According to Forrester, about 20% of businesses release monthly or faster. Who here releases monthly or faster? Yeah, that might be about 20%, so we're a fairly typical crowd. So how can we speed that up, how can we release faster? The company I work for releases most days during the week (we at least try not to work at the weekend), and sometimes several times a day, and that is powered by GitOps. I also wanted to look at why people release more slowly. There are lots of different reasons, but I found this artifact: this is apparently someone's release process. Actually, this isn't even the release process, this is the release approval process. This might be part of the reason why it takes a month or more: lots of stuff to do, lots of boxes to tick, et cetera.

Okay, one more piece of backstory. Weaveworks runs an online service, Weave Cloud, with tools for continuous deployment, monitoring, visualization and so on. One day a few years ago, a typing slip caused all of our servers to be deleted, and we were up and running again in 45 minutes. The reason I mention this is that it's actually what got the CEO interested in talking about all this. He was amazed: how could you do that? And that 45 minutes includes all the time spent going "oh no". He was totally amazed by it and really interested in how we did it: what were the circumstances, what were the practices, et cetera.

So here it is: we use declarative infrastructure. Everything is described in files; all of those files are in git, under version control; and git is the single source of truth. It is the master; git drives what we are actually running. So when the unfortunate event happened and all of our servers were gone, we could reapply the config we had in git: new servers came up, the software was installed, and, since we run on Kubernetes, we reapplied the manifests, all the pods came up, and we were up and running again 45 minutes later. That's GitOps in a nutshell.

The "ops" part: every change to the environment is a git commit. That means you can see all the history, you can see who changed what and when, and you can roll back to any point in time. It also means that the config you have in a file is the config you are running, not the thing that someone fiddled with last Tuesday to tweak it because it just wasn't working right, "so I SSH'd into the box and changed it", and now the record of that is lost. In our environment, if you SSH into a box and change something, a few minutes later an alert will pop up because it no longer matches what's in git. If you do that at the Kubernetes level, it will simply get overwritten: we don't automatically revert at the lower levels, but at the Kubernetes level, if you kubectl edit something, less than a minute later it will be reverted to what's in git.
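As a rough illustration of that audit trail and rollback, this is the kind of thing git gives you out of the box. The repository layout, file name, commit hash and tag below are all made up:

```bash
# Hypothetical config repo: who changed what, and when, for one manifest.
git log --oneline -- k8s/frontend-deployment.yaml

# Show exactly what a particular commit changed (hash is invented).
git show 4f9c2ab -- k8s/frontend-deployment.yaml

# Undo one change as a new commit...
git revert 4f9c2ab

# ...or restore yesterday's known-good config wholesale (tag is invented).
git checkout v2019-05-29 -- k8s/
```

Once the sync agent described below picks up the resulting commit, the cluster follows.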
I'll put this slide up again; it says the same thing. Describe your system declaratively; keep it under version control. I'm saying git, because I use git, but it doesn't have to be git; this is a concept, so if you want to use Mercurial or ClearCase or whatever, go ahead. I wondered if ClearCase would get a laugh; I guess we're so far beyond those days that it doesn't. Changes to the desired state are commits, and software agents sync them up.

Okay, one more slide. Anyone can use GitHub or GitLab or whatever. People can join our team and start making changes; config changes are PRs, they can be approved, and all of that process can be handled there. It's not anything very new or clever. It's not as if we invented some genius new technique that everyone has to adopt; it's all pretty simple stuff. You may already be doing all of this, or some of it, but it's not hard to imagine how you could be using it, I hope.

Okay, let's walk through it. I guess most people put their hands up that they know Kubernetes, so this bit is fairly easy; I wasn't sure what the audience was going to be, so I included it. The high-level view: Kubernetes runs a set of nodes. The control plane holds a set of objects that define what's supposed to be running, and that is the declarative state of the system. The part of Kubernetes running on each node, the kubelet, reads the spec of what's supposed to be running and then runs the actual software in pods. But most people put their hands up that they knew Kubernetes, so we won't dwell on that.

There we go, I drew some arrows. I put up a warning because I thought there might be business people in the room: I promise this is the only time I'm going to show some YAML. I wanted to stress how easy it is at this level. This is a complete declarative description of a deployment running in a Kubernetes system. It is YAML, but it's not that scary. The point is that this is what you would break out as a file and put under version control. The things you might want to change in the file: you might change the number of replicas, scaling up to a hundred pods or scaling down if it's a quiet day, and you might change the version of the software. A more realistic manifest would probably run to three, four or five times that length, with a lot more detail, but the principle is the same: all we're doing is editing this file, putting the change in a git commit, and syncing that up to the system. It really is painfully simple at that level.
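The slide itself isn't reproduced in these captions, but a manifest of the kind being described looks roughly like the sketch below. The app name and image are invented; the two fields the talk calls out, the replica count and the image version, are marked:

```yaml
# A small Deployment manifest of the sort shown on the slide (names invented).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podcast-frontend
spec:
  replicas: 3                       # scale up to 100, or down on a quiet day
  selector:
    matchLabels:
      app: podcast-frontend
  template:
    metadata:
      labels:
        app: podcast-frontend
    spec:
      containers:
      - name: frontend
        image: example.org/podcast-frontend:1.4.2   # bump this tag to release a new version
        ports:
        - containerPort: 8080
```

Editing either marked line, committing, and letting the sync take over is the whole release mechanism being described.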
(The slight pause is because I keep pressing this clicker and nothing happens, and I'm not sure what's going on, but we'll get there.) Another anecdote, this one from my personal history. I worked in an environment where we had all the config in a database. This is actually a photo of the trading floor where I worked, and these are some of the people I worked with; they really did have six screens each. This was 2006, I think, thirteen years ago. We had a pretty sweet system: it had automated deployment, and you could do this same thing, change the version number and it would roll out. We put the config in a database because that seemed like the obvious thing to do, but we didn't have the history. So at times it was, "well, why is it running that version? Who changed that?" We didn't have the history. For me, reflecting on my own history with this kind of technology and this way of working, having the audit trail of who changed what and when, and having the tools to see that very easily, is a massive benefit of GitOps.

I even have an example of the history, and again, apologies to the GitLab people. This is an actual YAML file for one of our deployments (we run Prometheus), and it's in our dev environment, because I'm not allowed to show production details in public. But this is the real history of this file. We do this stuff every day, and these are config changes: there's a version change, and so on. This is the kind of thing you get: the history of who changed what, when.

Okay, let's look a little deeper; how am I doing for time? Okay. Start with the simple case, which I've been belaboring: you put all the files in git, the manifest files, the YAML files that Kubernetes understands, and then you need something to synchronize them up. We have made an open-source tool that does this, called Weave Flux, which we host on GitHub. Who has run Flux? One person, okay. Did it work? I shouldn't ask that; it would be a rather different talk if you said no. Good, I'm glad it worked for you.

A lot of people look at this problem and think, "why do I need a tool? This is just kubectl apply; why even talk about it?" My answer to that is basically that I don't mind: if you download the source you'll see there are about 30,000 lines, because there are a lot of corner cases, a lot of little wrinkles and things that people want to do, and a certain amount of automation across different features. But I do believe you could do this in a bash script in about eight lines, just without all the corner cases covered. So it doesn't really matter: Flux is open source and it's one tool, you can probably find other ones, and you can certainly make your own.
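To make the "eight lines of bash" remark concrete, the core loop amounts to roughly the sketch below. The repo URL and path are placeholders, and it skips everything the real tool exists to cover: error handling, pruning deleted resources, multi-repo support, image automation, notifications.

```bash
#!/usr/bin/env bash
# Minimal GitOps sync loop: pull the config repo, apply it, repeat.
set -euo pipefail
git clone https://example.org/acme/cluster-config.git /tmp/cluster-config
while true; do
  git -C /tmp/cluster-config pull --ff-only
  kubectl apply -f /tmp/cluster-config/manifests/   # re-applying unchanged files is a no-op
  sleep 60
done
```

Because the loop keeps re-applying whatever is at the head of the branch, it also gives you the drift-correction and retry-until-it-converges behaviour discussed elsewhere in the talk.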
So: we put the files in git and we synchronize them to Kubernetes. That's the basic operation for config, just as I've been describing. (Maybe if I stand closer... okay.) Here a person has made a config change, and we sync it up to Kubernetes. The other part of the picture is the image repo: the software you run comes down from Docker Hub or somewhere like that, which is a separate repository, but which images run is configured by what's in the git config.

Let's move on to continuous deployment, for when you have a build pipeline. This is the case where you're building your own software: you, or someone, makes changes to the source code, and the output of the build goes into an image repo like Docker Hub. Then you might like continuous deployment, meaning every change to the source code makes it all the way through to your running system. (I'm going to stick to pressing the spacebar; the laser part still works, which is interesting.) You can automate that piece too, and Flux will do this as well: every time a new image appears, it updates the version that's in git, which then synchronizes through to the Kubernetes system running your environment. So the top line is continuous integration and the bottom line is continuous deployment. We separate the two; I think I have some more slides about this, but it's an interesting point that we treat them as separate tasks.
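Conceptually, that automated step is nothing more exotic than the sketch below: when a new image tag shows up in the registry, rewrite the tag in the manifest and commit. Everything here (paths, registry, tag, branch) is hypothetical, and Flux does this internally with proper registry scanning and tag filtering rather than a hard-coded version:

```bash
#!/usr/bin/env bash
# Hypothetical "release by git commit" step: bump the image tag in a manifest
# and push; the sync loop then rolls it out to the cluster.
set -euo pipefail
NEW_TAG="1.4.3"
cd /tmp/cluster-config
sed -i "s|\(image: example.org/podcast-frontend:\).*|\1${NEW_TAG}|" manifests/frontend.yaml
git commit -am "Release podcast-frontend ${NEW_TAG}"
git push origin master
```

The important property is that the running system never gets told about the new version directly; the only thing that changes is the file in git.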
For us, every version that gets built goes into our staging environment by default, automatically. We do not do that into production; we are scared. People should live their own lives, but we are not brave enough to automatically feed every change straight through to production. So we put a human approval step in between staging and production. It's the same git repository and the same config, with some differences in scaling and so on, that runs our staging environment and our production environment. That step can be a PR process, or it can be a promotion release; we have a tool for that as well, to copy the versions through without you having to edit the file yourself. My point is that this could be any process you like: all the tools are there, and it's just a git commit at the end of the day. However you get there, it's just a git commit. So this is effectively how we run our system, and you can run yours the same way: the tooling is open source and the concepts are really pretty simple. We automate the step into staging, and a human decides when it's ready to go to production.

What else? This is something I often hear people say: "I just drive my deployment from CI; I don't need this extra tool. Why are we separating CD from CI?" If I step back: I have one CI pipeline and multiple CD pipelines, so to me it doesn't work so well to have one thing drive the other, because there's actually a multiplicity there. The other point is that there are multiple sources too: in our system we deploy many different projects. We deploy things that were built at Weaveworks, we deploy memcached, we deploy other things that are built in other people's projects. So there is not a one-to-one mapping of continuous integration to continuous deployment, and we separate the two: we run continuous integration, whose final product is an image that has passed all of its tests, and then we run continuous deployment, whose end result is a running environment. That's my answer to the comment I often hear, "I just put the deployment in my Jenkinsfile".

Another, more detailed wrinkle: occasionally the process of applying the updates fails partway through. This happens particularly when you change the way things are laid out. It fails because of a missing permission to make the change, or because the version of Kubernetes has changed and no longer accepts that data, or maybe it's a network glitch; it's always good to blame the network, right? If something goes wrong in that final apply stage, and you were changing, say, four YAML files and two of them got changed and two of them didn't, then trust me, that is a hell of a job to unpick and figure out what happened. In the GitOps world, what happens is that it does it again a minute later, and it's just applying the same files (unless another update has come through in the meantime). So when the underlying problem, the network glitch or whatever, gets fixed, it simply applies the config, and you don't have to debug it; you don't have to untangle a half-applied config. So we do not recommend that you drive deployment from CI, because it's much more dependable and reliable to separate the two things and run CD separately from CI.
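To make that separation concrete, here is a sketch of a CI pipeline whose only product is a tested image. It's written in a GitLab-CI-like style with invented names (it could equally be a Jenkinsfile or any other CI system); the point is simply that there is no deploy stage:

```yaml
# Sketch of a CI pipeline whose final product is an image, not a deployment.
# Registry, image name and test command are placeholders.
stages:
  - test
  - build

test:
  stage: test
  image: golang:1.12
  script:
    - go test ./...

build:
  stage: build
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker build -t registry.example.org/acme/podcast-frontend:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.org/acme/podcast-frontend:$CI_COMMIT_SHORT_SHA
  # No deploy job: delivery is the sync agent's responsibility, driven from the config repo.
```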
Okay, let's move on to observing things, because another thing that lets you move faster is knowing what's going on. It's all very well to produce software and throw it out there, but things go wrong, people make mistakes, things don't go the way you expect. What makes life a lot better for someone operating the system is observability. If you have some idea how the thing is actually running, especially in production, you are going to feel better about pushing things out faster; you release faster when you know more about what's going on, and when you have the confidence that you can always roll back, because the git history lets you roll back to any point in time, completely accurately, across your entire system.

So it basically turns into a loop, and this is a better slide to take a photo of: the loop of release, observe, operate. You make a new release, you push it out, you see how it's working; you're ready for the next one, you push that out, you see how it's working; and you drive that loop as fast as your organization is comfortable with. For us that's pretty much every day; not necessarily the same software every day, but we have a lot of different parts of the system, so things are released pretty often.

That's the conceptual loop. We can get into a more particular way of doing it, which we call progressive delivery; canary releases is another term people know. The idea is that we start off running version 1 of our service, perhaps as multiple instances, with traffic routed to all of them. Then we introduce version 2 and direct a small amount of the traffic, say five percent, to it. This is called a canary deployment because you don't mind too much if the canary dies: it's only a small percentage of the traffic. You run that small percentage, observe it, and see if it works; if it does, you ramp up the amount going to the new version and run more of those replicas. We have another open-source project called Flagger which automates all of this; other ways to do it are possible. Flagger works with a service mesh, because you need something that can redirect a percentage of the traffic to a particular version of the software: it works with Istio and with App Mesh, and if anyone wants to make it work with another mesh, send us a PR (it's open source), or send us money. If you've got to fifty percent and it's all good, you start replacing version 1 with version 2, go all the way up to a hundred percent on version 2 and zero percent on version 1, and then return to the beginning state: the right number of instances, but all on version 2 now. And this is automated by looking at metrics. You define what you care about: it might be an error rate, so you keep ramping up as long as everyone is getting 200 responses and not 500s; it might be latency staying below a threshold; whatever you like. These are automated canary deployments driven by metrics, automating the drive around that loop of release, observe, operate. This is fairly cutting-edge stuff, fully automating this, but it exists, and it relies on the underlying features of Kubernetes and Istio or App Mesh. So that's next-generation technology, maybe, for some people, but it is certainly possible.
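Flagger drives this with a custom resource along the lines of the sketch below. The field names here are from a later Flagger release than the one current at the time of this talk, so treat the exact schema as approximate and check the Flagger docs; the app name is invented.

```yaml
# Sketch of a Flagger canary: shift traffic in small steps and promote only
# while the metrics stay healthy; roll back automatically otherwise.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podcast-frontend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podcast-frontend
  service:
    port: 8080
  analysis:
    interval: 1m          # check the metrics every minute
    threshold: 5          # give up and roll back after 5 failed checks
    maxWeight: 50         # ramp the canary up to 50% of the traffic...
    stepWeight: 5         # ...in 5% steps
    metrics:
    - name: request-success-rate    # keep getting 200s rather than 500s
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration        # and keep latency under 500 ms
      thresholdRange:
        max: 500
      interval: 1m
```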
There are some considerations, though: not every kind of software will work like that, so I wanted to put up some thoughts about the evolution of these things. If your software is serving an API, you can't just change the API, dropping one version and putting in a new one, and still do all of this automated rollout and canary deployment. The new version you're putting out has to support the same API as the old version; everything has to broadly work the same. So if you are changing an API, you may need to support both versions for a period of time. That's extra work if you're a developer, but it is necessary in order to work this way, in order to be able to smoothly roll out new versions of the software. You can't just abruptly change APIs and the like.

Then there are the things that don't fit. Basically, don't do this to your Oracle database, or whatever your primary storage is: things that hold on to data and take minutes to react to changes in configuration. Don't subject those to multiple config changes a day. Generally I advise people: if you're moving to Kubernetes, don't move the primary storage. If you have a big Oracle database, leave it where it is, at least for the first year. Run all the other stuff there: your web servers, your business logic, the stateless things; those can cope with this kind of rapid evolution, rapid change, roll forward, roll back. Your primary store can't, unless it is exceptionally well engineered for that way of working. Even something like Cassandra, which is engineered to run on many nodes, some of which may die, is not something I'd want to roll randomly forwards and backwards. Does anyone here run Cassandra? No one? Would you roll the nodes back and forth like that? No, okay. So we're talking about your business logic, display logic, web servers, that kind of thing; we're not talking about your big Oracle database with this model.

One more thing: GitOps for Kubernetes itself. So far I've been talking about the software you run inside Kubernetes, but what if you want to apply the same principle to Kubernetes itself? What if, for instance, you start with no Kubernetes at all, just some nodes, some VMs or bare-metal machines? We can do the same thing: put the config in git. There is a recently adopted specification, essentially a YAML spec for Kubernetes clusters, called the cluster API. It's not really an API, in that you don't make calls to it; it's a declarative definition of a cluster. You can take those YAML files, put them in git, and have some kind of agent which understands that format and installs Kubernetes based on it, and now you have a running Kubernetes cluster. Installing from scratch is something you might do occasionally; other things you might want to do, like a version upgrade of Kubernetes from 1.14.0 to 1.14.1, become a git commit which is then synced up to your cluster. It turns out we make one of those agents as well, but the cluster API is a standard part of the Kubernetes project, so other people will have that too.

Okay, I think I'm just about done. A cunning subliminal advert: we have products to sell. But this is my message about GitOps: describe your whole system declaratively, put that in a version control system such as git, make changes to the desired state as commits, and sync it up automatically, and you're doing GitOps. So there we are, thank you. Oh, and we're running an AWS event tonight; if anyone wants a free drink, search for our AWS birds-of-a-feather session, or try to type in that tiny URL. We'd love to see you there.

Does anyone have any questions? Got one right there. "With Flux, if I just change a ConfigMap or a Secret, will it restart my pods? Does it take that into consideration, or do the containers have to have a file watcher and notice it themselves?" That's a good question, and it's a little bit detailed, so let me explain some background. In Kubernetes you can have a ConfigMap, some data which is read by the running software, and if you change that data, by default nothing happens: the software could read the new values, but it doesn't know that it should. The Flux tool I was talking about does not itself help with that problem. One thing a lot of people do is take a hash of the contents and use it in the name of the ConfigMap, so that when you change the contents the name changes, so the pod spec changes, so you get a rolling deployment. Helm will do that automatically, and Flux works with Helm, so you can drive all of this with straightforward YAML or from Helm and take advantage of that feature. It is certainly a requested feature in the Flux tool itself. On the file-watcher part of the question: what you get is that everything sees the new file at the same time, and if you made a mistake in it, everything reads the bad config at the same time, which is not great. If you think of the canary rollout idea, you actually want the opposite: you want, say, 5% of your pods to read the new config before the other 95%. We don't have that feature yet, but that's the direction we'd like to take. Good question.
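The hash-in-the-name pattern just described looks roughly like the sketch below. The names and hash suffix are invented, and in practice the suffix would be computed by whatever tool renders the manifests (Helm achieves the same effect with a checksum annotation on the pod template):

```yaml
# Content-addressed ConfigMap: changing the data changes the name, which
# changes the pod spec, which triggers an ordinary rolling update.
apiVersion: v1
kind: ConfigMap
metadata:
  name: frontend-config-7d9f2a      # suffix derived from a hash of the data
data:
  listen_port: "8080"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podcast-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: podcast-frontend
  template:
    metadata:
      labels:
        app: podcast-frontend
    spec:
      containers:
      - name: frontend
        image: example.org/podcast-frontend:1.4.3
        envFrom:
        - configMapRef:
            name: frontend-config-7d9f2a   # referencing the new name rolls the pods
```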
Who has another question? "Any preference for handling secrets in GitOps?" Secrets, yes; there are really two questions about secrets. One thing you probably don't want to do is check your secrets into your git repo in the clear. One way around that is to encrypt the secret with another key; that's the so-called sealed secrets model, and it's one that we've used. It does mean that in the event where you need to bring up your whole cluster in 45 minutes, you have to be able to find that other key, which is not in the git repo, and get it out and put it in place. Alternatively, some other secrets manager, like Vault, will outsource that problem for you: as long as you don't delete it, you can leave the problem there.
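As a sketch of that sealed-secrets workflow (the secret name and value are placeholders, and kubeseal is the sealed-secrets CLI): the plaintext Secret never goes into git, only its encrypted form, and only the controller inside the cluster holds the key that can decrypt it.

```bash
# Create a Secret manifest locally; this plaintext file is never committed.
kubectl create secret generic db-credentials \
  --from-literal=password='s3cr3t' \
  --dry-run=client -o yaml > secret.yaml

# Encrypt it against the cluster's sealing key.
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# Commit only the SealedSecret, which is safe to keep in git.
git add sealed-secret.yaml && git commit -m "Add sealed db credentials"
```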
There are other approaches. I've talked a lot about how we run in production, which I think people are often interested in: we run on AWS, and we outsource a lot of this to the Amazon authentication and identity system, so there are basically no secrets in our software; services dynamically pick up credentials by talking to the IAM API in Amazon. That's much the same idea as Vault, I suppose: don't keep any secrets in your config, pick them up dynamically from a credential service. Sealed secrets is the purest approach, in that the secrets get decrypted on the way into the system, but I think we're all still a little bit on a journey there; if you have to deal with this kind of thing, there's probably more evolution to come.

"Hi, thank you for the presentation. What is your forecast for the last part, primary storage, and for all the legacy parts: enterprise hardware and software, everything that existed before Kubernetes and cloud native? Will GitOps cover it? Will that part just disappear, with GitOps and cloud native everywhere, or do you see a third way?" I couldn't hear you very well, but I think you're asking about the future: will GitOps cover more of the legacy domain? Probably, yes. Legacy or not, the big distinction is really the pets-versus-cattle thing: is this something you have to look after carefully and stroke in exactly the right way, or is it something you can just shoot in the head and make another one, which is the cattle domain? The automation is more angled at the cattle way of doing things. The only way I can see this automation applying to stuff that needs very careful handling is if some kind of operator in the environment understands how to stroke the thing very carefully: how to shut it down carefully, move it across to another node, and start it up carefully. If you have that, then yes, you can do all this automation, you can do GitOps, as long as your automation does everything that your pet needs. I don't know which will get there faster, whether things will migrate from being pet-like to being cattle-like, or whether more operators will be written. But some of those things, like the big Oracle server I've been going on about, are used because they work really well; I don't necessarily believe they're going to get swept away. Big iron databases work really well at the things they're good at, and you need to treat them carefully; you don't want to just shoot one in the head to move it onto another node. So probably, for those things, the automation will get built in, and then we can use these techniques for everything. Okay, thank you.
Info
Channel: Sysdig
Views: 5,344
Rating: 4.8518519 out of 5
Keywords: kubernetes, gitops, cloud-native transformation summit, automate kubernetes
Id: 69ZV1UkVmFM
Length: 47min 19sec (2839 seconds)
Published: Thu May 30 2019