Automated Testing for Terraform, Docker, Packer, Kubernetes, and More

Hi everybody, thank you for coming. Can everybody hear me okay? All right, wonderful. So this talk is called "How to test infrastructure code." I will go through some automated testing practices that we've found for tools like Terraform, Docker, Packer, and Kubernetes. There will be a lot of code, so get ready to read code. This is hands-on: I'll try to run some of the code, and we'll see how that goes.

But first I'm gonna start with a bit of an observation, something I've noticed about the DevOps industry (ops, sysadmins, whatever you want to call them), and that is that we're all living in a bit of a world of fear. This is the predominant emotion that I'm seeing from most of the people that I work with. They're just living in fear: fear of things like outages, fear of security breaches and data loss, and just generally fear of change. People are constantly afraid to change things because they don't know how late they're gonna be up or how bad it's gonna be. They're just terrified.

And we know what fear leads to, right? Fear leads to anger, anger leads to hate, hate leads to suffering. The great scrum master Yoda taught us these lessons. And we all know what suffering leads to, right? It leads to things like this. I saw this one on Twitter; I love this, I think it's pretty amazing. How many of you have had this sort of feeling? You're typing along and suddenly you're like, "Oh no." You just feel it deep inside, right? Or if you don't like the Star Wars memes, maybe Harry Potter is your preferred one. That sort of thing usually happens after you ran rm -rf in the wrong place, or DROP TABLE.

So most teams seem to deal with this in two ways: one, a lot of drinking and smoking, and number two, deploying less and less frequently. It's scary, it's terrifying, so you just avoid it and you do it less and less often. Unfortunately, both of these solutions just make the problem much, much worse, right? Your releases get drunker, but also they get bigger,
there's more risk. This actually makes the whole problem a lot worse, and then you end up in this sort of world, right? "Sixty percent of the time, it works every time." So I don't want to live in that kind of world. I think there's a better way to deal with this constant state of fear, and that is automated testing.

Now, I'm not going to make the claim that this is gonna solve all the problems in the world and make all your fears go away, but automated tests do have one very interesting impact, and when you see teams that do a good job with it, this is exactly what you see: instead of fear, you start to see confidence. That's what tests are about. Tests are not about proving that your code works. They're not some perfect thing that says, "Yeah, everything's great." They are about confidence. It's about emotion; it's about how you feel about making those changes. And that's really important, because you can fight fear with confidence. That's really the key.

So we do mostly know how to write automated tests for application code. If you have an app built in Ruby or Go or Python or any of these general-purpose languages, we more or less know how to test these things. But how do you test infrastructure code? If you have a whole pile of Terraform code, how do you know that the infrastructure it deploys works the way you expect it to? Or if you have a pile of Kubernetes code, how do you know that the way it deploys your services is the way you actually need it to work? How do you test these things? So that's the goal of the talk. I'll share with you some ideas, some insights on how to test with some of these tools, and we will look at a whole bunch of code, and hopefully by the end of it you'll at least have some ideas of how to sleep better at night, how to be a little less afraid.

I am Yevgeniy Brikman. I also go by the nickname Jim, which most people find a little easier to pronounce. I don't know why I put a picture of myself into my own slide deck, because I'm standing right here. Anyway, I'm the
co-founder of a company called Gruntwork, and this is where a lot of this automated testing experience comes from. At Gruntwork we've built a library of hundreds of thousands of lines of reusable code for Terraform, Kubernetes, Docker, etc., and it's used in production by hundreds of companies. The way our tiny company is able to maintain all of that code and keep it working as the whole world around us is changing is through a lot of automated testing, so we spend a lot of time thinking about this. I'm also the author of a couple of books. There's Terraform: Up & Running (that's actually the old cover, I need to update this slide as well; the second edition is out, go get it) and Hello, Startup, which also talks a lot about the software delivery process.

So here is what we're gonna talk about today. We're gonna look at the various testing techniques that are out there for infrastructure code: static analysis, unit testing, integration testing, and end-to-end testing. These are loose categorizations. Some people become very religious about what each of these terms means; these are just a helpful mental model to navigate the space.

All right, so we've got a lot to cover. Let's get started with static analysis. The idea here is you want to be able to test your code without actually running the code or, in the case of infrastructure code, without actually deploying anything for real. That's the goal of static analysis: look at my code, don't run it, and tell me if there's a bug or some sort of issue. There are a few categories in here. Again, these are not perfect groupings, there's some overlap between them; it's just a useful mental model for navigating.

So the first one is the compilers, the parsers, the interpreters for whatever language you're using. The idea is that these things are checking your syntax, the structure of your code, the very basic thing: does it compile, is this valid YAML, HCL, Go, whatever language it is you're using. And so there's a variety of these tools. I don't
know how well you can read some of the smaller text in the back, but I'll go through this really quick. So for example, for Terraform you have the terraform validate command. I'll show you a really quick example of that. Well, that's intriguing: one of my screens updated, the other did not. Hold on. Okay, I might have to exit the slideshow to make that work. All right, there we go.

Okay, so I'm in here, and I have a little bit of Terraform code. We'll deal with what the code is in a minute. It looks like this, nothing fancy, using a very simple module. In here I can run the validate command, and it tells me everything looks good. Then if I mess up the code, like I make some silly typos in here, and I run terraform validate again, it will give me an error. All right, so that is a very, very basic level of testing that you can do for your code: scan it, tell me if the variables that I'm referencing are actually defined, tell me if the syntax is valid or if I missed a curly brace. There are similar commands for Packer, and in the Kubernetes world, kubectl has a dry-run and a validate flag that you can add that'll do something pretty similar.

All right, moving one level up from that, you want to catch not just syntactic issues but also common mistakes. So there's a whole series of these tools. By the way, these slides will be available after the talk, so don't worry about all these links; they should be easier for you to grab there. For Terraform there's conftest, which actually works with more than just Terraform, plus terraform_validate, tflint, etc.: a whole bunch of tools that will read your code, statically analyze it, and try to catch common errors. One of the idiomatic examples these tools give you is a security group that allows all inbound traffic, in other words, a firewall that's way too open. Something like that can be caught using tools like this in a lot of cases. So these are good to plug into your CI/CD pipeline. They run in seconds, and they're going to catch a bunch of
common mistakes, which again is better than having no testing at all.

The third group, which I don't have a good name for, I'll just call dry run. Here we actually are gonna execute the code, but we're not going to deploy anything; it's not gonna have any effect on the real world. So we are running the code a little bit here, which makes it kind of an in-between between static analysis and unit testing. We're going to get some sort of plan output and be able to analyze that. In the Terraform world there's a terraform plan command that I can run. So on this module I can run my plan command. This little thing at the front is just how I authenticate to AWS, so basically ignore that; the actual command is terraform plan. If I run that, it'll make some API calls and it'll tell me what the code is going to do without actually changing anything in the world. So here's my plan output. It shows me that it's going to deploy a Lambda function, some API Gateway stuff, etc., and you can analyze this plan as a form of testing.

There are some tools that help you with that. For example, in the Terraform world there's HashiCorp Sentinel and terraform-compliance. Both of them can run terraform plan, statically analyze the result, and catch, again, a bunch of common errors. In the Kubernetes world there's a server-side dry run. I think this is an alpha feature, actually; it's pretty new. It will take your YAML, your configuration, and send it to the API server. The server will process it, it just won't save the results, so it's not going to affect the world. But again, this is a good way to check: does my code more or less function to any extent?

So that's a quick little overview of the static analysis tools. What's nice about them: they run fast and they're easy to use, so you don't have to learn a whole bunch of stuff. The downside is they're very limited in the kinds of errors they can catch. So if you're not doing any
infrastructure testing at all, at least add static analysis. It really just takes a few minutes of your time, and it will catch a bunch of these common mistakes. But if you can do a little more, let's do a little more.

So that's where unit testing comes in. Now we're going to get a little more advanced. The idea with unit testing is you want to be able to test, as the name implies, a single unit of your code in isolation. In this section we're going to go through a few things: we'll introduce the basics of unit testing, I'll then show a couple of examples for two different types of infrastructure code, so we'll look at Terraform and at Docker and Kubernetes, and then we'll talk about cleanup.

So, the basics. The first thing to understand about unit testing is: what's a unit? I've had a lot of people come up to me and say, "Hey, I have 50,000 lines of code that deploys this enormous infrastructure. How do I unit test it?" Well, you don't. That's not a unit. Unit testing with general-purpose languages is on a single method or a single class. The equivalent with infrastructure code is going to be a single module, whatever "module" means in the language and tools you're using. So your infrastructure should be broken up into a bunch of small pieces. If it's not, that's actually step one to being able to unit test it. If you right now have a Terraform file, or CloudFormation, or any other language, with 50,000 lines of code, that's an anti-pattern. Break it up into a bunch of small standalone pieces, and one of the many advantages you'll get is that you can unit test those pieces.

Okay, next thing. With app code, when you're testing a single method or class, you can typically isolate away the rest of the outside world. All of your databases, file systems, web services: you isolate them, and you can test just the unit by itself, which is good because then you can test very quickly and the tests are gonna be nice and stable. But if you actually go look at most infrastructure code, so here's some
Terraform code. What's this code doing? All it's doing is talking to the outside world. That's 99% of what your code is doing. Whether it's Kubernetes, CloudFormation, or AWS, all it really does is talk to the outside world. If you try to isolate the outside world, there's really nothing left to test. So the only real way to test infrastructure code beyond static analysis is by deploying it to a real environment, whatever environment you happen to be using. That might be AWS or Google Cloud, it might be your Kubernetes cluster. You actually have to deploy, because that's what the code does: if you're executing it, a deployment is the result. So the key takeaway: there is no pure unit testing for infrastructure code in the way that you might think of it for application code.

Which means your test strategy looks a little more like this. You're going to deploy the infrastructure to a real environment. You're going to validate that the infrastructure works, and I'll show you a few examples of how to do that. And then at the end of the test you undeploy the infrastructure again. So really, and this is where the terminology gets kind of messy, this is more of an integration test. But we're testing one unit, one module, so I prefer to just stick with the term unit test and think of it that way.

Now, there's a bunch of tools that can help you implement this strategy. This is not a comprehensive list, just some of the more popular ones. Some of them will do the deploy and undeploy steps for you; some of them expect you to do the deploy and undeploy outside of the tool yourself. Terratest, for example, can do deploy and undeploy, can do validation, and it integrates with a whole bunch of tools, including Terraform, Kubernetes, and Docker. But there's a bunch of other tools, some that are specific to Terraform, some that are specific to checking servers, so definitely check these out. All the links are in the slide deck, and you'll have access to that soon. In this talk we're mostly
going to use Terratest, but just bear in mind that the same technique will work with pretty much any tool.

All right, so let's try to write a unit test. Here's the sample code. This talk has a bunch of sample code: there's some Terraform code, some Kubernetes, and the automated tests for it. I don't know that that's the best link; I should have gone with a slightly shorter one. It's in the gruntwork-io org and it's called infrastructure-as-code-testing-talk. I'll tweet this one out and it'll be in the slide deck, so you can check out all the code I'm showing you here after the talk. One of the things you'll find in that sample code is a simple little hello world application that we can test. So let me actually deploy that little application now. It takes about 30 seconds, and I don't want us sitting staring at the screen, so let me run apply. There we go. All right, so I'm just going to deploy this thing in the background, and then I'll walk through the code and show you what this thing is actually doing.

Okay, so here's the hello world app. It's Terraform code, and it looks a little bit like this. Very simple code, that's really all there is to it. It's using a module to deploy a serverless application. For the purposes of an example I'm using AWS Lambda and API Gateway here, just because they deploy quickly, so the talk goes faster if I do this. This module lives in the same repo. Here it is, if you're interested in the code. It does more or less what you'd expect: deploy a Lambda function, create an IAM role for it, deploy API Gateway, etc. This code also outputs the URL of this little endpoint at the end. What we're actually running in AWS Lambda is some JavaScript code, and this is basically the hello world example: it just says hello world and returns a 200 OK. It's a really simple piece of code.

It's deployed in the background. I can now copy and paste this URL, run curl on it, hit enter, and there we go, we've got our nice hello world. So this is a nice thing for us to test and play
around with here during the talk. Let me actually undeploy it now, just so I don't forget about that. But what you'll notice is that what I'm doing right now is manually testing this thing, right? What did I do? Deploy, validate, and now here I'm doing the undeploy. So we're gonna actually write a unit test that does exactly these steps, but automatically, in code. Let's see what that looks like. I'll skip ahead here and walk through what the code does in the slide deck, then I'll show you the actual code in a second, and we'll run it and see if it works.

Since we're using Terratest, and Terratest is a Go library, we're gonna write the tests in Go. If you don't know Go, don't panic: it's not a hard language, and it's not critical to understand everything for this talk. It's more about the concepts, just to get the mindset right. So we create hello_world_app_test.go, and this is the basic structure of the test. I'll walk through this line by line; this is actually almost the entire unit test.

The first thing we do is say, okay, here are my options for running Terraform: my code lives in the hello world app folder under examples. I then use a Terratest function, terraform.InitAndApply, to run terraform init and terraform apply, so this will actually deploy into my AWS account. I'm then going to validate that the code is working, and I'll show you the contents of that in just a second. And then at the end of the test we're gonna run terraform destroy. If you're not familiar with Go, defer basically says: run this before the function exits, no matter how it exits. So even if the test fails, it'll always run. Defer is similar to a try/finally or an ensure in other languages. So that's the test: apply, validate, destroy. That's really what we're doing.

The validate isn't particularly complicated. We're using a Terratest helper to read that URL output, then we're using another helper to make HTTP requests to that URL, and we're looking for a 200 OK
that says hello world. And we're gonna retry it a few times, because the deployment is asynchronous, so it's not guaranteed to be up and running the second apply finishes. So that's the whole test. Let me run it really quickly; it'll take about 30 seconds. I'll jump into the test folder and run go test. This is our hello world unit test here, and I'll let that thing run in the background for about 30 seconds.

Let's look a little more at the code. What I'm actually running here: here's my test folder, here's the hello world app unit test, here's the Go code, and it's pretty much identical to what I showed you in the slide deck. There's one little piece that I'll explain in a few minutes, but the rest is exactly as I said: terraform init and apply, validate, destroy. And the validate basically reads the output and does a bunch of HTTP requests in a retry loop.

Speaking of HTTP requests, the reason we're using HTTP is that the infrastructure I'm deploying here is a web service, so it makes sense to validate it by making HTTP requests. But of course you might be deploying other types of infrastructure, and there are different ways to validate those. For example, if you're running a server that's not listening on any port, then you might want to validate it by SSHing to that server and checking a whole bunch of properties. Terratest has ways to do that, and InSpec and all those other tools are really good at that. If you're running a cloud service, you might want to use the cloud APIs to verify that it works. If you're deploying a database, you might want to run SQL queries, etc. So just bear in mind that validation is very use-case specific, but for the purposes of this talk it'll just always be HTTP requests.

So, running tests: you authenticate to whatever environment you're deploying to (in this case I'm authenticating to AWS), and then you run the go test command to actually kick off the test suite. And so if I jump back to the terminal, it should be done running the tests. Yay, that's
always good to see, the word PASS. It took about 35 seconds. The log output unfortunately is hard to read, because at this font size it's kind of wrapping around, but if you dig through here you'll see that the test ran terraform init, then it ran terraform apply (here's the terraform apply log output), it deployed the serverless app, ran terraform output to fetch the URL, then started making HTTP requests, got the response it expected, and ran terraform destroy. And voila: in 30 seconds I can now check that this module is working the way I expect it to, and I can run this after every single commit. That's huge, because I just went from "a big pile of code that maybe works, maybe doesn't, who knows, I guess our users will find out" to "I can test this after every single commit to this code."

All right, so that is the unit testing example for Terraform. Just to make the point that this is not something specific to Terraform, let's do a unit test for something a little different. We're going to look at some Docker and Kubernetes code as well. So let me jump back into my IDE. The sample code is in that same repo. Up here we have our Docker/Kubernetes example, and there's really just two files. One is a Dockerfile, and this defines a really simple Docker image for a really simple hello world server. In the real world this would be your Ruby app, your Java application, whatever it is that you're building, but for this talk it's just a really simple hello world server. The other thing in here is this blob of YAML. This is used with Kubernetes; it defines a deployment. If you don't use Kubernetes, this is basically a way to say: hey, I have this Docker container over here, I want to deploy one copy of it, and I want to stick a load balancer in front of it that will listen on port 8080. So deploy the thing, and put up a load balancer so I can access the thing.

So I can run this thing as well. I'll show you how I test this thing manually first, and then we'll write the automated test for it. So I'll jump
into the examples folder. The first thing to do is build my Docker image. You can do that with the docker build command, and that will run pretty quickly because it's all coming from cache; I've run this before. If you're running it from scratch, it takes 30 seconds to a minute. So that created a Docker image that I can now deploy to a Kubernetes cluster. It could be any Kubernetes cluster I want, one running in AWS or in GCP. If you have the latest Docker Desktop app, Kubernetes is actually built in: you have one running on your own computer, or you can push a button to turn it on, which is pretty neat, because I can now test with Kubernetes completely locally.

So what I can do is run kubectl apply on that deployment YAML file. I hit enter, and that thing will deploy my service. To see if that worked, we can go fetch the pods. So there's my container, it's now in Running status. Then I can do get services, and there's the service in front of it; that's that little load balancer. You can see its external IP is localhost and it's listening on port 8080, which means I can now curl localhost:8080 and get a nice little hello world. So okay, we've got a little Docker example, and it's running in Kubernetes. And then of course at the end we can also delete it by running the kubectl delete command.

All right, so that's how I test manually. How would I test the exact same thing with a unit test, an automated test? As you can probably guess, the structure is gonna look very, very similar to what we just did for the Terraform unit test. I'll walk through it again in the slide deck. So we create docker_kubernetes_test.go, and that's the basic structure of the test. I'll go through it. The first thing we do is build the Docker image, and I'll show you the contents of that method in just a moment. Then we say, okay, the Kubernetes deployment is defined in this file. I want to authenticate to my Kubernetes cluster, and I'm just using all the defaults, which means it'll just use
whatever my computer is logged into, which is the Kubernetes running locally. We run kubectl apply using a Terratest helper, we validate (I'll show you the contents of that in a sec), and then at the end of the test, using that defer keyword, we run kubectl delete. So there's no magic, right? All I'm doing is taking the exact same steps I was doing manually and writing them down in code. The value Terratest brings is just to give you a bunch of nice helper methods for doing this, but you can find similar helper methods or write them yourself.

So let's look at the two functions I mentioned. This is the build Docker image function. It's using another Terratest helper, docker.Build, and it's basically just telling it where the Dockerfile is located and what to tag it with. Not particularly complicated. And then the validate function looks very similar. We wait until the service is available, because Kubernetes is completely asynchronous, so it can take a few seconds to actually deploy, depending on the cluster you're using. Then we start making HTTP requests to this thing, just like we did with the hello world app. And the way we get the URL for a Kubernetes service is to basically automate those steps I showed you with kubectl get pods and get services. I just put that into this method, so there's a GetService and a GetServiceEndpoint method.

So to run this test you authenticate to some Kubernetes cluster. As I said, I'm already authenticated to the one running locally, so at this point I can just run that test. Let's do that: go test, and there it is, Kubernetes. Hit enter, and this test should run very, very quickly because it's all running locally. So hopefully... there we go. Okay, so that took a grand total of 4.9 seconds. And what did the test do? Well, the test built my Docker image; you can see the output there. It's all running from cache, so that runs especially fast. Then it configured kubectl, and it ran kubectl apply.
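The retry-based validation described above (keep making HTTP requests until the service answers the way you expect) can be sketched with nothing but the Go standard library. This is not the code from the talk and not Terratest's own helper: getWithRetry, checkOnce, and runDemo are hypothetical names, the "Hello, World!" body is an assumed response, and the local httptest server merely stands in for a service that is still starting up.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// checkOnce makes a single GET request and compares the status code and body
// against the expected values.
func checkOnce(url string, wantStatus int, wantBody string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if resp.StatusCode != wantStatus || string(body) != wantBody {
		return fmt.Errorf("got %d %q, want %d %q", resp.StatusCode, body, wantStatus, wantBody)
	}
	return nil
}

// getWithRetry (hypothetical helper) repeats checkOnce until it succeeds or
// maxRetries attempts have been made. It returns the number of attempts used.
func getWithRetry(url string, wantStatus int, wantBody string, maxRetries int, sleep time.Duration) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if lastErr = checkOnce(url, wantStatus, wantBody); lastErr == nil {
			return attempt, nil
		}
		if attempt < maxRetries {
			time.Sleep(sleep) // give the asynchronous deployment time to come up
		}
	}
	return maxRetries, fmt.Errorf("giving up after %d attempts: %w", maxRetries, lastErr)
}

// runDemo simulates a service that is not ready yet: the first `failures`
// requests get a 503, every later request gets a 200 with the assumed body.
func runDemo(failures, maxRetries int) (int, error) {
	var requests int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&requests, 1) <= int32(failures) {
			http.Error(w, "starting up", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "Hello, World!")
	}))
	defer srv.Close()

	return getWithRetry(srv.URL, http.StatusOK, "Hello, World!", maxRetries, 10*time.Millisecond)
}

func main() {
	attempts, err := runDemo(2, 5)
	fmt.Println(attempts, err)
}
```

Against a real deployment the URL would come from the deployed service rather than a local test server, and the sleep between attempts would more likely be a few seconds than a few milliseconds.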
You can see it started making HTTP requests, and actually the first one failed, because Kubernetes is asynchronous. That's why we do it in a retry loop. But after another try or two it succeeded, and then it cleaned everything up again at the end of the test. So again, that's five seconds. You could add this even as a pre-commit hook if you really wanted to, or run it after every commit, and you can check not just that the Kubernetes configurations you're writing are syntactically valid, which is good to do with static analysis, but that they actually deploy a working service the way you expect them to. And by the way, I showed you the code in the slide deck, but the actual code for that test is very similar: build the Docker image, here's a namespace (I skipped the namespacing thing, I'll come back to that in a little bit), and then basically here it does kubectl apply and delete.

All right, so that is unit testing. A lot of people see this and they're like, "Is that it? There's no magic? There's no magical thing that does this for me?" And no, that's it. You're just automating the things you would have done manually. That's the basis of unit testing infrastructure code: you deploy it for real. But for me this is well worth it, because right now, with these unit tests, I have a lot of confidence in this code. I know that if somebody changes the code and does something silly, these tests will almost certainly fail and will catch it before it makes it to production. So that's worth a little bit of work.

I'll mention one more thing about unit testing, which is cleaning up after those tests. Especially tests for Terraform, CloudFormation, things like that: they're spinning up and tearing down all sorts of resources in your Google Cloud, AWS, or Azure accounts. For example, we have one repo that deploys the Elasticsearch stack, the ELK stack, an ELK cluster, and after every commit that spins up something like fifteen ELK clusters in various configurations, pokes at them for a while, and tears them all down. That's a lot of
infrastructure after every single commit. So you definitely want to have a completely separate sandbox account for automated testing. Don't use production; I hope that's self-evident. But you might not even want to use some of your existing staging or dev accounts where human beings are working, just because the volume of infrastructure that's gonna be coming up and down will be pretty annoying. So we usually have a completely isolated account used solely for automated testing.

There's one other reason to do that, which has to do with cleanup. The tests that I showed you all run terraform destroy or kubectl delete; they all clean up after themselves. But occasionally that fails, right? You might have a bug in your test, somebody might hit Ctrl-C, something might crash. You don't want a whole bunch of stuff left over in your testing account. So there are some tools out there that can clean everything up, and it's really nice. One tool, for example, is called cloud-nuke. Don't run it in production, but if you have a dedicated testing account, that's a good place to run something like that. You can run these as a cron job and just clean up stuff every day.

Okay, so that's unit testing. Let me see how I'm doing on time. All right, let's move along to integration testing. The idea with integration testing is that just because your individual units seem to be working doesn't mean that they're gonna work when you put them together, and that's what you want to find out with integration testing. I'll show you just one example of integration testing, and once you see it, you'll see the structure is more or less identical to what we've already talked about, so there's not a whole lot new to learn. We'll go through the basics, where the approach is more or less identical to what we used before, and then we'll talk about a few other things: parallelism, test stages, and retries.

Okay, so here's an example from that same repo, where we have two modules that we want to test and see
if they work together correctly. We have one called proxy app and one called web service. I'll show you the code for those. These are using basically the exact same module, so there's nothing really new here; they're using that same serverless app module. The only difference is that web service, instead of a plain hello world, tries to pretend that it's some kind of back-end web service that your company relies on, and it returns a little blob of JSON instead. And then proxy app is a very similar thing, again another little serverless application. The code that it's running will proxy a URL: you pass in the URL you want it to proxy as an environment variable, it'll make an HTTP request to it, and then forward along the results. So you can sort of think of one of these as a front-end application and the other as a back-end, and you want to make sure they work together correctly.

So how are we going to test these things? Well, the first thing to note is that the proxy application has an input variable, which is how you tell it what URL you want it to proxy, and our web service has an output variable, which is its URL. We want to proxy that URL; that's our goal. So we're gonna write a thing called proxy_app_test.go, another Go file, and here's the structure. Hopefully you're starting to get used to this approach. Going through it line by line, you'll see there's really nothing new here. We're gonna configure our web service, and I'll show you what this is doing, but it's that same terraform.Options thing from before. We're gonna run terraform init and apply to deploy the web service. Then we're going to configure the proxy application, passing it information from the web service. So this is really the only new thing here: we're passing information from one to the other, and I'll show you these methods in just a sec. Then we're gonna run terraform apply to deploy the proxy application, we're gonna validate that it works, and then at the end of the test, in defer, we're gonna run terraform destroy
on each of those modules. So it's the exact same structure: apply, validate, destroy. Looking at those methods, here's config web service. It's just returning one of those terraform.Options structs; it says where my code lives. Here's the slightly new thing, which is config proxy app. This is also returning a terraform.Options, with one new thing: it's going to read in the URL output from the web service and pass it as an input variable to the proxy application. So here we're chaining one module's outputs into the inputs of another module, just by passing them along using whatever variables those modules support. The validate method is completely identical to the hello world one; it's just doing a bunch of HTTP requests. The only difference is that it's looking for a blob of JSON in the response instead of plain text. So we can run the integration test. The code for it, by the way, is right here, and it's exactly as I said: config the web service, run apply, config the proxy app, run apply, validate, and then at the end of the test run destroy a couple of times. If we run that test, let's see, where's our proxy app, that's the one. I'll let that run in the background; it'll take a few minutes to run, all told. And that's actually an important point: integration tests in infrastructure code, as you might expect, take longer than unit tests, just like everywhere else, and they can actually take a lot longer. I'm testing these really simple hello world Lambda functions that deploy quickly, but if you're deploying a database, that could take 20 minutes just by itself. So what do you do about that? There are a couple of things you can do to speed things up. One is to run your tests in parallel. This of course doesn't make any individual test faster, but at least your whole test suite is only as slow as the slowest test rather than everything running sequentially, and that's
useful because these tests can take a while. Telling tests to run in parallel in Go is really easy: you just add t.Parallel() to the top of any test function, and then when you run go test, all of the tests that have that will run in parallel. If you go back and look at the actual test code in this example repo, you'll see that every test has t.Parallel() as the very first line of code in the test. There is one gotcha, though, which is that you could run into resource conflicts if you're not thoughtful about this. Here's what I mean by that. Your modules, whatever it is that you're testing, your infrastructure code, are creating resources. For example, here we're creating an IAM role and a security group in AWS, and those resources might have names. In this case AWS actually requires that the names of IAM roles and security groups be unique. So if you hard-code the name into your code and you run two tests in parallel and they both try to use the same name, you're going to get a conflict and the tests will fail. What you need to do is namespace all of your resources; in other words, provide a way to override the default name so that you can set it to something unique at test time. I'll just show you a couple of real-world examples of that. If we go look at our serverless app, that module I've been using, you can see it creates a Lambda function, and the name is set to this input variable. It does the same thing with the IAM role, and basically for all the other named resources the name is configurable. And then when we're using that code, if we go look at our hello world app, we set the name to var.name, which has a default, but at test time we're gonna override that default. So this is the one piece that I hadn't shown you before: if you look, we pass in a name variable in our tests, which we set to include this unique identifier. There's a little function in Terratest that basically generates a six-character string. There are something like 56 billion possible
combinations, and it's a randomized value, so this gives you a pretty good chance that two names are not going to conflict. So if you override all of the names in all of your test code with something pseudo-random like this, you're gonna avoid these resource conflicts. What's interesting is that this isn't just useful for testing. You should get into the habit of namespacing resources anyway, because you might want to deploy two copies of the serverless app in a single environment or across multiple environments, so being able to namespace things is useful for production code too. We do something similar for Kubernetes as well: Kubernetes actually has a first-class concept of namespaces, so at test time we generate a randomly named namespace and we deploy all of our code into that namespace to ensure that it does not conflict with anything else that happens to be in the same Kubernetes cluster. So namespacing is very, very important in general, but especially for automated tests that run in parallel. All right, one more concept that's pretty useful to know about is test stages. If we take a look at this proxy app integration test, there are really five steps, five stages, in that test: we deploy the web service and the proxy app, then we validate the proxy app, and we undeploy it, and then we undeploy the other thing. Now, in the CI environment you need to run all of these steps; that makes sense. But when you're coding locally, especially when you're first writing this test, you might want to be able to iterate on just some inner portion of this thing. Maybe you're working out how to validate the app correctly and you just want to be able to rerun the validate step over and over again, and you don't want to run the rest of the stuff. But as the code is written initially, you don't really have a choice, and that's a problem because all those other steps have a lot of overhead. You might want to run the validate step, which takes seconds, but the test will force
you to pay five to ten minutes of overhead for every single test run, so it's very annoying. You can work around that: whatever test tool you're using ideally supports this idea of test stages. Here's what it looks like. I'm not going to run this one; I'll just walk through the code really quickly in the interest of time. This was our original test structure: we're deploying the web service and the proxy app, and validating. What we're gonna do is basically wrap those in functions. So it's the same thing, there's deploy web service, deploy proxy app, and validate, but you'll see there's this new thing called stage. I'm using that just as an alias so the code actually fits on the slide. I basically wrap all the code with this little function, all the actual deployment code moves into these named functions, and each stage has a name. You can name it whatever you want as long as it's unique. The point of doing this is that now, if I have a stage called foo, I can tell Terratest to skip that stage just by setting an environment variable, SKIP_foo, to whatever; you can set it to any value. So here's how you might use this. You might run that integration test, and the very first time you run it you tell it to skip the cleanup steps. When you run the test, it's going to run deploy web service, deploy the proxy app, and validate, but it's not going to clean anything up, so those services will keep running in the background. Now you can rerun the test and skip the deployment steps as well, so the next time you run the test it's just gonna run the validate step, over and over again, and that takes seconds rather than minutes. This allows you to iterate locally much, much faster. You can also make changes manually, you can inspect things, you can debug things. It's basically as if you're pausing the test in the middle; that's really what we're doing, with just some environment variables. And then when you're done you
can basically clean everything up again and you're done. So test stages are very, very useful. The one thing you have to do to make them work, besides wrapping your code in functions, is this: since we're running these tests in separate processes, right, we're running go test over and over again and those are separate processes, if two stages need to share data they can't just pass it in memory like they were doing before. So whatever data you need to pass, which is usually just these terraform.Options things, you need to write to disk and read from disk. For example, the deploy web service code will store the terraform.Options into the temp folder, and there's a helper to do that, so it's a one-liner. Then the cleanup web service code, which needs those terraform.Options to know what to clean up, is gonna read them from disk. That allows you to have these completely independent test stages. If you want to see the real version of that, grab that repo, and in here there's the integration test with stages. So here it is: here's my deploy step, another deploy step, validate. You can see each of these is wrapped in this test stage thing, and they're all loading and saving various things to disk. I will personally tell you that this simple thing, basically a hack, has helped me keep my sanity. Some of these tests take a really long time, and the ability to rerun pieces in seconds rather than waiting 20 minutes is huge; it's incredibly valuable. All right, one other pro tip has to do with retries. Another thing we learned from long experience is that infrastructure in the real world can fail for a whole bunch of intermittent reasons. I don't mean bugs in your code, but things like EC2 gave you a bad instance, or there was a brief outage somewhere, or some intermittent issue of that sort. If you don't do anything about it, your tests can become very flaky; they will basically fail for reasons that have nothing to do with actual
bugs in your code. The easiest solution for this is to add retries. You already saw that with the HTTP requests in Terratest, where we were doing those in a retry loop, but you can actually do retry loops all over your code, and some of them are natively supported by Terratest. So in that terraform.Options thing, in addition to saying where your code lives and passing variables, you can also say: hey, if you see an error that looks like this (and this is actually a very common error you hit with Terraform, these TLS handshake timeouts are very frustrating), retry up to three times, with three seconds between retries. This will make your tests much more stable. All right, one more category of tests to talk about, which is end-to-end testing. The idea here, as the name implies, is to test everything together. But how do you actually do that? If you have a big, complicated infrastructure, how do you actually test that end to end? You could try to use the exact same strategy I've been showing you this whole talk: deploy everything from scratch, validate, undeploy. But that's not a very common way to do end-to-end testing, and the reason for that has to do with this little test pyramid: static analysis and unit tests at the bottom, then integration tests, then end-to-end tests. The thing about this pyramid is that as you go up, the cost to write the tests, how brittle the tests are, and how long they take to run all go up very, very quickly. These are some really rough numbers, and obviously it depends on your particular use cases, but typically static analysis runs in seconds, unit tests in a low number of minutes, integration tests take more minutes, and end-to-end tests from scratch take hours. Most architectures, even if completely automated, can take hours to deploy from scratch, test, and then undeploy at the end. So that's unfortunately too slow. The other issue is brittleness, and you can actually see this by doing a little bit of math. Assume that some
resource you're deploying, an EC2 instance, a database, whatever it is, has a one-in-a-thousand chance of some random, intermittent, flaky error. I don't know if this is an exactly accurate stat, but it's probably somewhere in the ballpark. You can do a little probability calculation and see what the odds are of a test failing for flaky reasons based on how much stuff you're deploying in that test. If you have a unit test that's deploying just a handful of resources, about ten, and each one of those has a one-in-a-thousand chance of failing, then your chances of failure go up to about one percent. If you're deploying fifty resources in an integration test, the chance that you get some kind of flaky or intermittent error is around five percent. And if you try to deploy your entire architecture, which has hundreds of resources, we're talking a 40 or 50 percent chance of something somewhere hitting that one-in-a-thousand chance. You can work around 1% and 5% with retries; that's what the retries help you overcome. But there's nothing you can do if 40% of the time your tests are failing for flaky reasons; that's gonna be very, very painful. So unfortunately, doing end-to-end testing from scratch tends to be just too slow and brittle in the current world to be useful. The real way to do end-to-end testing is incrementally. What I mean by that is you set up a persistent test environment: you deploy everything from scratch, which will take hours and be annoying, but you do that once and you leave it running. Then whenever you go and update one of your modules, you roll out the changes to just that module. This is what your commit hooks are doing: they're not deploying everything from scratch, they're just updating an existing architecture with each change and then validating. You can run InSpec or whatever you want to validate that things are still working as expected. So this will be approximately the same
as unit testing or integration testing: it's not going to take that long, it'll be reasonably stable, and it'll actually give you a lot of value in seeing that your entire stack is working end to end. As a bonus, you can test not only that the thing works after the deployment, but you can actually write a test that tests the deployment itself. For example, one very important thing is: is my deployment zero-downtime, or every time I roll out a Kubernetes service do my users get 500 errors for five minutes? You can actually test that; we have a whole bunch of automated tests around exactly that. So this is a really nice way to do end-to-end testing. All right, wrapping things up, here's an overview of all the testing techniques I talked about. Apologies for the small font on this slide; I'll go over it and summarize really quickly. Static analysis: it's fast, it's easy to learn, and you don't need to deploy any real resources, so you should use it. The only downside is that it's very limited in the kind of errors it catches, and just because my static analysis passed doesn't give me that much confidence that my code works. So if you're doing nothing else, at least do static analysis, but don't stop there. Unit tests tend to run fast enough, they take a low number of minutes, they're mostly stable if you do retries, and they give you a lot of confidence that the individual building blocks you're using work as expected. The downside is that you do have to deploy real resources and you do have to write some real code. Integration tests are pretty similar; the only real difference is that they're even slower, which is a bummer, so you're gonna have fewer of those. And then end-to-end tests are a similar thing, but if you do them from scratch they're way too slow and brittle, so do them incrementally, and then they'll have similar trade-offs to unit tests and integration tests. So which ones should you use? The correct answer is, of course, all of them. They all catch different types of
bugs, and you're going to use them roughly in this proportion; that's actually why it's a pyramid. You want to have a whole bunch of unit tests and static analysis and catch as many bugs as you can at that layer, then a smaller number of integration tests, and a very small number of high-value end-to-end tests. So, to wrap up: infrastructure code is scary when it doesn't have tests. In fact, I've heard that's actually the definition of legacy code: any code that doesn't have automated tests. So you can fight that fear and build some confidence in your life by writing some automated tests. Thank you very much.
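The brittleness arithmetic from the end-to-end testing part of the talk can be reproduced with a small Go sketch. It is not code from the talk: the function name `flakyFailureProbability` is hypothetical, the 500-resource figure is an assumed stand-in for "hundreds of resources", and the calculation assumes each resource fails independently with the speaker's rough one-in-a-thousand rate, so the chance of at least one flaky failure among n resources is 1 - (1 - p)^n.

```go
package main

import (
	"fmt"
	"math"
)

// flakyFailureProbability returns the chance that at least one of n
// deployed resources hits an intermittent error, assuming each resource
// fails independently with probability p: 1 - (1 - p)^n.
// (Hypothetical helper for illustration; not part of Terratest.)
func flakyFailureProbability(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// The talk's rough per-resource estimate: one in a thousand.
	const perResource = 0.001

	// 10 ~ a unit test, 50 ~ an integration test,
	// 500 ~ an assumed size for a full architecture.
	for _, n := range []int{10, 50, 500} {
		fmt.Printf("%3d resources: %.1f%% chance of a flaky failure\n",
			n, 100*flakyFailureProbability(perResource, n))
	}
}
```

Under those assumptions the three cases come out at roughly 1%, 5%, and 39%, which is in line with the figures quoted in the talk and shows why retries are enough for small tests but not for a from-scratch end-to-end deployment.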
Info
Channel: InfoQ
Views: 36,475
Keywords: DevOps, Terraform, Docker, Packer, Kubernetes, Cloud Computing, Testing, Automation, IT Service Management, Automated Deployment, Automated Testing, Infrastructure as Code, Containers, QCon, QCon San Francisco, InfoQ
Id: xhHOW0EF5u8
Length: 47min 59sec (2879 seconds)
Published: Mon Dec 09 2019