Chaos testing with k6, Prometheus, and Grafana (Schrödinger's Pokémon)

Captions
Good morning, everyone. I'm Nicole van der Hoeven, and I'm a developer advocate at k6.io. I've been in performance testing for over 10 years now, and this is "Schrödinger's Pokémon: Observability for Chaotic Load Testing." Today I'm going to combine three things that I love: Pokémon, load testing, and observability. I'm specifically going to use k6 as the load testing tool, and Prometheus and Grafana for metrics and visualization. I do happen to think that this is a killer combination, despite my biases, for reasons I'll get into later, but the principles I discuss about load testing and chaos engineering will be applicable to any project, regardless of your tool stack.

Here's what you're going to learn in the next 40 minutes. First, what is chaos engineering? We'll talk about what it means, the process it involves, and how I think it relates to testing. Second, I'll go through the steps of chaos engineering using a real application: I'll talk you through the tools I used, the script I wrote, the tests I ran, and the results. And third, I'll wrap up by talking a bit about why observability is critical for chaotic load testing.

Here's a definition of chaos engineering from principlesofchaos.org: "Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capacity to withstand turbulent conditions in production" (the emphasis is mine). Notice that this definition doesn't talk about preventing turbulent conditions or failures; it talks about finding out how the system will respond to them.

There are four steps to engineering chaos. In the first step, we define and prepare a steady state. Think of this like the control group: before we start running tests and trying to break things on purpose, we need a good understanding of what normal looks like for our application, so that we have a baseline to compare our test results to. The second step is building a hypothesis. A hypothesis is a statement of what you're testing, and this is a good opportunity to identify very specific assumptions we've made about our application. It's not enough, for example, to say that we want the application to respond quickly; we need to identify concrete metrics that we can measure quantitatively. In the third step, we execute experiments that try to disprove our hypothesis. That part's important: we're not trying to confirm what we already think, we're going out of our way to prove ourselves wrong. And finally, in the fourth step, we analyze the results. We compare the results of our experiments with our hypotheses and think about what we could do to improve the application.

Does this process sound familiar? In a lot of ways, I think it sounds a lot like testing. In testing, we start by establishing a baseline: how does the application work now? Then we discuss with stakeholders how we want it to work in the future, and we formalize our objectives into requirements. We write tests that will check whether those requirements are met, we execute them to verify how the application behaves, and finally we analyze the results. Doesn't that sound pretty similar to you?

This might be a controversial opinion, but here's what I think: chaos engineering is a testing discipline. First, they're similar in intention. The goal of both a chaos experiment and a test is to improve software quality through repeated and methodical verification. As I showed in the previous slide, their processes are pretty similar. They share some similar concerns, like how close an environment is to production, the reliance on test data, and the importance of those initial requirements. And, of note for testers like us, I think there's a lot of overlap in the skill set required between chaos engineering and testing.
But there are still some key differences. One difference is in the attitude: in testing, we seek to prevent failures, while chaos engineering's primary focus is to handle failures gracefully. It takes as a given that failure is going to happen; we're just not sure from what source. Chaos engineering encourages us to think "what if": what if the main server goes down? What if another page gets all the traffic? What if one of the services experiences an outage? It's a slight shift in perspective, but I still think the two are compatible. Another difference is the tools they require. Chaos engineering is still relatively new compared to software testing in general, so the two tend to require different tools. I'm not aware of many fully featured testing tools that can also run chaos experiments, and I'm also not aware of chaos tools that can run full testing suites. This is a big obstacle to testers fully adopting chaos testing, and in the next few slides I'll show you how I managed to get over it.

Here's the application that I want to test, or at least the publicly available version of it that I started from. It's called PokeAPI, and it's essentially a big database of different Pokémon throughout the various games. They have an API that you can use to query the database for Pokémon, moves, items, abilities, and much more.

Here's the problem. Imagine that Pikachu here is my application. What happens when I put it into a box? Well, I don't see how it's doing; the box prevents me from seeing exactly what's happening. Since I've just put it into the box, I can assume that it's fine for now, but this is as much as I know about the app, or as much as I would know if I just deployed it as is. But what if things happen to the box in the meantime? In production, many things happen that you may not expect. How does Pikachu, for example, handle high load? How does it respond to that? And how does it respond to chaos, maybe in the form of Team Rocket? Erwin Schrödinger was an Austrian physicist who conducted a similar thought experiment (although he didn't use Pokémon) to prove a point about quantum superposition, and I'm liberally repurposing it to describe the situation Pikachu is in at this point. While Pikachu is in the box, I don't actually know whether it's alive or dead. That's something I'll have to address for my chaos testing, because I need to know how Pikachu is doing.

The first step in chaos testing (and at this point I'm using the term "chaos testing" interchangeably with "chaos engineering") is defining the steady state. What is normal for the application? How does it behave before throwing chaos and load into the mix? Here's the structure of my app. I took the source code for PokeAPI, which is available on GitHub, and I created Kubernetes manifests for it; here's a link to my fork of the repo that includes those manifest files. By the way, all these links, as well as the slides, will be available at the end of the presentation, so you don't necessarily need to take screenshots. Then I put the application on a Kubernetes cluster managed by DigitalOcean. I set it up with three physical nodes and a few pods allocated between them: it consists of two web pods, three app pods, a pod for the database, and one for the cache.
I also added replicas to improve availability and performance. Then, to address the issue of not being able to see into the box, I installed Prometheus and Grafana on the cluster. Prometheus is a time series database, and it's also a great tool for capturing metrics; it works out of the box for capturing Kubernetes metrics, and I like out of the box. Grafana is a visualization tool that lets me see metrics from Prometheus and k6. I chose Grafana because, like Prometheus, it's part of the Grafana Labs stack of projects, so they work well together. Just a disclaimer here: I would be remiss not to mention that k6, the company I work for, was acquired by Grafana Labs a few months ago, so my tool selection is certainly biased. For what it's worth, though, I've also created a version of this presentation using New Relic and k6, so I think you can substitute any observability platform, even though it's difficult to look past Prometheus; it's a CNCF project for a reason.

Luckily, it wasn't difficult to install Grafana and Prometheus. Here's the series of commands that I ran to install them onto my cluster via Helm. Helm is a package manager for Kubernetes, so essentially I had to add a Helm chart for both of them and then run the commands to install them. Prometheus does have a UI, but it wasn't really necessary in my case, since I wanted everything to go to Grafana so I could look at all the metrics in one place, rather than also looking at them in Prometheus. Then I forwarded the port that Grafana was running on and got the password to log into the UI. That way, I'm running Grafana inside my cluster in the DigitalOcean cloud, but I can still access the Grafana UI locally. Then I had to make sure that Prometheus and Grafana could talk to each other. Luckily, they're built to work well together, so after I added Prometheus as a data source in Grafana, I found that there was already a Grafana dashboard specifically for monitoring Kubernetes clusters using Prometheus. That meant I could immediately see information coming in about my cluster. This is what the ready-made dashboard looked like in Grafana before I even did anything to it; I find it's a great place to start.

So this is what my setup looks like: I have my Pokémon in a box, but Prometheus and Grafana are installed inside the box. Prometheus is gathering information about Pikachu (I imagine it's checking for vital signs, respiration rate, and so on), and Grafana is able to see those metrics and make them available for me to access outside the box. This approach works really well, because now I know exactly what's happening inside the box, and I can get immediate feedback about how the application is responding to my experiments.

Installing Grafana and Prometheus was essential for me to establish the steady state, and it goes back to the idea of a control group to compare against the experimental (or "out of control") group. In a real, production-ready application, you might not need to do much more than observe and monitor via Grafana as users access the application, but since in this case I'm the one using the application, I also wanted to run a small test of 10 users to serve as the baseline, just enough to introduce some load. To do that, I used k6. k6 is a load testing tool that is designed to be performant; it's written in Go, and JavaScript is the scripting language, so it's a lot easier to get started. And here's what a k6 test script looks like.
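The script itself is on a slide rather than in these captions, so here is a minimal sketch of what a baseline k6 script along those lines might look like. Only the 10 users, the two checks, and the think time are described in the talk; the endpoint URL, the hard-coded Pikachu lookup, the 45-minute duration, and the 3-second sleep are illustrative assumptions.

```javascript
// Minimal sketch of a baseline k6 script, as described in the talk.
// The URL placeholder, duration, and think-time value are assumptions.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,          // baseline load: 10 virtual users
  duration: '45m',  // roughly the constant-load portion of the baseline test
};

export default function catchEmAll() {
  // Query the app for information about Pikachu (placeholder host)
  const res = http.get('http://<my-cluster-ip>/api/v2/pokemon/pikachu');

  // Verify the response: HTTP 200, and the body actually mentions Pikachu
  check(res, {
    'status is 200': (r) => r.status === 200,
    'body contains pikachu': (r) => r.body.includes('pikachu'),
  });

  // Think time to simulate a real user reading the information
  sleep(3);
}
```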
This is the catchEmAll function, and I'm querying the app for information about Pikachu. I'm checking whether the response that was returned has an HTTP 200 status, and I'm also checking whether the response includes the word "pikachu", just so I know that it wasn't an error. Then I also included a think time to simulate a real user reading that information. I ran this test with 10 users from my local machine, but I also output the results to k6 Cloud. That's an optional step, but I just like the extra visualizations. There's also a shareable link to these results, so if you go through the slides later you'll be able to click on it and see the results for yourself. Then I looked at the Prometheus metrics of my application in Grafana during the test, and this is what it looked like: this is the network I/O, this is the CPU utilization across all nodes, and then the memory utilization, split up by node.

Here's what I learned from that test. From the k6 side, with 10 users running for a total of one hour, of which 45 minutes was a constant load, the application's 95th percentile response time was measured specifically for requests that returned HTTP 200. There was also a 1.16% error rate, which is actually to be expected, since there are Pokémon in the CSV file that I used that don't exist in the database. The response time isn't great, but remember that this is a demo application, so I didn't actually expect it to be production-ready. From the server side, the CPU utilization reached a maximum of 53 percent across all of the nodes, and the memory utilization was 60 percent at its highest point; none of the nodes had CPU or memory utilization that exceeded 80 percent. There were also no pod restarts, and there were a total of 32 running pods at any given point during the test. These are all solid, specific, quantitative ways to measure the normal behavior of the application, so that when one of these metrics strays too far from what we now know is normal, we'll be able to identify it and determine why.

The second step is to formulate the hypothesis: what experiments do we want to run, and how do we think the application will respond to them? I knew that I definitely wanted to do some sort of load test. Given that this is a demo app, I think a 40-user load test is enough of a load, because that's four times the baseline. I also wanted to know what would happen if one of the pods was terminated: how long would it take for another to take its place? It's not as important to think about why a pod might be terminated; remember that we're doing chaos testing, so instead of asking why, let's ask "what if": what if it happens? To figure this out, I made the second experiment involve terminating an app pod, and the third one terminating a web pod. To complete these hypotheses, I identified two metrics for each of these experiments: the average error rate should still be less than or equal to 5 percent, and the 95th percentile response time should be less than or equal to 5 seconds.

Step 3 is running the chaos experiments. This step is where hypotheses are proven true or false. I actually have six hypotheses if you count every combination, but there are only three experiments. For the first experiment, I wanted to do a load test, so I modified the same k6 script by ramping it up to 40 users and left everything else the same.
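The modified options aren't shown in the captions, but in k6 a ramp-up like this is typically expressed with stages. This is only a sketch: the talk specifies the 40-user target, while the stage durations here are assumptions mirroring the one-hour shape of the baseline test, and the request logic from the baseline sketch above is assumed unchanged.

```javascript
// Options section for the 40-user experiment; the catchEmAll function from the
// baseline sketch above stays the same. The stage durations are assumptions.
export const options = {
  stages: [
    { duration: '15m', target: 40 }, // ramp up to 40 virtual users
    { duration: '45m', target: 40 }, // hold the load steady
  ],
};
```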
This is what it looks like in k6 Cloud. The 95th percentile response time was significantly higher than that of the baseline, at 17.66 seconds, but surprisingly, the average error rate for the test didn't increase significantly, ending at an average of 1.22 percent.

The next experiment involved terminating Kubernetes pods. Now, I could just rerun the test at maybe 10 users and also run some kubectl commands in my terminal, but the problem with that is that it's really manual. If a colleague of mine needed to rerun the test, I'd have to leave instructions somewhere and tell them exactly when I terminated which pod, and they might do it a different way, so it's not very repeatable. The approach I took instead was to include it in the script. A cool thing about k6 is that it can be extended. There's a tool called xk6 that allows you to add functionality to k6: you have k6, you have an extension for it, and you put them together using xk6. For chaos testing, I used an extension called xk6-chaos, made by my colleague at k6, Simon Aronsson. These commands show how to install xk6, how to create a custom version of k6 that includes the xk6-chaos extension, and finally how to run a test using the new, extended version of k6.

But I haven't added chaos testing to my script yet, so this is how I do that. In this screenshot, I'm showing only the chaos testing parts. It consists of one function, the kill-app-pod function. It retrieves the list of pods using pod.list, iterates through each of them looking specifically for the ones whose names start with "app", and then terminates a selected pod. This is a different approach to chaos engineering than what exists in other tools. Usually, chaos engineering tools are declarative and written in YAML files, which means they're also limited, because you can't easily include conditions like the one you just saw, where I was iterating through the pods looking for a particular pattern. This is really what I think chaos testing should be like: when it's in JavaScript, you can terminate random pods, terminate multiple pods, or maybe even skip termination depending on the results of the load test so far.

Here's how I put it all together in one script. This is the options section within the same k6 test script, and it lets me schedule when the chaos experiment will happen. The first scenario is the PokeAPI load test, which is currently set to ramp up to 10 users over 15 minutes and then hold steady at that 10-user rate for 45 minutes. The second scenario is the chaos test, which executes the kill-app-pod function. It runs with only one user and one iteration, since we only want it to happen once. Note that I gave it a start time of 30 minutes: we want the users to ramp up and the test to run for a while, and I wanted to be really clear about when the app pod is terminated, so that we have data both before and after the termination. I also set up some thresholds. Thresholds are a way to set conditions for your test; if the test doesn't conform to those conditions, k6 will fail the test. The first threshold says that the rate of failed requests should be less than or equal to 5 percent, and the second says that the 95th percentile response time should be less than or equal to 5,000 milliseconds, or 5 seconds. Those mirror my hypotheses.
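The slide code itself isn't in the captions, so this is a hedged reconstruction of what such a combined script might look like. The scenarios, executors, startTime, and thresholds are standard k6 options; the `k6/x/chaos` import path and the `pods.list()` / `pods.kill()` calls are assumptions pieced together from the talk's description of xk6-chaos, not its documented API, and the URL is a placeholder.

```javascript
// Sketch of the combined script: a load scenario plus a one-shot chaos scenario.
// The chaos module name and its list()/kill() methods are assumptions (see above).
import http from 'k6/http';
import { check, sleep } from 'k6';
import pods from 'k6/x/chaos/pods'; // hypothetical module path

export const options = {
  scenarios: {
    pokeapi: {
      executor: 'ramping-vus',
      exec: 'catchEmAll',
      stages: [
        { duration: '15m', target: 10 }, // ramp up to 10 users
        { duration: '45m', target: 10 }, // hold steady at 10 users
      ],
    },
    chaos: {
      executor: 'per-vu-iterations',
      exec: 'killAppPod',
      vus: 1,
      iterations: 1,    // terminate a pod only once
      startTime: '30m', // 30 minutes into the test
    },
  },
  thresholds: {
    http_req_failed: ['rate<=0.05'],    // hypothesis: error rate <= 5%
    http_req_duration: ['p(95)<=5000'], // hypothesis: p95 <= 5000 ms
  },
};

export function catchEmAll() {
  const res = http.get('http://<my-cluster-ip>/api/v2/pokemon/pikachu'); // placeholder
  check(res, {
    'status is 200': (r) => r.status === 200,
    'body contains pikachu': (r) => r.body.includes('pikachu'),
  });
  sleep(3);
}

export function killAppPod() {
  // Find the pods whose names start with "app" and terminate one of them.
  const appPods = pods.list().filter((p) => p.name.startsWith('app'));
  if (appPods.length > 0) {
    pods.kill(appPods[0].name); // hypothetical call, as noted in the lead-in
  }
}
```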
Now for the results. This is the experiment where I terminated the app pod 30 minutes after the start of the test. At this point, I expected the error rate to be noticeably higher, at least until the pod was replaced, but it isn't even visible here. The response time, though, was slightly higher than that of the baseline, at 5.31 seconds.

The next experiment involved terminating a web pod halfway through the test. It's like the app pod experiment, but I changed it so that it would randomly terminate a web pod; the dots again indicate where that termination happened. The average error rate was 1.09 percent, which was well within what I expected, and again the response time was only very slightly higher than expected.

If we stopped here, what we've looked at so far is just the k6 side of the story, but k6 is outside of the box. That's valuable information, but we still need to look inside the box. Here are the three experiments that I ran, with the total CPU utilization for each of them during the test. Going over the results of the experiments: the error rate hypothesis passed for all three, meaning that the average error rate stayed under 5 percent for the duration of all three experiments, but the response time hypothesis failed for all of them; the 95th percentile response time for the transaction I was measuring was higher than 5 seconds in every case. It could be that the 5-second response time I chose for the hypothesis was too idealistic, and indeed the app and web pod terminations didn't significantly exceed it, at 5.31 seconds and 5.09 seconds. The fact remains, though, that a hypothetical user of the PokeAPI app in each of these three scenarios would be more likely to leave because of the response time than because of the errors they got in response to their queries.

So the question now is: why was the response time so high? Because I had the server-side metrics, I could explore them to see what I could find. One thing I found when I drilled down into the CPU utilization of the individual nodes was that the CPU utilization for one of the nodes almost flatlines against that 80 percent line, but it doesn't cross it. I didn't see this when I was looking at the total CPU across all three nodes, because it's averaged out. And then I remembered something: this thing here. The last time I gave this presentation, one of the suggestions I received from others more experienced in Kubernetes than I am was that I should create resource limits, so that's what I did. That's sound advice, if you can scale out your cluster to make sure you still have enough resources for your app. What happened was that my fledgling little Kubernetes cluster, with nodes that I was paying for personally, did not have enough resources. It turns out that Kubernetes treats CPU as a compressible resource, which means that if a container approaches the CPU limit you set, Kubernetes throttles that container, which can actually lead to worse performance overall than you were expecting.

So was that what was happening here? To test this, I reran the experiments, but this time I removed the resource limits entirely. The biggest difference was in experiment A, the load test. Both of these graphs show CPU utilization, but the one above is from a test without those resource limits. Without resource limits, the 40-user load test did indeed blow past the 80 percent CPU utilization line, and it actually went as high as 100 percent CPU utilization.
But what about the response time? Without the resource limits, the response time was still 13.31 seconds. That wasn't enough of a difference for the hypothesis to pass, but when you compare it to the result with the resource limits, it's still 24.63 percent lower.

The story continued with the second experiment. I didn't think it would make a difference, since the CPU utilization with the resource limits wasn't even very high with 10 users, but I was surprised: without the resource limits, the CPU utilization of at least one of the nodes did increase above 80 percent, and the response time actually improved to 4.77 seconds, which is well within the hypothesized figure of 5 seconds.

Let's see what happened with experiment C. This was the web pod termination, still at 10 users, and it had already been pretty close to the response time hypothesis. Without the resource limits, when it was able to use just a little more CPU, the response time dropped significantly, to 3.63 seconds. So now both of the pod termination experiments pass in terms of response time, but only when I don't set the resource limits. It's important to note that the takeaway here is not that resource limits are bad. The problem was that the app is pretty CPU-intensive, and it wasn't getting enough processing power to respond to requests quickly. Ideally, I would still institute resource limits, but then also scale out my cluster. The takeaway is that being able to look at this data allowed me to spot issues in my configuration, even though I wasn't expecting them and wasn't specifically looking for them.

The main difference between traditional monitoring and observability is that in traditional monitoring, you look for specific metrics and set up counters to measure them. In observability, you measure everything that's in the box, in a non-obtrusive way, so that you can explore the data later. In this case, I wasn't testing my container resource limits (although, looking back, it seems obvious that I should have been); instead, it was only by exploring the data and looking into individual node CPU utilization that I got the clue about what was happening.

So let's go back to our Pokémon in a box. When we're testing an application for its reliability, being able to see into this box is essential. Without it, we can launch test after test against the application, but we don't get any feedback in return. Even when the load testing tool is giving us information about the test itself, it's not enough, because it's only telling us what we are doing, not how the Pokémon is doing. That means we can't really draw any meaningful insights about how the Pokémon will behave when exposed to the real world. Observability tools like Prometheus and Grafana make the box transparent so that we can see what's happening. Setting up the right monitors lets us get acquainted with how the application behaves when it's not under stress, which also lets us immediately identify (and maybe even set up notifications for) when the Pokémon is struggling. It gives us a better understanding of the events our application can't recover from, and those it can take in its stride. And it's only through observability that we can really determine whether the Pokémon is alive or dead.

Thank you all for listening. I'm Nicole van der Hoeven; feel free to hit me up at any of these sites. You can see my Twitter handle here, and the slides themselves will be available at
slides.nicolevanderhoeven.com. Thank you for your attention.
Info
Channel: k6
Views: 356
Id: 2QHs_HEX7r0
Length: 31min 10sec (1870 seconds)
Published: Tue Sep 21 2021