Hey friends, it may seem counterintuitive, but chaos engineering is about breaking your applications for a good cause: resiliency. You can use Azure Chaos Studio to inject faults into your application to simulate real-world problems. John Engel-Kemnetz is here to show me how you can test and monitor how your application responds to chaos, today on Azure Friday.

Hey friends, it's Azure Friday and it's chaos, absolute chaos. How are you, John?

I'm doing well. How are you, Scott?

I'm living the dream, and I'm excited to know why chaos is the word of the day and why it's the best thing for my cloud application.

Yeah, for sure. So it comes out of this new practice called chaos engineering. It's really evolved over the past couple of years. Chaos engineering is the idea that you are going to proactively find failures in your applications by causing those failures in reality, and this is all in the best interest of improving the resilience of your application. So common things you might see somebody doing are injecting high stress into an application to see, hey, can my application scale up or handle a noisy neighbor? Or failing a network to see, will you fail over to another region or another availability zone?

Interesting. So I want to say 20-plus years ago, we didn't call it chaos engineering, but we would occasionally do things like unplug the SQL Server, and then we'd see, well, can you lose a local cache? Or we'd have a write fail, and then we'd say, all right, well, we're going read-only, and then we'd have a banner on the site that says, you know, we're read-only, or you can't use the shopping cart, but the product catalog still works. So what I'm hearing you say is you want to throw a wrench into the works, but you don't want the entire car to collapse.
It shouldn't fall apart; it should still do something.

Exactly. And I think with chaos engineering in particular, they found this very good way to structure how you do that sort of chaotic event, which always starts with: let's first understand our steady state, or healthy state, for the application, and make sure we're looking at that in our data. Then formulate a hypothesis, come up with an experiment to validate or invalidate it, look at the results, and then make an improvement based off the results. So it very much uses the scientific method to make sure that you're not just willy-nilly unplugging things.

Interesting. So what we were doing many years ago was willy-nilly, like, "Hey, go unplug that, see what happens." There was no formality to it. What you're describing is like when security turned into threat analysis and assessment: you're doing a chaos assessment and then applying the scientific method. I like that. So it sounds like chaos engineering isn't just randomly doing stuff. It's a formal way of thinking about your systems.

Right. And in fact, one formal way of thinking about your system is what happens when there is absolute chaos, right? But complete havoc on a system certainly isn't the only applicability of chaos engineering. It can be more structured, you know: we want to make sure that this one failure mode is accommodated for in our architecture.

And I assume you could potentially sort your failure modes by impact and say, we can't handle all things. There are going to be times the site's going to go down. DNS happens; it's always DNS. But we could go and sort it, figure out the long tail, and say we're going to handle chaos from here to the left.

Yeah, totally. Especially as a cloud provider, right, on Azure, what we care a lot about is helping our customers make sure that they can recover from any sort of Azure-side failure as well, right?
So should there be any impact or an outage, we want to make sure customers can replicate that on their own and avoid impact should it ever happen again in the future.

Very cool. So where do I get started? How do I do this?

Yeah, so let's go over to the Azure portal and just get started. What I've done is I have a sample application that runs on Kubernetes, and it runs in AKS. It has a website that we're regularly interacting with and a couple of other services, all in their own pods within a namespace that's called the back-end development namespace. This connects up to a set of Cosmos DB instances whose primary write region is East US, and all of them have a backup read region in West US 2. So what we'd like to do is say, hey, what happens if there is a major failure in this application? Will it stand up? Will availability remain high?

So to do that, we're just going to jump over to Chaos Studio in the Azure portal. You can search for it using the search bar, or, since it's in my recently used services, I can just jump in here. If I were doing this completely fresh, the first thing I'd do is come in and onboard the AKS cluster and those Cosmos DB instances that I want to target for fault injection. I've already done that in this case, so we can just jump straight ahead into building our first experiment.

So an experiment is an Azure resource. That means you can develop it as an Azure Resource Manager template. You can use Project Bicep. All the familiar things you're doing to deploy your infrastructure as code, you can apply to the deployment of your chaos tests. So in my case I'm just going to pick a resource group in a subscription and give this a nice name like Azure Friday. I then move into my experiment designer, and this is where I'm actually formulating:
What are the specific failures that I want to cause in my resources, and when do I want those failures to occur? So I can build up what I would call a broader scenario of failures. Now this can be really powerful because you can run actions in parallel, run things that happen in sequence, or have a break in between, like a time delay, so you can really build up what a real-life outage would have looked like and replicate that in the experiment designer.

I'll just do a really quick example here to show you how this all works. The steps are what execute sequentially, and each step can have one or more branches that will run in parallel. So let's say I just wanted to come in and add a fault and say, hey, I want to shut down a virtual machine, or a set of virtual machines that are running an application, for a period of time. So I've got that VM shutdown fault. You can see the full list of faults here, and our fault library is in our documentation, but we strive to cover the most common outage scenarios that customers see. So, virtual machine shutdown: I'm going to give it a duration for how long I'd like the VM to stay shut down, and then I have different parameters based on the fault that I'm selecting. In this case I want to do an abrupt shutdown. That means it is the equivalent of that unplug-it scenario you had. There's no time for an application to save state or for the OS to safely shut down. Boom, it's dead.

So if I may, and this might be a weird question, but is this real? Like, are you really shutting it down? And if you did, what if your application wasn't very resilient? What if by shutting it down you broke something? And how do you make sure you don't actually run chaos in production?

Yeah, that's a great question, and in fact I wouldn't even say it's how do you not run chaos in production. It's how do you control when you're running it.

Yes, exactly.
I want to run it, but I just don't want to get in trouble and be fired because I pushed yes.

Yes. So, to answer your first question: this is not a simulator. We are injecting faults, and they're real. When I shut down that VM, it's shut down. It isn't making it look like we shut down a VM. What you're simulating is a real-world outage, right? But the failures themselves are real things, and so that does mean you need to be safe. We do a couple of things in Chaos Studio to help with this. The first is all about permissions: who has access to create an experiment and run an experiment. And then even beyond that, when you onboard a resource to Chaos Studio, you can be really specific about which resources are onboarded and which faults are allowed for a given resource. So I can say, hey, I'm OK with virtual machine high memory and high CPU failures, but I never want a virtual machine shutdown to be able to happen, 'cause that's more destructive, right? And then the final thing is, with our experiments, you can always stop them and roll back the state of fault injection if something starts to go haywire, if things look way worse than you thought they would. So all of those safety measures are just built into how you work with the product.

Cool.

So that's just a simple example. I'd add a virtual machine shutdown. Let's just say that's the only one that I wanted to add. I'd then go to review and create, and I could create this experiment. Now you'll notice, to your point about safety, there's a little banner here indicating that before I can run the experiment, I have to grant the experiment permission to my resources. So the experiment itself will get a system-assigned managed identity in my Active Directory tenant, and that identity has to have the appropriate access to each resource that it's going to impact, so there's even that additional layer.
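
The step and branch layout John describes in the experiment designer can be sketched as a small Python model. This is only an illustration of the timing rules, not the Chaos Studio API: the class names, the fault names, and the durations are all made up here, and it assumes that actions inside a single branch run one after another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """One fault or delay; the name and duration are illustrative only."""
    name: str
    duration_min: int


@dataclass
class Branch:
    """Assumption: actions inside one branch run one after another."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def duration(self) -> int:
        return sum(action.duration_min for action in self.actions)


@dataclass
class Step:
    """Branches inside a step run in parallel, so the longest branch decides."""
    name: str
    branches: List[Branch] = field(default_factory=list)

    def duration(self) -> int:
        return max((branch.duration() for branch in self.branches), default=0)


def experiment_duration(steps: List[Step]) -> int:
    """Steps run in sequence, so their durations add up."""
    return sum(step.duration() for step in steps)


# Hypothetical scenario: an abrupt VM shutdown in parallel with a delayed
# CPU fault, followed by a second step that is just a pause.
steps = [
    Step("Step 1", [
        Branch("Branch 1", [Action("vm-shutdown (abrupt)", 10)]),
        Branch("Branch 2", [Action("delay", 5), Action("cpu-pressure", 10)]),
    ]),
    Step("Step 2", [Branch("Branch 1", [Action("delay", 5)])]),
]

print(experiment_duration(steps))  # 15 minutes for step 1 plus 5 for step 2
```

Modeling the layout this way makes it easy to check how long a scenario will hold resources in a faulted state before anyone presses start.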
And I mean, I'm going to make an assumption here, and you tell me if I'm right: because we're using user identities and service principals, and this is actually a resource, all of this is auditable. So if anyone is worried that someone could use this for nefarious reasons, the audit log is going to show you everything that's going on.

Exactly. So the activity log will always have any operation we perform, with the experiment name as the identity that performed the action. So what I did for this application is I've got an experiment all set up, and I'll just get it started here while I show you a little bit of what's in here. So first, I have two different branches, so two different things that are happening in parallel here. The first is an AKS pod chaos activity, and what we do is we just use a utility from the Cloud Native Computing Foundation called Chaos Mesh. It's an open source fault injection tool for Kubernetes, and we just leverage that right out of the box for doing fault injection on AKS. Rather than trying to boil the ocean and do something different, we use what we heard customers are already using. So what I've done is I've just set this up to do a pod failure action: over the course of 20 minutes, continuously fail all pods that are in one specific namespace, and that's the namespace running my application.

And this JSON spec, those parameters, is that a standard thing? Like, how do I know that that's the way to pass in parameters?

Yeah, so we're just using the Chaos Mesh specification. Now, Chaos Mesh does use YAML, and so you have to convert it to JSON just to go from the Kubernetes world, where you're typically using YAML, to the Azure Resource Manager world, where we use JSON, right? But it's just a conversion of the Chaos Mesh spec that you would have written in YAML for a Chaos Mesh experiment.

And I assume that my team or my large enterprise could write my own faults.
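
The YAML-to-JSON conversion John mentions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `action`, `mode`, and `selector.namespaces` fields follow the Chaos Mesh PodChaos spec, while the namespace name `my-app-namespace` is hypothetical and the `jsonSpec` parameter key is an assumption about how the fault takes its input, so check both against the current fault library before relying on them.

```python
import json

# The Chaos Mesh spec as it might be written in YAML, shown here as a dict:
#
#   action: pod-failure
#   mode: all
#   selector:
#     namespaces:
#       - my-app-namespace   # hypothetical namespace name
pod_chaos_spec = {
    "action": "pod-failure",
    "mode": "all",
    "selector": {"namespaces": ["my-app-namespace"]},
}


def to_fault_parameter(spec: dict, key: str = "jsonSpec") -> dict:
    """Serialize the spec into one key/value fault parameter.

    The default key name is an assumption, not confirmed by the episode.
    """
    return {"key": key, "value": json.dumps(spec, separators=(",", ":"))}


param = to_fault_parameter(pod_chaos_spec)
print(param["value"])
```

The 20-minute duration from the demo is not part of this spec; in the walkthrough it is set on the action in the experiment designer.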
We could share them with other organizations, they could be faults that are specific to our organization, or, in this case, you found one on the shelf somewhere. I could make my own.

Yeah, you could adapt the Chaos Mesh spec. In terms of more broad custom faults, that's definitely something on our roadmap. We've heard a lot about this: can we just come up with our own faults? Right now, they're a little bit more restricted. We curate the faults that are available into a list that we know will work well, but it is on the roadmap to accommodate any fault you might come up with.

Very cool.

Yeah, so in my experiment I also have a Cosmos DB failover action. This one is going to take those three Cosmos DB instances and attempt to fail them over so that the write region, which was East US, gets demoted to be a read region, and the read region, West US 2, gets promoted to be the write region. So I started it. I can see it's moved into a running state here, and I can always click into details to see how it's going. I can see that my first step is running, and on AKS as well as on my Cosmos DB clusters, the fault is now running.

So let's go ahead and see if it really is. First, I'll go back to my Kubernetes cluster. We can see, before I do a refresh, that everything is ready and healthy in terms of the pods in this namespace. Clicking refresh: oh no, everything's moved into this state where there are warnings. And if I were to click in, I'd see that the containers are cycling between errors and trying to recover, but Chaos Mesh is continuously failing those pods. On the Cosmos DB side, I started with my write region as East US. Let's give it a refresh. All of a sudden my write region is West US 2 for all of these, so that failover has happened successfully. It will stay with West US 2 as the write region until the end of the experiment, which is when we revert all of that state.
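
The promote, demote, and revert behavior described for the Cosmos DB failover can be modeled as a reordering of a priority list. This is a toy sketch, not a call to any Azure SDK: the function name is hypothetical, and it assumes the common convention that the region at priority 0 is the write region.

```python
from typing import List


def promote_to_write(regions: List[str], new_write: str) -> List[str]:
    """Return the regions reordered so that new_write sits at priority 0.

    Assumption: index 0 is the write region and every other index is a
    read region, in failover priority order.
    """
    if new_write not in regions:
        raise ValueError(f"{new_write} is not one of the account's regions")
    return [new_write] + [r for r in regions if r != new_write]


before = ["East US", "West US 2"]                 # East US is the write region
during = promote_to_write(before, "West US 2")    # the fault promotes West US 2
after = promote_to_write(during, before[0])       # reverted when the experiment ends

print(during)
print(after)
```

Writing the expected region order down like this gives the hypothesis something concrete to check against once the experiment finishes.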
So the final thing I would look at: I've got an Application Insights component all set up for me to monitor this, so that I can see availability history and understand the impact of the chaos I caused. And sure enough, if I look into availability here, I have just a ping test on the website. Availability has, all of a sudden, in the last couple of minutes, dropped down to 0%, and that's because that ping test is getting a 404, and it's getting that because those pods running the website are now down. So we can see, sure enough, that the chaos experiment I set up is having an impact. It's breaking something. My next step is going to be to say, OK, where can we make an improvement? Should we add a load balancer in front? Should we attempt to find a way to redirect traffic to another region? What might be the mitigation that prevents us from seeing 0% availability should we have this sort of major failure happening?

Right, and then the engineering team and the architectural team could say, this is a thing that we could solve with an Azure feature, or this is something that we could solve with code, or this is something that we could solve with software-based networking or a CDN or whatever. But you don't know until you cause the chaos.

Yeah, totally.

Very cool. So where can I get started? Do I just go and make a new resource now?

Yeah, sure. To get started, I'd recommend going to aka.ms slash Chaos Studio get started. That's just an easy starting point in our documentation, but feel free to just go to the portal, search for Chaos Studio, and start building your first experiment.

Very cool. And what's the pricing? Do I have this already? Is it per experiment? Is it how often they run?

Yeah, so the pricing will be based off of what we call the target action minute, and that's based off of the duration of the actions you have per target resource.
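
As a rough illustration of the target action minute, the arithmetic is action duration multiplied by the number of target resources, summed over all actions. The sketch below applies that reading to the demo experiment. The 20-minute pod failure on one AKS cluster comes from the walkthrough; the 20-minute duration for the Cosmos DB failover is an assumption, since the episode does not state it, and no price is shown because none is given.

```python
from typing import List, Tuple


def target_action_minutes(actions: List[Tuple[int, int]]) -> int:
    """Sum duration (minutes) times target count over all actions."""
    return sum(minutes * targets for minutes, targets in actions)


demo_actions = [
    (20, 1),  # pod failure: 20 minutes against one AKS cluster
    (20, 3),  # Cosmos DB failover: assumed 20 minutes against three instances
]

total = target_action_minutes(demo_actions)
print(total)  # 20*1 + 20*3
```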
So there's a charge per minute per target resource, but the service is free up until April.

Very cool, so the service is free 'til April. So go and cause some chaos, and then it'll be a nice, you know, thoughtful consumption-based model, which makes a lot of sense. Fantastic. Well, I am learning all about how to cause chaos and make my applications just that much better, today on Azure Friday.

Hey, thanks for watching this episode of Azure Friday. Now I need you to like it, comment on it, tell your friends, retweet it. Watch more Azure Friday.