Resolve Incidents Faster: Transforming Your Incident Management Process

[Applause] Thank you all for the welcome. I'm incredibly excited to be here. Let's talk about incidents.

Speaking with customers over the last year, what's become really clear to me is that there are some key differentiators between the organizations that manage their incidents effectively and the ones that don't. So my goal here today, whether you're in the audience or watching at home on the live stream, is to give you the tangible tips and best practices that will help you close the gap and start operating like the world's best incident response teams.

Let's start by getting on the same page about exactly what type of incident we're talking about. Today's talk is about major incidents: an unanticipated disruption that inhibits the functioning of a service. Those real emergencies that, on a major scale, get in the way of your customers actually using your service, whether they can't access it at all, a critical function isn't working as it should be, or there would actually be an impending emergency if you didn't act immediately.

There are two key reasons I think it's important that we talk about this. Firstly, the impact of major incidents on organizations like yours is larger than it's ever been before. As Scott mentioned during the general session, customers are demanding that 24/7 availability, they're demanding that uptime, and it's getting increasingly difficult to provide that type of service. I'm sure you can recall when you tried to send a text message and it didn't work, or you couldn't get the directions that you needed on Google Maps. It's incredibly frustrating for you, and the same applies to the customers of your software solutions. So when you fail to meet that need, that's potentially clearly measurable in terms of revenue loss, but there's also a reputational impact that it's important to be aware of.

Second key point here: what's making things even more challenging for teams is that complexity is growing in conjunction with that demand. So there
are now more systems, you're potentially running microservices, and those services are generating more alerts. There's more noise in your logs, more things to be keeping track of. And now, because of that customer demand, there are even more people who are stakeholders in the status of your services. They want to know what's going on and they want to see how they can help. It's so crucial to be doing that correctly, and I see a lot of teams struggle with this. If you don't manage it effectively, it can actually make things worse, not better.

So thankfully we've been working incredibly hard. We've got some really smart people working on this, actually building the tools that you need for modern incident management, so that all you need to do is implement the process that I'm going to be outlining today. Quick show of hands: do we have any Opsgenie users in the audience? Anyone tried the product? Okay, cool, a few people. So we'll be looking at Opsgenie today, with a bunch of cool new features that we've just announced this morning, and we'll also be looking at Jira Service Desk, Statuspage, and of course Jira Software. So, everything you need on the technology side to really transform your incident management process.

Just one quick caveat before we begin. What this talk won't do is mean that you'll never have an incident again. But what it is gonna do is really help you thrive in an always-on world and put your organization in a place where you're seeing fewer major incidents over time, you're continuously improving, and you're actually turning the reliability of your services into a competitive advantage for your business.

So let's dive in. These are the four key takeaways that I want to leave you with today. Just the four, and I really racked my brain trying to think about what to fit in here, because there is so much to say, but I think this is really what it all boils down to. And if it seems a bit high level, don't worry, because we'll be going
through each of them individually and pulling out the individual tips that you can check off in your own mind to see if you're implementing them in your organization. So we're gonna break it down into tangible tips and best practices. We've got breaking down team silos; communication, and how to do it the right way; having a game plan, and how to run and review one; and also post-mortems, and running those blamelessly and collaboratively.

Now, you'll notice on the left-hand side of the slide we've got one word for each of the key takeaways: detect, respond, resolve, and learn. When you pull these out, you actually have the four phases that cover your whole incident lifecycle. So that's the core foundation that we're going to be using today: you've got one tip for every phase of your incident. And the reason that I've structured the talk this way is that in the very best teams I've spoken with, they find that when they start strong, that has a knock-on effect and they end up finishing strong. So for the teams that do well in their detection, their response is that much easier. And then if they're communicating effectively during that second phase, they've got customers off their back, they've got management off their back, and they can focus on resolving the issue at hand. So there's a reason we're going through it chronologically. Just think about that knock-on effect as you manage incidents in your organization.

Now, we've got a tracker at the bottom of the slide for mean time to recovery, MTTR. It runs from when the incident first starts to when it's actually fully resolved. Those of you in the audience who are keen-eyed might have noticed that it doesn't actually span the learning phase; it's just the first three phases of an incident. One classic mistake that I see here is that organizations see that and think there's no real tangible improvement they can achieve in that metric by taking up those learning actions, those preventative follow-up tasks and
post-mortems. We're going to look at why that isn't the case, why that's not the right thing to be doing, and how you can actually achieve the most meaningful improvement to help you resolve incidents faster by conducting learning in the right way.

So let's get started. Now, when it comes to a major incident, you want to find out before your customers do. But in most organizations today, even when a customer spots an incident first, that doesn't kick off the right process to actually get it resolved before it's too late. I'm sure you're familiar with the scenario: service desk agents receive a ticket from a customer having trouble with part of the product, a Jira issue is created for developers, and no one really thinks to look at it until you've got a full-blown emergency on your hands, until you actually realize that it's affecting everybody.

It happens because your teams use different products. Your support agents are in Jira Service Desk responding to customers, your developers are in Jira Software, Opsgenie, and other DevOps tools, and management might be in Slack or email messaging individuals. So there's all this chatter going on, but customer information that could actually be critical to resolving the problem faster is not getting to the right people. I'm sure this is something that resonates with all of you, especially those of you on the IT side of the business, because 92 percent of individuals in IT departments actually felt uninvolved in their company's DevOps activities and ambitions. So there's all this work going on to release quickly, and the "you build it, you run it" methodology, but big parts of the business that play a critical role in the running and operation of those services aren't involved in the way that they should be.

So let's make this better. Let's show off a quick demo, because you'll find that one way to break down team silos is to check out one of the pretty neat integrations that we have. I think it's
going to illustrate this quite well. So we're going to link up Jira Service Desk and Opsgenie, and we're just going to jump over to our Jira Service Desk project settings. Now, for those of you on premise, potentially on Server or Data Center, these tutorials still apply; the screenshots are just cloud-based, but it's very much the same thing.

What we're gonna do is add an automation rule in the top right-hand side here, and that's going to allow me to create a custom rule where I can define, with whatever level of granularity I like, what I want to occur when something happens in one of my service desk portals. In this case I've declared that when an issue is created (and you might want to go even further and say when an issue is created in my outages portal), send that straight through to Opsgenie as well as making it available for my agents to respond to.

What this will achieve for you on the Opsgenie side of things is that now, when my developers are looking at their automated alerts, those Splunk logs or those AWS error messages that could be indicative of a serious issue, they're actually seeing the customer outage issues that have been sent through as well. They get all the detail, the title and the description, and they can click straight through to it, potentially see it in line with an AWS error, and go, "Okay, this is actually a serious issue that I need to take action on." They can leave a comment that will be sent straight through to Jira Service Desk so that agents get an idea of what's going on, and agents can reply directly, which will be sent straight back to the Opsgenie side of things.

So that's really just one example, but the key thing to take away here is to think about the tools and products that you're using in your organization, and which ones are very centric to a single silo in the business, and see if you can link
that up with some of the integrations that we have at Atlassian, or potentially some of your other tools, and see if it'll get teams communicating more effectively and potentially get you more involved on the IT side of things.

So, moving on to the second key takeaway: communicate early and often. Really, this is a key differentiator among the teams that I spoke to. The ones that communicated well had the best incident response process. So ask yourself: how effective is your communication during an incident? Let's dive into some fundamental ways to avoid strain between you and your customers and make sure you're going about this in the right way.

The first thing that you want to do is have templated first comms ready. When that serious outage occurs, what you're gonna see is a bunch of service desk tickets popping up from customers. You may be receiving tweets or emails or Slack messages from people in the organization just trying to make you aware that something's wrong and potentially checking in for an update. So you want to get ahead of that by automating your first communications and making it nice and simple to just get something out there and help yourself focus on resolving the issue.

The next big tip: don't rely on your usual infrastructure. Many teams build their own service update or status pages, and they host them in the same place as the rest of their services. So when that major outage occurs and customers are flooding that page just to see if they can get some sort of update on what's going on, they see a blank screen, or worse yet, they maybe even see an error message. That's something that's really easy to avoid. Statuspage offers a solution to this: it's cloud-based and hosted externally. You don't have to use that product, but really the key principle is to make sure it's not in the same place as the rest of your infrastructure.

Next quick point: make sure you're considering your internal stakeholder
update pages. It doesn't have to be the same update as the ones that you're sending out externally. Some teams like to separate these out but have them under the same user interface, so that you don't need to log into something separate. You can just send an update out, send the second one out in the same tool, and move on with resolving the incident.

And finally, I want you to see it as an opportunity to build trust. When you think about the very best teams when it comes to resolving incidents, or incident management in general, they're not the ones that don't experience outages or incidents. Rather, they communicate early and often, they share what they know with customers, and even though incidents are an incredibly frustrating thing, they take that opportunity to really walk the walk when it comes to putting customers first. So it's a really big thing for me, and you definitely see the very best organizations, like Google and Amazon, really take this up.

I want to look at a quick demo here for templating your first communications, because I want to show you how easy it is. We're gonna go through the full end-to-end flow. So I'm gonna click over here to the Statuspage product, and I want to cover the case of my mobile app experiencing an incident. I know it's an important service for me that's had trouble in the past, so when that next incident occurs, I don't want to waste time thinking about the exact wording. I'm going to create a quick template here that just covers a general initial message to let customers know that I'm investigating. I'm gonna pick the mobile app component and hit the Save button, which means that template is now ready to go.

Now let's fast-forward to that real mobile app incident occurring at some point in the future and see how quick it is to actually send that update. I'm gonna hit the Create Incident button, go to the top right of the page, hit that template that I just created,
and then hit the Save button. Those communications have been sent out. It only took three seconds, and I can go straight back to focusing on resolving the problem at hand.

Next up, we're going to integrate Statuspage with our service desk portal as well. If I go into Jira add-ons and install the one for Statuspage, I'll see that update that I just created, the one that only took a few moments, come up right at the top here. So it's actually centralizing my status updates, so that I don't need to go to email to send an update out, and then go to Twitter, and then go to my status page, and potentially even Slack. I've picked the one place, my Statuspage product, to send that one update, and now it's tightly integrated and centralized to go to all of the right places, sending out the one update instead of the five or more.

So that's communication. Now let's move on to the third phase of your incident. Hopefully by this point your team is really battle-ready: you've integrated your products, information's flowing through all of your teams, and you've sent your comms out through a status page. Everyone's ready to go, and the situation is for the most part under control. But this is a really crucial part, where going about it in the right way can make all the difference. What you need to do here is set, share, run, and review your game plan.

Let's look at what you need to cover, but first a quick show of hands: how many people actually have an incident game plan in their organization, some type of runbook that they can look to? Cool, just a few hands. Great. So for everyone who kept their hands down, if you don't have one of these, you're going to want to write one up. Define it, put it in writing. There are a few essential things that you need to include, and for everyone in the audience, maybe just check these off mentally and see if you're covering them.

The first one is your assessment and severity criteria. You need to establish impact. Ask your readers to consider: how many customers may
be affected? What are they currently seeing? Is this a potential security breach, or is there even data loss? These are all questions that we ask ourselves at Atlassian, because it helps us make the right call on the level of severity, and do it quickly. When you've got a total outage, it's a very different response to one button not working for a handful of users, and you want to make that clear, with clear examples and next steps for each level of severity, so that first responders know exactly what actions to take and you can kick-start your response in the right way.

Moving on: escalation rules. Make sure everyone knows that they should never hesitate to escalate, especially more junior team members. Maybe it's the middle of the night, there's a bit of a problem, and they think, "I'll just revert this deployment or push some code and everything's going to be okay." And then what you find is that two hours later there are other problems going on, you've affected thousands more customers than you needed to, and it was all because one person was afraid to wake everybody up. At Atlassian, and really at the largest and most effective organizations when it comes to this, we don't hesitate to escalate. If you even think there's a sliver of a possibility that this could be something major, wake everybody up and bring the right people in immediately. And if it's a false alarm, no one's gonna be mad at you.

Next up: roles and individual authority. This is another really big thing for us at Atlassian. We have an incident commander for every incident who has that high-level authority over things. He or she is the go-to person for the incident, and it really helps to ensure you've got the right level of organization and effectiveness in your response, especially when so many people are involved. Also consider whether it makes sense for you to have a senior technical responder, someone to make the call on which deployments to
investigate or what type of code to push. It'll especially help with allowing less experienced team members to get involved in the incident response process, or take on some of that on-call responsibility, without you being worried that they're going to make the wrong decision and waste a few hours or impact more customers than necessary.

And finally, consider your communications cadence when you're writing this up. Think about how often you want your comms to be sent out: is it every hour, every two hours, every three hours? It really makes a difference at the end of your update to say something as simple as, "We don't have anything to share at the moment, but we'll post our next comms in an hour." When you're a customer who's really hanging on an enterprise service coming back up to health, that makes a world of difference. So, thinking back to those points I shared on communication, it's an opportunity to build trust, but it's also an opportunity to just communicate effectively with your customers and help them get on with things.

Now, to circle back to roles really quickly, I have a quick video to play from one of our lead principal architects at Atlassian. I want to drive this point home, so I asked him to share a few words with you guys. Let's play that video now.

One of the most notable things about an incident at Atlassian is something you notice during the incident, and that's the clear presence of the incident manager, or incident commander. The commander has clear authority. Everyone knows what that role is, and everyone respects the person who's in that role. It makes it instantly clear who's responsible for making decisions. You don't get people making contradictory decisions, and it's a single authority for the status of the system. Therefore it's very clear whether the problem has been resolved or not, and stakeholders can trust that when they've been told what the status of this incident is, they're actually being told what's really
happening. The incident commander is always the calm at the center of the storm.

So that was Matt Quayle, one of our lead principal architects here at Atlassian. He's been with us for over eleven years, and he's someone who has no doubt seen his fair share of really major incidents, someone who knows his stuff.

So now that you've got a game plan in place, how do you make sure that it all gets put into practice and actually becomes valuable for you? What you need to do here is share your game plan. Propagate it throughout your organization, make sure all responders have read and understood it, and ask them for feedback. Put it in Confluence so that it's available to everyone where your team works, and they can have a quick look over it and pick up on anything that maybe doesn't make so much sense to them.

Next up: you've got this game plan written up and it's been reviewed, so what you need to do is actually run a simulation drill. Hands up, has anyone taken part in an incident simulation or a war game before? One or two of you, okay, cool. Those of you in software will know that nothing's truly verified until you've tested it, and the same goes for your game plan. So what you want to do is have someone replicate as close to a real production environment as possible, or whatever would make most sense for your business, and actually create an incident there. Run through it as if it were the real thing.

It's gonna give you two major advantages. The first is that you'll be able to shore up those quick wins in your game plan, things that made sense on paper but didn't work as well in reality, and you'll be able to do that before the big incident comes up. The second big thing this will give you: I used to play football when I was a kid, and you wouldn't go into a grand final without having kicked a ball for three months, and you shouldn't do that here either. So it gives you that speed and muscle memory, that automatic ability to just work through the real thing, because you've done
it so many times before and you're ready to go. So make sure you set a cadence for this sort of dry run. Something like quarterly would work well. It'll help you stay fresh to manage those incidents, and it can be pretty fun as well. Final point on this: make sure you're involving less experienced teammates. It definitely helps them take on that on-call responsibility. You can trust that they've been through something similar with the process before, and they'll be ready to go when it comes to working on production.

Last quick step for the game plan: once you've gone through a few incidents using these criteria, just ask yourself, do we still follow this process? Do we do things differently, because two weeks ago we had an incident and figured out we could get through things two times faster if we went about it in a slightly different way? Document those things. Update the game plan for posterity, so that new individuals who come into your organization have something that's up to date. You don't want this to go stale after all of your hard work. So, some valuable nuggets there that you can take away for building a game plan.

Now let's move on to post-mortems. We spoke a bit about this last year at a few of our track talks and keynotes, but this year one key difference is that we've built the core capability for post-mortems into our Opsgenie product, so that you can go straight from alert to incident, and then pull through all that key information automatically inside a UI that is very similar to what you'd be familiar with in other Atlassian products. So we're going to walk through that really quickly, and we're also going to cover the best practices, what you need to be thinking about when you're writing a post-mortem.

So let's move over to Opsgenie. I'm here on an incident page, the incident for my mobile application, and you'll see that the problem's been resolved. I've got my incident timeline on the
right-hand side that chronicles all of the key events, and now I've got this button that's available for me to create a post-mortem, because the incident's been finished up. So I'm going to click through, and that'll take me over to the post-mortem template that we've created for everybody. You can change this and configure it however you like, but it's quite a good framework to start working through the problem and ask yourself: okay, what happened in the lead-up? What are the preventative actions that we could have taken here? What was the root cause of this incident? What lessons did we learn?

On the right-hand side you'll see that, because it's linked into Opsgenie, we can automatically populate all of that key information without you having to arduously go through and do it yourself. So no mental maths on the incident that started at 9 and finished at 11:53. How long did it go on for? Two hours and a bit. We're just going to calculate it for you, and pull through who the incident commander was, the severity, all those key bits of information that you want to be keeping track of but don't want to do the work of gathering. You've also got the opportunity here to add attachments and add any related incidents. Maybe a database went down in conjunction with your mobile app and two teams were working on that separately; you can bring it all under the same post-mortem. And of course you can add those key preventative actions, those follow-up tasks, integrated directly into Jira.

So now we've filled one of these out and put in all the work, it should look something like this. This is actually a real Atlassian post-mortem that we've changed sensitive information on, and what you can see is that there's been quite a bit of work put into, one, providing a quick summary of the incident, and two, going through and tidying up that incident timeline manually, potentially adding any events that weren't automatically picked up, changing timestamps,
obviously putting in all of the answers for detection, lead-up, etc. And also, at the bottom here, I'm gonna add some attachments and throw in some follow-up tasks in Jira, because I want to make sure that all of the learnings, all of the hard work, is actually going to make it into our production environment. So the follow-up tasks are linked right at the bottom there to my Jira instance, and I can follow the status of those directly from the post-mortem. This is all functionality that is available as of today inside the Opsgenie product.

So what do you need to remember when you're writing one of these post-mortems, when you're running a post-mortem process in your organization? Well, first off, you want to avoid blame and keep it constructive. I've given it away in the key takeaway itself. In the teams that aren't as strong at this, that have kind of a blame culture when it comes to their post-mortems, individuals are afraid to speak up. They don't want to say, "Oh hey, I actually deleted that production database. I didn't have the right credentials, I didn't see the right message, and I made that mistake." They're not gonna put their hand up and tell you that, which means that you're not going to be able to make the key change to that environment, potentially a password or an extra layer of security, to make sure it doesn't happen again. So keep your post-mortems blameless, and make sure everybody knows that you've got a culture of trust and that there's no problem with saying that you did something that was well-intentioned but didn't work out as you expected.

Next up: collaborate and share knowledge. The post-mortem might be written up by one person, but don't think that it's not a collaborative exercise. At Atlassian we kick post-mortems off with a post-mortem meeting, where everybody involved in the incident can take part, share what they learned, and share anything that they think would lead to a more effective production environment or incident
response process next time.

Give your post-mortem a workflow. Ask yourself: does every incident have to require a full post-mortem review, potentially with that extra layer where the CTO has to look over things? Or maybe minor incidents don't require one at all. At Atlassian, the most severe incidents require two reviews from two individuals, plus the person writing it up. So think about what level of detail you want to go into here, and then maybe even document that in your game plan.

Then, of course, preventative actions. You saw how easy it was to create them inside the Opsgenie product and link them straight through to Jira. But I see a lot of teams put all that hard work into writing the post-mortem, and then when it comes to thinking about, okay, what do we need to change, they gloss over that part. They see the same incident reoccurring, and it takes a lot longer to deal with those incidents as they recur. So make sure you think about the preventative actions. Don't forgo it. Put those into Jira, link them up in Opsgenie, and then follow them through. Make sure they get into production.

Now finally, ask for feedback. Are there any people managers in this room, anyone overseeing any development work? A few people. What you want to be doing is checking in with the people writing these post-mortems and doing the work, to make sure you're hitting the right level of time and detail that's required to get the learnings out of a post-mortem, but not going any further than that. We all know there are a bunch of competing priorities in a fast-moving software organization or IT team. There's a new feature that needs to be shipped, or some work that needs to be done, some bugs to fix, and if you're working on something that occurred last week and doing a retrospective review, you may feel like that's not the most productive thing to be doing. So just make sure that you're checking in, and that when people are doing the work on
that post-mortem, they do just enough to get the key learnings out, and don't have to do any more than that.

So those are the four key takeaways. I also got the chance to speak with some really great individuals at Amazon Web Services and talk to them a bit about how they do things. That gave me the opportunity to refine some of the slides that I just shared, but there were some key nuggets that didn't really fit and that I felt were too good not to share. So I'm gonna share a few of these, and I hope some of them resonate with you.

The first thing Amazon does is have a validation checklist before a service even goes to production. Does it meet their predefined requirements for monitoring, reliability, and security? If it doesn't, then it doesn't go into production. It needs to hit that minimum benchmark. Think about this for yourself, before any incident even takes place: what's the minimum benchmark for you? What's the level of reliability that you want to be hitting? What sort of security guarantees do you want to be making to your customers? That should be baked in here.

Next up, Amazon of course runs with a DevOps methodology. We do the same at Atlassian, and that means that teams own the services that they run. But what do you do when you get that cross-service incident that impacts maybe two, three, four, or more teams? You need someone at a higher level who can coordinate that response between teams and individuals, and for that Amazon has a dedicated incident management team, and it makes things so much easier. Potentially IT ops could take this role in your organization, but think about whether that makes sense for you, especially if you're moving to DevOps. You don't want teams to be interacting with each other during something that's so serious and having miscommunication occur because there isn't someone at a higher level who is the sole authority.

And the final point, and I had to harp on it: Amazon does it, Google does
it, at Atlassian we do it. It's the most important thing that you can do to resolve your incidents faster: run post-mortems for all major incidents to prevent reoccurrence. It's going to make a huge difference for you, and I'd encourage you to try it if you haven't yet.
Now I wanted to make a quick point, because I'm a bit of a technologist and I love to focus on the tech side of things, and I had so much fun sharing these demos today. But you won't be able to achieve any of this process improvement, any of this performance improvement, without your teams. So really, you want to achieve high performance, that is the goal, but you're not going to be able to do it without putting your team first. You may have heard of Atlassian's belief that behind every great achievement is a team, and we definitely go about this. I think the way that you should look at doing it is to maybe even check out the Health Monitor. Run a quick 15-minute walkthrough at your next team meeting. It doesn't take too long, and it'll show you where your team is moving on key metrics like team health, leadership and customer centricity. We've got one specifically for service and support teams, and we have one for incident management teams as well. This is something that doesn't tie in directly to the actual process improvements that you make in your incident response, but if you've got a happy team, if everyone's working well together, if they're functional and you invest that time ahead of time, you'll find that implementing process improvements becomes an order of magnitude easier.
So let's flick back to those four key takeaways, go over them really quickly and wrap things up. You want to break down your team silos. Think about the products that you're using and see if you can integrate them or couple them closer together so that information flows better between your teams. We installed a few integrations, and they only took a few moments to set up. You saw the
one for JIRA Service Desk and Statuspage, and I'd encourage you to have a quick look over the Atlassian Marketplace or the app stores for any tools that you use and see what's there for you.
Next up, communication, early and often. Don't be afraid to get those first comms out straight away, and write them ahead of time so that you don't have to worry about wording. You can get that approval as to whether you've spoken about it in the right tone and manner, and then when that real incident occurs you just click the button, go through the steps, the comms are out and you can focus on resolving the problem. Make sure you're integrating as well. Centralize your comms through a single tool that's cloud-based and hosted externally, so that you don't have to send 20 updates during the course of an incident because you're sending four and they have to go to five different places separately.
Thirdly: set, share, run and review your game plan. There were only a few hands in the audience for actually having one of these in your organization, so I hope to see more hands next year. I'd also love for you to share it amongst your team, run a war game for the first time, and make those review changes, potentially on a quarterly cadence, to make sure this stays fresh.
And finally, blameless and collaborative post-mortems. We think it's the best way to resolve your incidents faster, and going about it in the right way, where you create a culture of trust in your team to own up to what actually happened, is going to make a world of difference for you.
So that wraps things up. It's been a pleasure to speak to you all today, and I hope I've given you some key things that are really going to make a difference for you. You can go back to your teams around the world and share some of this information, so I hope now that your team will thrive in an always-on world. I'd love to speak to some of you afterwards and take some questions, so feel free to come up for a chat. That's my presentation today, and yeah, thanks so much for listening in. [Applause]
Info
Channel: Atlassian
Views: 8,838
Rating: 4.8677688 out of 5
Keywords: Atlassian, Atlassian Summit 2019
Id: yFqZuGwdDXs
Length: 34min 7sec (2047 seconds)
Published: Tue Apr 23 2019