DevOps and the Art of Release Engineering

Video Statistics and Information

Captions

So this is "Navigating the Software Delivery Minefield: DevOps and the Art of Release Engineering," and we're gonna kind of talk about what that means, and I'm glad you're all here to join me for this today.

So this is the obligatory "about me" slide. I'm not going to go through it except for a couple of points. I'm @jpaulreed on Twitter, so if I say something confusing or you want to continue the conversation on Twitter, you can find me there. The only other thing I'll point out about this slide is that I'm a Master of Science candidate in Human Factors and System Safety, and I point that out because you're gonna hear "human factors" and "system safety" probably in this talk, and that's kind of where it comes from: the last eighty years of research in the safety sciences on this particular topic.

So I wanted to start with a little survey for everyone in the room. Who here has release engineers or a release engineering team in their organization? Oh good, okay, lots of hands going up. People actually know those people, like, they can find them and talk to them? Okay, good, about half. Yeah, all right. And who here is, and I say this kind of tongue-in-cheek, who here is "doing the DevOps," and yet, you know, you think that you're making progress, you've got a continuous delivery pipeline, you've got the culture change going on, and yet deploying software is still hard? Like, it's still difficult; it's not quite as easy as Netflix and Amazon make it sound. Who is in that boat? I see one hand. Yeah, okay, good, there we go. Cool. Right, so I think you're in the right room for this.

All right, so I want to start with a little story. This is a very old picture of a very young me when I was a release engineer way back in the day, and it was funny because I was trying to explain to my manager at the time, who was a VP of engineering, sort of the way that I thought about things and the way that I thought release engineering should be practiced, right? And so I basically said, you know, we're kind of like air
traffic controllers, right? We sit at a certain place in the system, we kind of direct commits and releases, and we have multiple different customers, right? I even sort of said, you know, here's like a little game, right: "I'm a build engineer." And my boss's reaction at the time was something like this, right, where he's like, "What are you talking about?" He looked at me like I was crazy. Now, the one thing I would point out is that, interestingly, most of my classmates, or many of them, in the human factors and system safety course are pilots and air traffic controllers, so I actually feel a little bit vindicated that I made that analogy.

So the two main things we're gonna be talking about today: do we have a release engineering problem, and ways to sort of notice that, or yellow flags. Even if we're doing the DevOps, even if we feel like we're successful, and even if by a lot of metrics we are successful, is there something about release engineering that is holding us back? The people that raised their hand about "yeah, releases are still hard," right? And then the other thing is sort of the art of release engineering, as I called it, which is really about human factors.

Who's heard of continuous delivery? Yeah, most of the hands should be going up, which is good. So there's an interesting question about, like, what is continuous delivery? What's the definition of continuous delivery, right? So, a couple of interesting definitions that actually focus on different things. "Continuous delivery means minimizing lead time from idea to production and then feeding back to idea again"; this is a ThoughtWorks consultant. Of course people know Jez Humble; Humble and Dave Farley wrote the book: "a set of principles and practices to reduce the cost, time, and risk of delivering changes." My definition is actually a little different: continuous delivery is your organization, your entire organization, caring about release engineering and QA in a way it
has never cared about them before, right? So a lot of times I see, when people try to go do continuous delivery, release engineering and quality are kind of the non-functional requirements. We don't really sell release engineering to our customers, and we don't really directly sell quality. We sort of do, but a lot of times these have been chronically underinvested in within organizations for a really long time. So when we go to do continuous delivery, we actually have to start caring about them, right, or we fall into that trap of "we'll just ship bad stuff more quickly," right, or more continuously, as the case may be.

Now, one thing, and this came up actually in a number of conversations here at DOES: release management is not release engineering. So do we have any release managers in the room? Okay, a few hands go up. Yeah, I am NOT talking about release management. Release management is super important, but I'm actually talking about release engineering: the scripts and the tooling and all of that, which gets written by engineers and supported by engineers. A lot of times the artifact they produce really is the continuous delivery pipeline.

So what I want to go through is actually some release engineering smells, if you will. In your environment, if you run into these, it may be a sign of some of the troubles, and these are the troubles that I see in a lot of organizations that people have when they do continuous delivery, when they do, quote unquote, "the DevOps."

Releng smell number one: modern version control chaos. What do I mean by this? Well, I mean Git and GitHub. People have this, right? So who here uses Git, right? All the hands should go up, right, because that's the new hotness. Actually, it's not that new anymore. But what's interesting is, has anybody ever looked at their commit history and they see this? Like, that is our branch structure now with Git, right? All of these weird lines. This is another one that I actually find particularly
interesting. Like, I don't know what's going on there, but, and I'll just say it, this was on Reddit, and the subreddit was called "shittyprogramming," which I thought was kind of funny.

Okay, but what's interesting about Git, right, is that one of the things that they kind of tell you is, well, okay, you should have smaller, more localized repositories. Well, from the standpoint of delivering a product that we care about, this means, like, an explosion of repositories that we all have to deal with now, right? And we have to find ways to sort of stitch this together, stitch all of these different repositories that now make up our product together, and this can cause real problems in a continuous integration and then, of course, continuous delivery environment, where we have to basically put together all the parts of our project in some meaningful way. If you're struggling with that, you're not alone. I see a lot of that, and that's sort of a major problem that we see today in terms of release engineering. And again, these are all kind of things that, for release engineers, are their bread and butter; they can help with this stuff.

Release engineering smell number two: can someone just tell me what is running in production? This used to be easy, right? We would do the little About dialog and we'd go, "Oh, I have version whatever," right? I stole this: Jez Humble has a continuous delivery gauntlet; this is a configuration management gauntlet, if you will. So what I want you to do is raise your hand and keep it raised, but if I say something that you don't do, then put your hand down, and we'll see how many hands are up at the end. So raise your hand if, for the app that you're responsible for, you can tell me what artifacts are running in production right now. Go ahead and raise your hand if you can do that, right? Using a unique, human-parsable identifier, so that's something like a version number or something you could tell me. Good. That is unique to the entire company. Okay. And it's traceable
back to the commit and action that generated those artifacts, in a way that I can identify the exact person who wrote any arbitrary line of code in the product, including the open-source components that you use. Okay, I see two hands, which is good, that's great. Now, what's interesting, by the way, is that when Jez Humble does that survey, his continuous delivery version of it, he gets pretty much the same results, right?

And this is the software artifact, so we haven't really even talked about operations yet. If we were to look at packages and maybe our environments, right: who wrote the line of the Chef cookbook that we're using, or the Puppet manifest, or whatever it might be, right? So this is really an issue that we see, where the traceability becomes really hard, and part of that is due to the stitching together of repos that we kind of have to do because of the way source control works now.

Release engineering smell number three: reproducibility is a problem, still. So people might say, "Containers will save us," right? Who's run into this problem with containers? There's a great post by Julian Dunn, who used to work at Chef, and he would talk about the WAR file, right? And he would say it's the enterprise WAR file that we would all mail to each other, and people would, like, unzip the WAR file and put their stuff in and zip it back up and send it on down the line, right? And who knows where all this stuff came from, right? And he was saying containers, if we do it wrong, are sort of this same problem, right? People dump stuff in a container, and then they compose it, and then it goes out to production, right? And what's interesting is that you see we've pushed that reproducibility into the container space, but we haven't really necessarily solved it, right? How do we reproduce that entire container from the very bottom, right? And I'll come back to one of the other kind of problems, in the services that we rely on, where
this really bites us. Another example: who remembers npm-gate, where Node unpublished a module and everybody was using it, and it broke the internet, right? Part of the problem with reproducibility is we're in such a connected world. We use so many services, like npm, like Docker Hub, like even public package repositories for things like, you know, Linux packages, that a lot of times the reproducibility is hard because we don't actually own that infrastructure. We're not mirroring it; we're just grabbing whatever's out there. And we might have a lock file, but we don't actually know what other people are doing, because that's not our infrastructure; those bits aren't actually within our walls, right?

This really comes down to a question. People have heard of the "software supply chain"? Who owns your software supply chain? If you ever are curious and need a reminder of the answer, you can go to whoownsmysoftwaresupplychain.com and you'll get the answer: it's you, right? So this is one of the things that, from a release engineering perspective, we actually have to care about. We need to start owning our software supply chain, and this comes up in all sorts of interesting cases in terms of problems. The npm one was an example where it sort of broke a lot of websites.

There's a great report from Sonatype on the state of the software supply chain, and that's actually where some of those earlier graphs came from, on the explosion of open source projects. They're looking at it from a security perspective, right? So Equifax and some of the really bigger hacks, they go back to, like, a Struts vulnerability, right, that was known for a really long time. There's an interesting story that they tell in here where they talk about a bank that had something like 40 different versions of Struts in production, and something like 38 of them were vulnerable to a security problem. Now, the thing that I like to say about that is: if you have one version of a particular dependency that has a
security problem, that's a security issue. If you have 38, that's a release engineering problem, right? So the point that I really want to make here is that DevOps, continuous delivery, Lean, the rest of the stuff that we talk about, do not directly or inherently address these problems. These are release engineering problems.

One of the interesting things: there was a great talk at PuppetConf this year by their VP of engineering, who has an operations background, and he said, you know, operations people in the mid-2000s started doing deployments; they became the de facto release engineers. Well, what we're finding is that a lot of people who are responsible for doing deployments, responsible for building that continuous delivery infrastructure, may not have had any release engineering training or experience. And it's not their fault. I mean, they were doing operations, and then somebody said, "Now you're doing deployments, congratulations." That was kind of the point of his talk: the operations people just suddenly took on this new role. But what we see a lot is that because they don't have that experience, they struggle a lot with coming up with things from first principles. Again, things that release engineers can really help you with, so it's really important.

Now, you may have seen in the title "the art of release engineering," and you might go, like, "Art? What does that mean?" And really I want to talk a little bit about human factors, and where human factors intersect the work of release engineering and operations, because this is a new topic that's coming up a lot in a lot of these discussions, and I think it's fascinating where you see these issues pop up. So, going back to the air traffic controller analogy that I made all those years ago: it's not just that we, you know, coordinate releases or that we build infrastructure, but we sit at a different place in the system, right? We have a different view by the nature of the fact that we are touching
different bits, and people care about, you know, the product that gets released. So it means that when we come up with a release engineering process, we end up doing a lot of things that turn out to be related to squishy human things, like how humans talk to each other about what's in production, or how humans debug a problem in production when there's an issue or an incident, right? How do they do that? How do they do that quicker? Those are all sort of human factors issues.

So you might be saying, "Okay, well, human factors, what do you mean by that, especially in this context?" Well, I was saying it's really about the operability of a released and deployed product, right? So I'll give you a silly example. Let's say you have your software and you package it into an RPM. So that's great, good, we're using packages; we're not pulling things directly from Git, which I see a lot: take it from Git and deploy it right out. You've got a package. When was that package built? That's your package name. When was it built? How old is it? If you saw it in production and did a package list, you actually can't tell. Now, if I ask the same question with that package name, can you tell me when it was built, just by inferring from the name?

Now, this was an example where I actually had an interesting conversation with an engineer, because we were talking about, you know, how package managers look at these numbers and they know it's an update if it's a higher number, right? So he's like, "Why don't we just use a number like 1234?" And I said, "Well, if you use a date, you have all this other information you can infer." And he was like, "Yeah, computers can compare numbers." I'm like, "Yeah, I know computers can compare numbers, but this is more human-parsable." And there's interesting research that asks, in an incident, what are the things that operations engineers look at? And if they can find out that a particular
package maybe is two weeks old, and they know that we were supposed to have done a deployment last week, that can shave off a lot of time, just by encoding a little bit of information in your package name.

So the way that I kind of like to state the point about automation is that release engineers don't just implement automation; they design automation for others in complex sociotechnical systems, right? You'll probably hear this term a lot more in the coming months and years: sociotechnical systems. It's important because the "socio" part is the people part: realizing how important the people part of the work is, in addition to the technology.

And I have a picture of a checklist here. Another way to sort of think about it is, if you look at aviation checklists (this actually is an aviation checklist), if you're talking about something like starting an engine, the checklists aren't designed to, like, push a button over here, and then push a button over there, and turn this knob, where you're moving all about the cockpit. They talk about flows in the cockpit, right? So the procedures are designed such that you can flow from one system into the other in a way that makes sense for whatever procedure you're doing, and it makes it easy for humans to go through that flow. That's actually the term: you "go through the flow," right? So I've seen a lot of organizations that have release checklists, but they don't have a good flow encoded into them, so they're actually really difficult for humans to execute. And when you have a flow, it's really easy for you to go from one item to the next and say, "Wait a minute, that thing that's next in the flow, or last in the flow, doesn't look quite right," and you can kind of reason about that, as opposed to being very distracted moving around. And so, if you have a release checklist, you might think about whether this is a problem: you know, is there a lot of high-touch at different points in the system?

Another example: versioning. I love, I love
version bikeshed conversations. So, which is easier for two people to discuss: "I have revision alpha-charlie-delta-3-bravo-9 checked out on my Git clone; which version do you have?" Or: "I have revision 2849 synced to my repository clone"? Which is easier? All right, here's another interesting question: which revision precedes these two, right? So I point this out because, if you follow me on Twitter, I will talk about it endlessly; it's one of my favorite things to talk about. But this is one of the things that makes this a lot harder than it needs to be, and there are reasons in Git why that is. But I want to point out that sometimes these tools actually can have an impact on our cognitive load when we're trying to have a conversation. And you see this come up, like, when two engineers are trying to debug something, and I've seen this repeatedly with Git, and they don't have the same versions checked out, and they didn't use the phonetic alphabet, so "E" and "3" got jumbled up when they were talking, or something like that, right, in their revision, and it takes a lot longer for them to debug things.

Another one: you might say, "Well, okay, how do you do versioning? Semantic versioning!" Who knows about semantic versioning? Yeah. So semantic versioning is basically a set of rules: you have three numbers, and the rules say the three numbers mean something, right? And so you should be able to communicate something about your package to other people by the version numbers, right? So this is from Twitter, and this is why this stuff is actually really hard: "Numbers: 'I warn you,' dot, 'this is shiny,' 'oops, my bad.'" But then somebody replied: "'Made life easier for me,' dot, 'made life easier for you,' dot, 'whitespace.'" And then, I love this, "question mark, question mark," right? "Yeah, but it's usually 'oh god no,' 'LOL,' dot, 'whatever,'" right? And so what's interesting is that, you know, especially if you look at release engineering, version numbers
are a hard problem. You can't just say "just do semantic versioning," because you're trying to communicate something, and that's the squishy, hard part about humans, right? We're trying to encode a bunch of information and boil it down to a version number.

By the way, I bring up this example because you'd think that this was a solved problem, but apparently not. The OpenSSL release, and that's a thing, right, security issues with OpenSSL: the release was broken, so instead of bumping the version number, they just refreshed the tarballs and had the same version number with different bits. Please do not ever, ever do that. But again, versioning is really a communications mechanism. That's why so many organizations have different patterns, and, you know, semantic versioning may not be the answer for you.

A couple of other human factors yellow flags. So, opaque stages of the continuous delivery pipeline. If you use a tool like ElectricFlow and you go look at the dashboard and there are stages of the pipeline that are opaque for some reason, oftentimes that can mean that something's going on there that is really a human factors issue. Maybe people are actually needing to talk to each other, and that's not really encoded in the pipeline in a way, and maybe it should be, right? You know, there's maybe a little smoke there.

I've seen this, actually, a surprising number of times: direct subversion of the continuous delivery pipeline. So I worked with a client, and it turned out that at the end of the pipeline (so you're thinking, okay, commit, build, buh-buh-buh, CI, testing, deployment), one of the ops engineers goes pulling the binaries out of the pipeline, messing with them, and shoving them back into the pipeline. Now, I'm not saying that that engineer is bad, necessarily, but why is that engineer doing that, right? Is there some governance process such that, to literally do
their job, that's what they have to do, right? So again, when we go back to this discussion of sociotechnical systems and release engineering, we have to look at a lot of these issues, and this is where you see those yellow flags, right? People doing things where you're like, "Wait, why are you doing that?" And it may not necessarily be, you know, entirely bad on its face.

Okay, so this is our human factors sort of elephant: this idea of "automate all the things." Who here carries that mantra in their organization, "automate all the things"? I see a few hands going up, good. So this is good, but there's a lot, I think, to learn from aviation here about automation and their experience with automation in the cockpit. So I'm gonna go through a few accidents that I hope give you some food for thought when you're designing automation. And the point that I want to make here is that when we are designing automation, we are designing automation for humans to use. These are things that we should be thinking about, and these were costly lessons that they learned, and so maybe we might want to take a second look at how we automate things and how we stitch those steps together in a continuous delivery pipeline.

So, Air Inter Flight 148. They were descending on a very stormy night, and they had a problem. So these are two different displays. It's the same display in the cockpit, but you'll notice one is negative 3.3, you see that little decimal point, and the other one is negative 33, and those are different modes. So one says, "I want a 3.3-degree angle down to descend," and the other one says, "I want a vertical speed of negative 3,300 feet." If you're curious where that is in the cockpit, I circled it, and if you're having trouble finding it, there's a big arrow to point you to it. But so the problem was they thought they were in angle mode and they were in feet mode. And it was a stormy night, they were on an approach that they
weren't familiar with and had not trained on before, and because of that difference they flew the plane into a mountain. And so the point that I want to make here is about the instrumentation around the automation. This is called mode confusion, and we see that a lot. But a lot of times, the automation... Have you ever had a script that's doing something, and you run it, and it normally takes ten seconds, and it's taking, like, thirty seconds, and you're like, "Huh, what's that doing?" Right, it's one of these issues.

Air France Flight 447. People probably know this; this is a more recent one: the safety paradox of automated cockpits, right? There are a lot of factors in this particular accident, but the kind of short version of the story is: their autopilot had some sensors, the sensors malfunctioned because of weather, and then the autopilot disconnected, and the pilots were giving different inputs. And because they were so stressed trying to debug that situation, they didn't notice they were giving different inputs, and it turns out that that aircraft doesn't pick one or the other; it averages them, so they cancel each other out, which is interesting. And then it also brought up sort of training issues: are we relying too much on automation? So this is really interesting, because it was a situation that sort of caused a bit of a wake-up in the industry around how they use automation: are we relying too much on automation? Can we fly this plane by ourselves?

And then the final one that I want to talk about is Air France Flight 296, which is often called the Habsheim incident. And so I'm going to show a video here, and even if you can't see the video, one of the reasons that I wanted to show this is that the tone in the narrator's voice, I think, is very interesting. So, hopefully...

[Video narration:] Aircraft manufacturers have tried to fix that problem by designing the pilot out of the cockpit. This is the first
fully automated plane flown by a computer. [End of video clip.]

So a couple things I'll point out. I don't know if at the end you heard that kind of squealing noise. I always thought it was music, like suspenseful music. It's not; those are the engines trying to spool up and ingesting branches and leaves and then stopping. The other thing, though, that I'll point out is that the pilots lived through this accident, which was interesting, because they were able to actually interview them and debrief them in a way that, in a lot of these accidents, they can't. What they found out is: in previous eras of aircraft, you would probably never fly that low to the ground at almost stall speed. That pilot did that because Airbus told him, "You cannot stall this airplane." So he said, "Okay, well, the computer will never let me do anything that would stall this aircraft, so I'm gonna go do it," and that's what happened. So the point that I want to make, that I think is interesting here, is that when we reason about automation, a lot of times people think, "Oh, it's automated, so problem solved," right? The last talk had a really good point about, you know, there can be bugs in the automation as well, right? So we need to think about that. Just because we've automated it doesn't necessarily mean that we've solved all the problems.

This is a great video. I'm gonna play about a 40-second clip from it. The video is called "Children of the Magenta," and, okay.

[Video clip:] Automation dependency. As I mentioned earlier today, the reason this segment is in this course is because, as we look at this accident history, what we find is that in 68% of these accidents, automation dependency plays a significant part in leading these crews to either a critical flight attitude or the requirement to extract max performance from their planes. Automation-dependent pilots allowed their airplanes to get much closer to the edge of the envelope than they should have. As we start to study this issue, we've decided to take a new tack on this, and so what you're going to
hear a lot more at American Airlines is the discussion of what we're going to call levels of automation and technology judgments. [End of video clip.]

So I think that's really important, because I don't want the takeaway to be "don't automate things" or that automation isn't important. It's that we need to think deliberately about automation and levels of automation, and, I think, technology judgment. I really like the way that this instructor put this. When we're designing these systems, right, it goes back to the human factor and the human element of the operators in the system. One thing I'll point out: did anybody catch the date of this video at the beginning? It's '97. This is 20 years ago. You can find this video on YouTube. Some of the terminology they use, if you look at the human factors and system safety area: there are references to situational awareness, and if you mention that, people might yell at you, which is kind of funny; we can talk about that over lunch. But for being 20 years old, it is remarkably relevant to us today.

So if you find this stuff interesting, or you're more curious about it, there are a couple of presentations here at DOES that you should definitely go see. These are experts in this field. Dr.
Sidney Dekker is here; he's gonna be speaking later today on a lot of this stuff. And then John Allspaw, who's done a tremendous amount of work in this area, will be speaking tomorrow on it. So if you find this stuff interesting, check both of those out.

So I want to close with just a couple of takeaways to go back and think about, potentially over lunch. Software delivery is a focal point of our software systems. That's where all of this stuff really comes together, and because that's the part that is really delivering the value, there's a lot of meat there for us to look at, right? Even if you have a continuous delivery pipeline, if it's painful, there's a system that's telling you a lot there, and so it's good to really dig into that and find out why. And also, that continuous delivery pipeline is not just for computers; it's actually for humans as well.

DevOps is not a substitute for release engineering, any more than DevOps is a substitute for security engineering, right? So if you're like, "Well, we're doing DevOps, we don't need security or QA or release engineering," you might want to rethink that a little bit, especially if you're still finding you have hurdles and things.

And finally, probably the most important: your release engineering practices must be informed by human factors. So if someone tells you that release engineering doesn't matter, you can have them come talk to me. That's all I got. Enjoy lunch.
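
The package-naming point from the talk (put the build date in the package version so that both the package manager and a human can read it) can be sketched roughly as follows. This is only an illustration, not the speaker's tooling: the version layout base.YYYYMMDD.build and the function names build_version and build_age_days are assumptions made up for this example.

```python
from datetime import date, datetime


def build_version(base: str, build_date: date, build_number: int) -> str:
    """Compose a version string with the build date embedded in it.

    Hypothetical layout: <base>.<YYYYMMDD>.<build_number>. The date field
    still sorts numerically, so a newer build compares as a higher number.
    """
    return f"{base}.{build_date:%Y%m%d}.{build_number}"


def build_age_days(version: str, today: date) -> int:
    """Infer how old a build is from nothing but its version string."""
    stamp = version.split(".")[-2]  # the YYYYMMDD field in the assumed layout
    built = datetime.strptime(stamp, "%Y%m%d").date()
    return (today - built).days


version = build_version("1.4", date(2017, 11, 16), 7)
age = build_age_days(version, today=date(2017, 11, 30))
print(version, "is", age, "days old")
```

With a scheme like this, someone listing packages during an incident can see that a build is two weeks old without looking anything up, which is the kind of time saving the talk describes.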

Info

Channel: IT Revolution

Views: 1,772

Rating: 4.8518519 out of 5

Keywords: information technology, DevOps, software, software delivery, enterprise, executive, what is devops, continuous delivery, Agile, IT, development, DevOps Enterprise Summit, DOES17, devops training, devops tutorial, Lean, release engineering, operations, does17 us, IT Revolution, does17 san francisco

Id: lYl1dJvzW5E

Length: 31min 44sec (1904 seconds)

Published: Thu Nov 30 2017