Failure & Change: Principles of Reliable Systems • Mark Hibberd • YOW! 2018

Captions
[Applause] Cool, thanks. I hope everyone's been enjoying the conference so far. I'm going to spend a bit of time exploring reliable systems today, something I really enjoy talking about and working on, and the reason for that is simple: it matters. It actually matters to people that software works, which is sometimes surprising. So can I get a show of hands: if the software you work on stopped working right now, would somebody have a bad day, or not be able to do their job, or just get frustrated? Okay, most people. The thing about most software is that it does actually affect somebody. It doesn't matter whether it's a big piece of software or a little one, whether it's for millions of users or a couple of users, it tends to matter. I think that's tremendously motivating, because it lets us talk about architecture, techniques and reliability in a context of outcomes. It actually matters that our software works. We're not talking about techniques for techniques' sake, or techniques because somebody else is doing them; it gives us the context to make trade-offs and have reasonable conversations about what we want to do.

So let's start by talking about reliability. The most common definition is this idea of consistently performing well. When we talk about an extreme case like airline travel, this is a pretty obvious property to understand: performing reliably, performing consistently well, would be getting us safely to our destination. So as a quick test, let's have a discussion about a real-world trade-off. We want to design a new jet, and we have a choice: it can have four engines or it can have two. These engines have gone through lots of trials, lots of testing, lots of certification, so they're really reliable engines, and our plane only needs one engine to maintain altitude and have enough thrust to keep going. Can I get a show of hands for people who think we should put four engines on our plane? People who think we should put two? Pretty surprising. Did I hear only one? Living on the edge.

Somewhat counter-intuitively, or maybe not, there are many reasons why two engines may be safer than four. These engines are so reliable that the chance of having two failures on the one flight is so small that it's less of a concern than having one failure, and by doubling the number of engines from two to four we've doubled our chance of having a single failure. That single failure can have knock-on consequences: an engine failure could damage a wing, or, because four engines have to be mounted closer to the cabin, it could damage the cabin. So the consequences of a failure are far more significant than the chance of having multiple failures. Failure and reliability are complicated topics, and we're dealing with these trade-offs all of the time.

So, some early takeaways. We can avoid failure through more reliable parts, but that's not the whole answer, because reliable parts can be hard and expensive, and software rarely has the weight of testing and certification that an airline engine does. We need to be resilient to failure by controlling the scope and consequences of our failures, and we need to be able to make these trade-offs all the time. Let's continue this idea by having a look at the statistics of failure. This is an exercise I find really useful to work through with people.
Sometimes, as software practitioners, our statistics are a bit rusty, or in building software we lose touch with our intuition for numbers and their relationship to the reliability of our software. Say I have a service, any service, it doesn't matter what it is, and I arbitrarily say it has a 10% chance of failing today. I really don't want to be interrupted while I'm giving my talk, so let's deploy some replicas and get some redundancy: I'm going to deploy ten of these services. If the probability of one failure is 10%, then the probability of the whole system failing is 0.1 to the power of ten. To get our availability we subtract that from one, and we end up with ten nines. Better than Google and Amazon, just by deploying ten. Awesome. Winning.

But we have to ask the question: are these failures really independent of each other? Because that's what our statistics are assuming, that each of these failures is independent. Let's have a look at what happens if they're not. The probability of failure being 10% means the probability of success, of not burning to the ground, is 90%, and the probability of all ten succeeding is 0.9 to the power of ten. That means the chance of having at least one failure when we deploy ten is actually about 65%, which is quite significant. And if our services have some sort of mutually assured destruction pact, where if one burns they all burn, then the probability of at least one fire is 65%, and if one goes they all go, so the probability of our system failing has actually risen from 10% to 65% by deploying ten. Without independence, redundancy is a liability. We need to maintain this property of independence through our whole software development life cycle. We have to understand that service redundancy means embracing more chances of failure, but it gives us an opportunity to handle those failures in different ways, and that opportunity is entirely predicated on independence.
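To make that arithmetic easy to check, here is a small Python sketch of the same numbers: the 10% failure chance and ten replicas from the example, with the independent and the fully correlated cases side by side. The code is an illustration added here, not something from the talk.

```python
# The arithmetic from the example: a service with a 10% chance of failing
# today, deployed as 10 replicas.
p_fail = 0.10
replicas = 10

# If failures were truly independent, the system only fails when every
# replica fails at once: ten nines of availability.
p_all_fail = p_fail ** replicas                    # 1e-10
availability_if_independent = 1 - p_all_fail       # 0.9999999999

# But the chance that *at least one* replica fails today is large, and if
# one failure takes the rest down with it, this is the system failure rate.
p_at_least_one = 1 - (1 - p_fail) ** replicas      # ~0.65

print(f"P(all {replicas} fail, independent) = {p_all_fail:.0e}")
print(f"availability if independent         = {availability_if_independent:.10f}")
print(f"P(at least one failure)             = {p_at_least_one:.2f}")
```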
So what about more complex systems? Again, I'm going to keep saying this word: independence. In the worst case, if we have to have all three of these services available for our system to be up, then the availability is multiplicative, so things get pretty bad pretty quickly. If we're able to control the scope and consequences of failure, say by gracefully degrading and just losing one function instead of having the whole system down, then things get better. We need to control the relationships between the components or services in our system in order to control the scope of these failures, and to be able to trade a little failure for not having the whole system go down.

When I've talked to people about this before, the first question I normally get asked is: isn't this a big argument for having a monolith? Well, no, it's not that simple. Yes, the challenge of having many network calls is real, but just because we don't have network calls doesn't mean we don't have to worry about independence. For example, if one of the functions here is a search feature that uses lots of memory, and it runs out of memory and brings the whole service down, it doesn't matter that it's not making a network call: it's still going to bring our service down. By putting it all together with our other features we've lost control; we lose all of our functionality and have no way to recover. Service granularity gives us an opportunity to trade the likelihood of a failure for the consequences of a failure.

So we've explored a bunch of trade-offs. We need to be aware of redundancy, granularity and architecture, all at the same time, to be able to make our software more reliable. None of these is a free lunch; they're all trade-offs and we have to work at them. So let's take these lessons into actually building a system.

Although I'm not particularly good at it, I quite enjoy playing online chess, so that's created an interest in thinking about how we would construct a reliable online chess service. It's an interesting problem because it isn't too big, and hopefully isn't too hard to follow, but it offers a lot of the challenges you see in systems every day. So what makes up an online chess service? There are probably five key functions. First we have pairing: if two people jump online we need to be able to pair them up. This isn't just two arbitrary people; there's a huge deviation in skill levels and in the length of games people want to play, so we want to pair people of similar skill who want the same game duration. We want to be able to play the games: enforce the rules, see who wins. We want to be able to look at historical games; it's always good to learn from your own mistakes, and even better to learn from other people's. We want to be able to analyse games: computers are quite good at chess, and we want a computer to show us what the best moves were, again helping us learn. And we want to play against those engines, maybe because we really like getting beaten, or maybe because there aren't any other players around, but playing against an engine is a common feature.

I don't want to get into the merits of monoliths versus services; there are deep impacts to be considered on all sides, some social, some technical. But it's worth exploring the trade-offs purely from a reliability perspective. If we look at our online chess service and all of its features, that's a lot of things to pack into one service, and there isn't actually that much relationship between them. Looking at the modes of failure we talked about earlier, the risk of interaction failures is removed, because there are no interactions between services, we just have one. But we've increased the consequences of failure: if anything goes bad, we bring the whole chess service down. Not great. We've also removed any independence of operational control; we're hostage to whatever the most demanding feature is. Maybe the chess engines require lots of memory and lots of CPU; from an operational perspective we now have to deploy lots of copies of everything in order to deal with that.

So an alternative approach is to break it down into services. How would we go about this? One of the most common mistakes I see when people start moving from a monolithic approach to a service-based approach is that they start grasping for nouns: player and game. More recently I heard Michael Nygard, whose keynote you were lucky enough to hear yesterday, term this the entity-service anti-pattern, which seems pretty apt. Regrettably it's such a common pattern that it's ingrained in pretty much every microservices tutorial, and it's in the reference architectures for web service frameworks; people leap at these entity services. The problem is that you end up with very deep communication structures: pretty much everything ends up having to talk to the game service. A lot of questions have to be asked: what is the state of this game, what are the current moves, am I waiting for a pair, am I waiting for a move?
And a lot of the information that the game service has to maintain isn't actually to do with games; it's to do with the clients of those games. The game state, whether I'm waiting for a pair, whether I'm waiting for a move: these states are really only relevant to the intermediate services. What happens when the game service has an issue? Pretty much every feature explodes, and explodes really quickly. It's not going to end well. Basically, we've managed to construct a system that has all the downsides of our monolith, plus extra network overhead and complexity. We need to be a little more deliberate about how our services talk to each other.

So, back to the drawing board. A good place to start rethinking services is their inputs: the state they hold and what they have to ask for. We've already seen that when services have to ask too much of their dependencies, things don't work very well. So instead of reaching for a game service, we can think about games as values that we pass between services. Conveniently, chess has a text notation called PGN, Portable Game Notation, that lets us represent games of chess. It's pretty handy: we can pass these around, and that gives us a different way of thinking about our services. Instead of starting with nouns and entity services, we can start with the different behaviours we need in our system: pairing players, playing games, analysing games, looking at the history of games, and having a game engine.

If we dive into the pairing service to start with, we don't even need any chess knowledge in it. Pairing is basically about agreeing on a shared identifier. Player one comes to the table and says: I want to play a game of a certain length against players of a certain strength. Another person comes along, and the only thing they need to agree on is their game ID. So the pairing service is simply about negotiating that shared identifier. A party that wants to be paired has to be able to communicate its constraints and how it can be notified. The actual inputs to this service might look something like: notify me at this URL, and here are my constraints. These constraints are a little bit chess-oriented, but they don't have to be. The pairing service just maintains an index of the waiting parties and then asynchronously pairs up people who are similar; it can wait a few seconds to see if it can find a better match, and then notify them with their game ID.
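As a rough illustration of what that pairing interaction could look like, here is a minimal Python sketch. The field names (notify_url, duration, strength), the 100-point rating band and the in-memory index are all assumptions made for the example, not the talk's actual API.

```python
import uuid
from dataclasses import dataclass

@dataclass
class PairRequest:
    """How to reach me, plus my constraints (field names are illustrative)."""
    notify_url: str   # where to send the agreed game identifier
    duration: str     # e.g. "5+0" blitz or "15+10" rapid
    strength: int     # the rating band I want to be matched in

class PairingService:
    """Keeps an index of waiting parties and pairs up similar ones.

    The service knows nothing about chess beyond these constraints; its
    only job is to negotiate a shared identifier, the game ID.
    """

    def __init__(self):
        self.waiting = []  # index of parties still waiting for a pair

    def request_pair(self, req):
        # Look for a compatible waiting party: same game duration and a
        # similar strength (within 100 points, an arbitrary choice here).
        for other in self.waiting:
            if other.duration == req.duration and abs(other.strength - req.strength) <= 100:
                self.waiting.remove(other)
                game_id = str(uuid.uuid4())
                # In a real system this would be an asynchronous callback to
                # each party's notify_url; here we just print it.
                print(f"notify {other.notify_url}: game {game_id}")
                print(f"notify {req.notify_url}: game {game_id}")
                return
        self.waiting.append(req)  # no match yet, join the index

# Two compatible players agree on nothing but a shared game ID.
svc = PairingService()
svc.request_pair(PairRequest("https://example.org/alice", "5+0", 1500))
svc.request_pair(PairRequest("https://example.org/bob", "5+0", 1450))
```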
This flows on into our playing service. The playing service is responsible only for in-flight games. Two people have just been paired up, they have a shared identifier, and they can go and say: I want to play this game, and it will be in an initial state. It doesn't matter where those identifiers come from; they don't even have to come from the same service, and there's definitely no direct relationship to the pairing service. We could have a second service that pairs human players with computer players, generating a different set of IDs, again just passing them off to the play service.

If we follow this logic through the rest of our services, we get to the history service. The history service is just a static database of games. It doesn't point back at the games that were played, and it doesn't reference the other services; it just holds values and hands out those values. When we finish a game, we pass the value of that game to the history service and it gets stored. We don't maintain a reference to the original game; it can get archived and removed. This is really important, because it means that if we're currently playing a game and that service goes down, we can still look at the history of our games. Further, if we want to analyse our games, the analysis service needs to know a little about chess to set up an analysis board, but it can just be passed the value of a game. It doesn't have to go back to the history service to interpret it; the history service doesn't have to be up in order for us to look at or analyse a game. More than that, if I want to build another feature off the back of this, say analysing an over-the-board game, a game state that was never actually played in our system, I can do that as well.

So the takeaway here is that we want to be very deliberate about how we break our system down: into independent responsibilities, not shared nouns. One very effective way to do that is to emphasise values and pass state between services. This goes against many people's intuition. I see people start with services like these and then try to factor out everything that looks like commonality: our playing service has games, our history service has games, our pairing service sets up games, and we end up back in the game-service world. We have to be more comfortable with the idea that these different services all know about games but their views of games are slightly different, and that's okay. By keeping them independent we're actually improving our overall reliability.

If architecture creates the opportunities for reliability, we then have to think about how we're going to operate these services. What do we actually have to worry about when they're running? If we take a look at our services, what happens when we see a failure? The chess engine is the service I'd be most worried about. Chess engines fail: they're computationally expensive, they might fall over a lot, they're fairly performance-critical, and some of their implementation choices are sometimes a bit questionable. So that's something I'm going to have to worry about. When I talk about failure of a service in an operational context, I'm not thinking about a single service; I most likely have several deployed, and I might be talking about all of those being broken, or just a single instance being down. Our first job is to understand what is failing and how it's failing, so that we can start to do something about it. There are many useful monitoring techniques we can apply, but a simple health check on a status URL can be really effective. It provides a nice way to introspect on services and can be used to determine their health. This one has an explicit status: OK, or KO if the service is not okay. Or we could simply check whether we can contact the URL at all.
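The status-URL idea can be very small in practice. Here is a minimal sketch using only the Python standard library; the /status path and the OK/KO body follow the convention described above, while everything else (the port, the handler structure, what "healthy" means) is an assumption for illustration.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def service_is_healthy():
    # Whatever cheap, local check makes sense for this service: do we have
    # worker capacity, can we reach the dependencies we genuinely cannot
    # live without. Deliberately not a list of every optional downstream
    # service (that is the entangled-health-check trap discussed below).
    return True

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/status":
            body = b"OK" if service_is_healthy() else b"KO"
            self.send_response(200 if body == b"OK" else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def check(url, timeout=1.0):
    """What a router or monitor might do: contact the status URL and treat
    anything other than a fast 'OK' as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read() == b"OK"
    except OSError:
        return False

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), StatusHandler).serve_forever()
```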
Once we've identified what is failing, we need mechanisms to route around that failure. This might look different in different environments, but we have to be pretty aggressive about it. If something's failing and we keep sending requests at it, trying harder and harder, we're going to put more load on it and make it harder to recover. We need to give it some space to rest, to get shut down or have a new one started, and to recover. If we don't, the likelihood is that we're going to start cascading those failures upwards. Our chess engine is going to see failure after failure; its requests are going to get slower because it's only succeeding one in every few attempts, which results in more requests to the engines, and more failures, and more failures. If we can isolate those failures and get the remaining services back to returning normally, things will get better a lot more easily and a lot more quickly.

One thing I've run into with teams a lot, when people first start implementing these checks, is the idea of entangled health checks. They have a health check that says my service is okay, or my service is not okay, and when you ask what that actually means, it turns out they've enumerated all of the other services they talk to. My pairing service has all of these notify URLs; are any of those notify URLs down? That's a pretty degenerate case, but I've seen services shut themselves down because they couldn't talk to a completely optional dependency. This is something we have to watch out for: we can introduce coupling during this operational phase. Monitoring can be very effective, but failure is not always that clean either; we suffer from a whole bunch of bad assumptions. History has given us the eight fallacies of distributed computing, and while there's a list of fallacies for pretty much every topic in programming, this is the one I've historically referred to the most.

There are some basic techniques we can use to start combating these assumptions. First and foremost, timeouts save lives. Assuming that a request will either succeed or fail straight away creates so many problems. We want to make sure we know about a failure, and in many cases just going slow is as significant as a failure; in fact it can sometimes be worse, because it can be harder to recover from. Timeouts are so important that we should probably have some sort of government-sponsored public service announcement: slip, slop, timeout. It's that important.

If we time out, if we can't make a request, then we do have to try again, but we have to take care and patience with that: if we have a failure and we keep retrying, we're going to overload our system and make things worse. These plots, which you probably can't see very well, are from some graphs Amazon published about different strategies for backing off retries when doing optimistic concurrency, but the same strategies work just as well for recovering from failure. They use exponential back-off, waiting longer and longer as we see more and more failures, but with a little bit of randomness, so that everyone isn't jumping on the service at the same time when they retry, and with minimum and maximum bounds so things don't get too out of control.
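A minimal sketch of that retry strategy, capped exponential back-off with full jitter around a per-request timeout, might look like the following; the parameter values are arbitrary assumptions, not recommendations from the talk.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base=0.1, cap=5.0, timeout=2.0):
    """Call `call(timeout=...)`, retrying failures with capped, jittered
    exponential back-off.

    The timeout turns "just going slow" into an explicit failure, the
    exponential back-off gives a struggling service room to recover, and
    the full jitter (sleep a random amount up to the current ceiling)
    stops every client retrying at exactly the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return call(timeout=timeout)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of patience: surface the failure to the caller
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))

# Example: a flaky call that fails half the time.
def flaky(timeout):
    if random.random() < 0.5:
        raise ConnectionError("engine unavailable")
    return "best move: e4"

try:
    print(retry_with_backoff(flaky))
except ConnectionError:
    print("engine still unavailable after retries")
```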
Okay, so we've got our timeouts in place, we've got our health checks, we know when things are going well. Things can still fail, right? So what do we actually do? The first thing to remember is that serving some requests is better than serving none. If I look at the chess engine example again, I might not have enough CPUs to service all the requests, so for our chess engine this might mean tapering limits, the staircase here. If we can only service 30 requests at a time, we don't want to send 60 requests to it and let it fall over; we want to stop those requests before they get to it. This applies to things like databases: if we can only handle 100 database queries at a time, and we have three database queries for every HTTP request, don't allow 700 HTTP requests in and then hope that your database isn't going to fall over. Stop those requests as early as possible, so we don't overload our systems and they can recover more quickly.

Finally, if all mitigation and isolation fails and our engine service is totally offline and unable to serve any requests, we need to take more drastic action. There are lots of patterns we can apply, like circuit breakers, to cut off communication to our engine service and hopefully let it recover. But more important than that, we have to understand holistically what this actually means for our total service, and that means thinking about things right through to the user interface. If our computer engines are offline, the "play computer" button probably doesn't make much sense; having people click it and then just get a failure message isn't going to help. Being able to remove it from the interface, or disable it temporarily when we know those services are offline, is a very useful tool to have in our belt.
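For reference, here is a deliberately minimal sketch of the circuit-breaker idea mentioned above. The thresholds and the cool-off period are arbitrary assumptions; real implementations usually add per-endpoint state, half-open probing and metrics.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a while so it can recover.

    After `max_failures` consecutive failures the circuit opens and calls
    fail fast without touching the dependency; after `reset_after` seconds
    one call is allowed through to probe whether it has recovered.
    """

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic time when the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-off over, allow a probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage idea: wrap calls to the engine service, and let the UI hide its
# "play computer" button while the breaker is open.
engine_breaker = CircuitBreaker()
# engine_breaker.call(request_engine_move, game_pgn)   # hypothetical call
```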
Okay, recapping quickly. We need to understand what is failing: we have monitoring, health checks, a whole bunch of techniques to help us identify whether it's a single instance or lots of instances, and what is actually failing. Once we've identified it, we want to route around that failure as quickly and effectively as possible; we don't want to keep piling traffic onto a failing resource. If we're faced with a situation where we can't serve all of our requests, we're better off failing some than failing everything because we tried too hard, and one way to do that is to set safe limits and work within our capacities. Ultimately, being reliable sometimes involves reaching out to your clients: any time we can degrade gracefully, or negotiate some variability in our service, we're able to maintain the independence of our services. If it's a matter of a client not getting one field right now versus our whole service being offline, hopefully people can make fairly sensible decisions about that.

Now, changing systems. One of the big differences from the airline travel I started with is that those engines are tested and built to tolerance over and over and over again, and they don't often have to change the engines mid-flight. But that's what we're constantly having to do when we change software, and that introduces its own challenges to reliability. Whether we've built my ideal independent set of services or we just have one big service, it doesn't matter: this dependence on change management hampers reliability of all shapes and forms. Over time we're not going to have one instance, we're going to have many in production; we're going to keep deploying and keep changing, and each of those is coupled in its own way. If we take a look at our pairing service, we don't have one, we have many, all at a specific version. Those services probably depend on some persistent data store, so when we deploy a new version they tend to be coupled through that data store: the old version is using the same data as the new version, which introduces all sorts of complications. We're also coupling through the interface: when we have two versions up, there are services depending on consistent semantics and consistent structure from them. So this change over time is a big deal.

If we draw it as a black-or-white problem, where we have old or new, we introduce an unreasonable dependence on the result of each deploy. There's a cliff we have to jump off every time: is it going to work this time, is it the same as last time, is it okay? These interactions are very difficult to test for. It doesn't take long before we've got a couple of services up and the permutations of versions in production become infeasible to test in advance; we might have several environments all at different versions, and we're going to hit a case where this one and that one didn't work together. So what do we actually want to do? Basically, we want to avoid the situation where we have these discrete jumps, going from version one to version two and hoping it's going to work. We want to create situations where we can gracefully roll things out and flatten out this time dimension: having a version, being able to deploy the new version safely, having both working in parallel, and eventually switching over to the new one, or, if things aren't going so well, going back to the old version. We have to be able to do this gracefully, and there are some techniques we need to apply, because it's not that straightforward.

The first useful technique is in-production verification. It's probably one of the harder techniques to get right, because it requires a little bit of thought and a little bit of restructuring of your data model to support it. What we want is to have two versions in production but keep their data stores independent, and to have a routing layer that can replicate traffic. When we get a request, we send it to both services, have both provide an answer, and then verify: are these answers the same, or within some tolerance? And if not, well, something's wrong here. There's really no substitute for being able to test against live traffic. I've been in a number of situations where I've inherited large code bases with lots and lots of traffic and very few tests, and it was a choice between working out ways to test against live traffic, or never changing anything because we were so paralysed by fear that if we deployed something we would break something. We had hundreds and hundreds of URLs, questionable semantics, not very many tests. This doesn't have to be live; we could use logs to replay traffic. And there are certain service designs that make this easier. Our pairing service asynchronously returns its results, saying here's an identifier for your pair. That's actually a really easy case to test this way, because we can send all the pair requests to both services, collect all the pair responses from the two services, and just compare and shuffle. It's a great way to build confidence in a service.
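A minimal sketch of that replicate-and-compare routing might look like this: answer from the current version, shadow the request to the candidate, and log any disagreement. The function shapes and the pluggable comparison are assumptions for illustration.

```python
import logging

logger = logging.getLogger("shadow")

def shadow_route(request, current_version, candidate_version,
                 comparable=lambda live, shadow: live == shadow):
    """Serve the request from the current version, replicate it to the
    candidate, and record any responses that don't line up.

    The candidate runs against its own independent data store, so a bad
    new version can't damage production data, and its answer is only ever
    used for comparison, never returned to the caller.
    """
    live_response = current_version(request)
    try:
        shadow_response = candidate_version(request)
        if not comparable(live_response, shadow_response):
            logger.warning("mismatch for %r: live=%r candidate=%r",
                           request, live_response, shadow_response)
    except Exception:
        logger.exception("candidate version failed for %r", request)
    return live_response
```

For an asynchronous pairing service the comparison can be looser still: collect the pairings produced by both versions over a window and check that the same parties ended up matched, ignoring the generated game IDs.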
Next we want to extend this to incremental deployment. It's a slightly different setup: we have two versions in production, both using the real data source, and we just want to offload a small proportion of requests to the new version until we're confident in it. Again we need some sort of intelligent routing layer, and maybe not every request goes across, maybe it's just a certain type of request, until we build up confidence in the new version: most traffic to the old version, a little bit of traffic to the new version. Once we have this in place, if we can understand what success looks like for a service, we can start migrating things more quickly. Is it just that the responses are okay or not okay? That's a really crude measure, but we can go further: validating the structure of our responses, actually verifying things as they happen in production, or having key indicators, like it should take less than two seconds to find a pair most of the time; when we deploy this new service, is it going faster or slower than the old one? If we can build confidence, we can keep routing more traffic until all of it is on the new version, and if things aren't going well we can roll back.

These techniques aren't revolutionary; people have been doing them for a long time, and I'm sure some of you are doing some of them. However, when we talk about these techniques, a lot of the time we talk about using them for big deployments, or for people who have lots of customers or are really worried about uptime. But this is a technique we can use even on small code bases, just to build confidence. I like asking this question: could you survive shipping a bad line of code? This is something I think we'd all like to not be scared of. We can work really hard, we can test our code, we can try to verify and be very careful, but eventually a bad line of code will get through, and what do we want to happen then? By using some of these techniques, by having some of these structures in place, we can actually protect ourselves and cover ourselves in these cases. It doesn't matter if it's a small app; some of these aren't that hard to set up.
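The incremental-deployment routing described above can start out very simple: send a small, adjustable fraction of requests to the new version and the rest to the old one, so a bad line of code only ever sees a slice of traffic. A minimal sketch, with the weighting knob as an assumption:

```python
import random

class CanaryRouter:
    """Route a fraction of traffic to a new version of a service.

    `new_fraction` starts small and is raised as confidence grows
    (responses look right, latency and error rates are comparable);
    setting it back to 0.0 rolls back without a redeploy.
    """

    def __init__(self, old_version, new_version, new_fraction=0.05):
        self.old_version = old_version
        self.new_version = new_version
        self.new_fraction = new_fraction

    def handle(self, request):
        if random.random() < self.new_fraction:
            return self.new_version(request)
        return self.old_version(request)

# Start with 5% of requests on the new version, then ramp up:
# router = CanaryRouter(old_handler, new_handler, new_fraction=0.05)
# router.new_fraction = 0.25   # once the key indicators look healthy
```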
So I've talked a lot about deploying code, and that presents some serious challenges in itself, but the problem is deeper and more challenging once you start to incorporate data: how do we migrate data, how do we change our databases? To tackle these issues it's very important that we deal with them independently. One of my pet hates is services that use the local file system for their storage or persistence, which is pretty much every development tool ever. The number of times I've worked at a company where the hardest service to deploy was the CI server, because it only used the local disk, is amazing to me. It happens, and keeps happening, time and time again. It's easy to understand why: the file system is attractive, it's easy to test, easy to understand, easy to replicate. But it makes things really complicated when we start thinking about reliable services in production: we can't just start a replica of the service, we can't shut down or restart the service without care, and we often can't run multiple versions at the same time. We really want to tease the handling of data apart into reliable data storage; there are nice patterns for doing highly available databases, and we want to deal with that as a separate problem from our code rather than coupling them together on the same server.

This also comes up as an infrastructure problem. I see a lot of people running Kubernetes who assume there's only ever going to be one cluster, but there are a lot of things that are actually really hard to migrate. I've seen people realise they should have deployed their cluster with mutual TLS between services, but actually rolling that out onto an existing cluster is a really hard exercise, so you have to build a whole new cluster, and then they realise they can't actually run two clusters in production; it doesn't work. Again we're in a situation where we have to take a big dive off a cliff and hope we don't break things. We want to be careful any time there's a singular in our infrastructure or our deployment.

Okay, we've got a long way into building this system and no one has asked me to do the impossible yet, which doesn't seem very realistic; when I'm at work, people are always asking me to do impossible things. So let's do something impossible: how do we construct a reliable system from parts that are just not reliable? A little disclaimer first: this is based on a true story. It happened to me, and it could happen to you. I've changed the context to protect the innocent, and to work better with my chess example, but this really is about a machine learning library, just a slightly different one.

Over the last few years there's been a revolution in chess engines. Traditional engines like Stockfish and Komodo are super strong because computers are really fast: they work really hard and basically brute-force calculate the best move right now. But newer, machine-learning-based engines like AlphaZero turned this on its head. Last year AlphaZero beat Stockfish in a hundred-game match, 28 to nil. That's significant because it's really strong, but also because AlphaZero didn't learn by calculating harder; it learned by playing lots of games, which means it plays moves that feel more familiar and natural to humans. It has also led to a cottage industry of people trying to reimplement it. Just like with every machine learning library ever, somebody who has never written Python before goes and tries to implement a Python version of AlphaZero. This is great, it's awesome, people can have fun, they can play around: did it on the weekend in a few hours, carefree, didn't run any tests because it didn't matter, I'm having fun, and it even works sometimes. Cool. What's the worst that could happen? Well, I'm going to put it in production. It's all good.

So how would we go about shipping this piece of code? To say that it's unreliable would be an understatement. The chance of it actually working while I'm here on stage is pretty much non-existent: if I ran it ten times it would probably fail eight times. And it doesn't just fail. They've done exactly what I said I didn't like: they didn't separate out the computation and the data, they used the local file system, and they didn't even use a proper format, they just pickle some data structures and write them down. It gets pretty bad: when it crashes, it often corrupts that data store because it's halfway through writing, and then it loses all of its memory and has to be retrained. So what do we actually do?

If I were faced with this problem, and I might have been, one way to handle it is to create a proxy that takes control of the interactions with this unreliable piece of code. We want to use this proxy to create a more reliable view of the data. The way training these engines works is that it plays a game, and then another game, and another game. So instead of just passing the games straight to the library, we're going to create an immutable, append-only log of every game we've played. We've got our textual representation of games, and we're going to have some reliable storage where we can keep them.
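A minimal sketch of that proxy idea, assuming the engine exposes a train_on(game) call, that games are stored one PGN string per line, and that the log file stands in for "reliable storage":

```python
class EngineProxy:
    """Wrap an unreliable engine behind an append-only log of games.

    Every game is appended to durable storage *before* it is handed to the
    engine, so when the engine crashes and corrupts its own pickled state
    we simply throw that instance away and replay the log into a fresh one.
    `engine_factory` and `train_on` are assumed interfaces for this sketch.
    """

    def __init__(self, engine_factory, log_path):
        self.engine_factory = engine_factory
        self.log_path = log_path
        self.engine = self._rebuild()

    def _rebuild(self):
        engine = self.engine_factory()
        try:
            with open(self.log_path) as log:
                for pgn in log:          # replay every game we have ever seen
                    engine.train_on(pgn.strip())
        except FileNotFoundError:
            pass                         # no games played yet
        return engine

    def record_game(self, pgn):
        # The log is the source of truth; the engine instance is disposable.
        with open(self.log_path, "a") as log:
            log.write(pgn + "\n")
        try:
            self.engine.train_on(pgn)
        except Exception:
            # The library has fallen over: discard it and rebuild from the log.
            self.engine = self._rebuild()
```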
The log just keeps going forever. That way, when the library fails, and it will fail, we can just discard it, let it corrupt its storage, throw it away, and then rebuild the state of the engine from our log by replaying it. By doing this we've created a little independent universe where we can trust that the service is going to act responsibly and keep giving us answers. It's going to be a little bit slow because it has to rebuild every now and again, but that's okay; we don't have to deal with it constantly crashing and constantly being unusable. Once we've created this independence, we can go about recovering our availability: the probability of failure with two instances goes from 80% to 64%, which is not too bad; get it up to ten and we're back to about 10%; twenty and we're back to about 1%. For an experimental chess engine I'm pretty happy with that, but we could keep going, trading operational cost for my sleep. It's a fairly good trade-off.

This idea that reliability isn't just about reliability is important to me. It's about being confident: confident that I can change code, and confident that I can recover if a bad piece of code slips through. I like to be relaxed about things. Just to go through some of the key points: we want to avoid failure through more reliable parts, but we want to do more than that; if our parts do fail, we want to be resilient, and we want to control the scope and the consequences of those failures. Being redundant means having more failures, but also more ways to deal with those failures. We have to think about our architecture; it's our one opportunity for controlling the interactions and controlling the scope of failure. We want to think very carefully about service granularity; there we're trading the likelihood of a failure against the consequences of a failure. Basically, we want to take all of these and make a good decision for our service. It's not going to be the same for everyone, but if we have all of these tools at our disposal, we can make some good decisions.

So, a bit of a whirlwind, lots of different concepts. There are lots of places where you can read more about these, and a whole bunch of them have probably already been mentioned at this conference. I just wanted to present them through a fairly narrow lens: I just wanted to worry about reliability and keeping my services up. Hopefully that's a different way to think about it, and hopefully it's useful to some people. Thanks.
Info
Channel: GOTO Conferences
Views: 1,620
Keywords: GOTO, GOTOcon, GOTO Conference, GOTO (Software Conference), Videos for Developers, Computer Science, Programming, Software Engineering, GOTOpia, Tech, Software Development, Tech Channel, Tech Conference, YOW!, YOWcon, Mark Hibberd, Failure, Change, Reliable Systems, Complexity, Complex Systems
Id: VgDIpEMZenk
Length: 40min 41sec (2441 seconds)
Published: Sun Jul 14 2024