Applying the Saga Pattern • Caitie McCaffrey • GOTO 2015

Video Statistics and Information

Reddit Comments

FYI, here's the talk abstract:

As we build larger, more complex applications and solutions that need to do collaborative processing, the traditional ACID transaction model using coordinated 2-phase commit is often no longer suitable. More frequently we have long-lived transactions, or must act upon resources distributed across various locations and trust boundaries. The Saga Pattern is a useful model for long-lived activities and distributed transactions without coordination.

Sagas split work into a set of transactions whose effects can be reversed even after the work has been performed or committed. If a failure occurs, compensating transactions are performed to roll back the work. So at its core the saga is a failure management pattern, making it particularly applicable to distributed systems.

In this talk, I'll discuss the fundamentals of the Saga Pattern and how it can be applied to your systems. In addition, we'll discuss how the Halo 4 services successfully made use of the Saga Pattern when processing game statistics, and how we implemented it in production.

👍︎︎ 6 👤︎︎ u/goto-con 📅︎︎ Mar 07 2019 🗫︎ replies

Jump in a longship, row across the ocean, and steal code?

👍︎︎ 2 👤︎︎ u/scooerp 📅︎︎ Mar 07 2019 🗫︎ replies

This was a pretty interesting talk, and given in a very accessible way. I have never really touched web server stuff, but I found I was still able to follow most of it, and get the main idea.

👍︎︎ 2 👤︎︎ u/s73v3r 📅︎︎ Mar 07 2019 🗫︎ replies

This talk was awesome. She seems super sharp.

For anyone interested in a client-side implementation of sagas, take a look at redux-saga.

Github: https://github.com/redux-saga/redux-saga

Redux is mainly a library used by React developers, which you might not be... But for a working implementation in a common language it's pretty good.

👍︎︎ 2 👤︎︎ u/start_select 📅︎︎ Mar 07 2019 🗫︎ replies
Captions
I'm Caitie McCaffrey. I'm a distributed systems engineer; that's my disco hoodie, I program in that sometimes. I currently work at Twitter on infrastructure and platform, specifically observability; I just joined in January, and I live in San Francisco. Prior to that I spent six-plus years working in the games industry, on games like Gears of War 2 and 3, and then Halo 4, which is where I spent the bulk of my time, building the services, networking, and distributed systems that power a lot of the entertainment experiences you see when you turn on your Xbox and start playing. That's my Twitter handle, and I have a tech blog, so if you want to talk to me on the internet, feel free.

Like I said, I worked on Halo. I joined in 2010, when 343 Industries was a studio created inside Microsoft to take over the Halo franchise from Bungie, who decided they wanted to go make other games (they made Destiny). I was web service dev number two, hired into this tiny studio of about 30 people, to figure out what we were going to do with the old Halo stack. The original games had services: Halo 1, 2, and 3, and then Reach and ODST, all had services to help you play and experience the game. But they were built the way we were used to building old services: the game talked to a stateless service that you could linearly scale out, and behind it sat one giant SQL database that housed the source of truth for all of the Halo games. We realized that wasn't going to scale when we looked at what we wanted to do for Halo 4 and the future of the franchise. We actually hit 1.5 billion games played for Halo 4 and 11.6 million unique users, and that wasn't going to fit on a single giant SQL database anymore. So we ended up rewriting all the services and moving to the Azure cloud, using Azure Table storage as our largest form of NoSQL storage; that's a key-value store, for those of you who aren't familiar with it.

In this new world we had a challenge: we were used to doing things with transactions on a single database, and that kept our system consistent, which was really nice. But we no longer had that, because all of our data was split across multiple partitions. So we thought really hard about how to handle this (for me, thinking really hard definitely means going to a bar, reading a bunch of papers, and drinking bourbon). We came across the saga pattern, from a paper published in 1987 that I'm going to tell you about, and we implemented it so we could process statistics in a "transactional" way. I'm putting that in quotes because we are giving up some things; as you saw in earlier talks, we do not have serializability. But it does give us a way to guarantee consistency from the beginning to the end of the transaction, and it tells us what to do in failure scenarios.

So I'm going to talk to you today about sagas: a little motivation for why we need them; then the original paper and its contribution to the database space (it actually came out of the database community); then how you do this in a distributed world, because it is a bit different, and there are additional challenges and constraints you need to impose; and then how we actually did this in Halo 4.
OK. Our systems used to be really simple: an app or a website talked to a stateless service, which talked to a giant canonical database, the source of truth. So you got transactions, serializability, and ACID. Serializability is the guarantee that things look like they happened sequentially. Atomicity is the idea that either everything happened or it didn't, and we do not see the state of things being processed in between. Consistency here is application-level consistency (not linearizability): when we start a transaction the system is in a valid state, and at the end of the transaction the system is in a valid state, whether the transaction completed or not. Isolation is that transactions do not affect each other if they are running concurrently. And durability is the idea that once we've committed something it persists; it's long-lived, and we don't lose data.

But now we're in a world of service-oriented architectures and microservices, and you don't have canonical sources of truth anymore. Often when we write applications we're interacting with some of our own APIs, or with someone else's APIs, and they all have their own backing stores. There is no one canonical source of truth. We don't get transactions anymore; we just can't have them.

One way we've tried to solve the distributed transaction problem is two-phase commit. Two-phase commit is an atomic commitment protocol, basically a specialized version of a consensus protocol, and it has been implemented and used at small scale. I'll go through it quickly. You have a prepare phase, where a single coordinator (who is very special in the system, and a single point of failure) proposes to all of the resources involved: "hey, we're going to go do this thing." All the resources get to vote, yes I want to do this thing or no I do not, and the coordinator collects all of the votes. Once the coordinator has received all of the votes, we enter the commit phase: if everyone said yes, it tells everyone to commit the transaction, and that's how atomicity is achieved, because everyone commits at the same time. If anyone says no, it tells everyone not to do it. So there's no state where you could have the car reserved but not the hotel in your travel app. At the end, everyone must say "done" so the coordinator knows what has happened.

This is kind of nice, and people actually use it, but it doesn't scale all that well. In failure conditions you can have up to O(N²) messages going through your system. The coordinator is, once again, a single point of failure: if it dies at any point during the transaction, the transaction is just aborted. The more resources you have, the more latency there probably is (you're bounded by the slowest resource in the system), and the larger the chance the coordinator fails. And there's reduced throughput through your system, because you're still holding locks on those resources as you operate over them, so it's slow.
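To make the two phases concrete, here is a minimal, illustrative sketch of the coordinator logic described above. It's a toy model, not the talk's code: `Resource`, `prepare`, `commit`, and `abort` are hypothetical names, and a real 2PC implementation additionally needs durable logging and timeout handling on both sides.

```python
class Resource:
    """A toy participant that votes on, then applies or discards, a change."""
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes

    def prepare(self):
        # Phase 1: the participant votes; a real resource would also
        # durably record its vote and hold locks until the decision.
        return self.will_vote_yes

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")


def two_phase_commit(resources):
    # Phase 1 (prepare): the coordinator collects votes from everyone.
    votes = [r.prepare() for r in resources]
    # Phase 2 (commit/abort): unanimous yes means commit; any no means abort.
    if all(votes):
        for r in resources:
            r.commit()
        return True
    for r in resources:
        r.abort()
    return False


# Either both the hotel and the car are booked, or neither is.
two_phase_commit([Resource("hotel"), Resource("car", will_vote_yes=False)])
```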
It's also worth pointing out that none of the large cloud providers implement a two-phase commit protocol, and they specifically say they don't; Azure has a blog post on this. It just doesn't scale well, and it's not something they want you using in their systems, because they're worried about all of these things.

And then Google has Spanner. This is how they do their distributed databases; it's another paper that came out of Google, and I highly recommend going and reading it. I'll cover it briefly. It's also a great way of challenging your notion of time, because one logical notion of time doesn't really exist. Spanner is the way Google does globally distributed databases and transactions between databases across the world, and the way they make all of this work is the TrueTime API, with GPS receivers and atomic clocks installed in all the data centers. It's also worth pointing out that Google has fiber between all of their data centers, so they're not going over the normal network that most of us use, which is much faster. What the TrueTime API does is take all of the inputs from the system, including the GPS and the atomic clocks, and calculate a bound on the time when an event occurred; they use that to create a total ordering in the system. But they're only getting the window down to about 0 to 7 milliseconds: clock synchronization is still not a solved problem, they just get a really, really tiny window, and it works pretty well. Obviously Google Spanner is not available to all of us; it's really expensive, and it's proprietary.
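As a rough illustration of the idea (not Spanner's actual implementation), TrueTime exposes "now" as an interval rather than a point, and Spanner's commit-wait rule uses the interval to get real-time ordering: pick a timestamp at the top of the uncertainty window, then wait until the window has passed it before making the commit visible. A toy sketch, with a made-up `EPSILON` matching the talk's ~7 ms figure:

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # assumed ~7 ms clock-uncertainty bound, per the talk

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    # TrueTime returns an interval guaranteed to contain "true" time.
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait():
    """Pick a commit timestamp, then wait out the uncertainty window.

    After this returns, every later tt_now().earliest exceeds the
    timestamp, so timestamp order matches real-time order.
    """
    s = tt_now().latest            # commit timestamp
    while tt_now().earliest <= s:  # wait until s is definitely in the past
        time.sleep(EPSILON / 2)
    return s

print(commit_wait())
```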
So, once again, distributed transactions are not solved for the masses, and synchronization is not solved. What I'm trying to get at here is that distributed transactions are really hard and really expensive, especially if you want serializability and ACID. So can we do better? Can we get distributed transactions with serializability and ACID? Well, I'm not a researcher, I just build large-scale distributed systems, and right now the answer is no: because of the CAP theorem we can't have nice things, and distributed systems are terrible. But we can start to give up some of these really tough constraints like serializability and ACID: what can we give up and still program in the model we're used to, and still build correct systems? This is where sagas come in.

Sagas actually come from the database literature. The paper was published in 1987 by Hector Garcia-Molina and Kenneth Salem of Princeton University, and it looks at how to do long-lived transactions on a single database. They came across this problem: a transaction that consumes a lot of resources or runs for a really long time, like computing a bank statement over a large range of history, creates a bottleneck even on a single database, because it holds locks on a bunch of different resources while it runs. They knew they couldn't keep full ACID and serializability, so they asked: what can we do to make this faster and still get some guarantees? They came up with the concept of a saga, and they proposed it for long-lived transactions on a single database.

They defined a saga as a long-lived transaction that can be written as a sequence of transactions that can be interleaved with one another; either all the transactions of the sequence complete successfully by the end of the saga, or compensating transactions are run to amend the partial execution. Let's break that down so it's a little more understandable. A saga is a collection of sub-transactions: you take whatever large thing you want to do and break it down into these little pieces of transactions, and each of these can be ACID on the single database. The one gotcha is that they cannot depend on one another; that's where the interleaving comes from: T2 can't take an input from T1, and there's no ordering that is supposed to hold between them. If you can break your work into pieces like that, then you have the start of a saga.

Each transaction in the saga has a compensating transaction, and once again the interleaving and non-dependence principles hold there as well. The thing about these compensating transactions is that they semantically undo the transaction they correspond to. The paper actually acknowledges that because we're dividing the work into these pieces, you sometimes can't return the system to the same state it was in before the transaction started. For instance, if one of your transactions sends an email, you can't unsend that email, but you can send a follow-up email to say "hey, this is what happened, and we're correcting it." So it's a failure management pattern.

After we define all of this, we get a nice guarantee: either all of the transactions completed, so our saga was successful and whatever unit of work we wanted to do has finished; or some of the transactions completed and all of their compensating transactions have run, so the saga failed, but we are back in a semantically consistent state, as before the saga started. That's actually a really nice model to program against, because it's very similar to transactions: we keep consistency and application correctness throughout our system. What we've done, though, is trade off atomicity for availability: these sub-transactions execute independently of one another, so you will see pieces of the saga completing before the whole saga has completed. If you need atomicity, this doesn't work for you, but for a lot of our applications this is OK. I also really like sagas as a whole because they're a failure management pattern. Building distributed systems, we should plan for everything to fail, especially in the cloud, especially if you don't own your own networks (even if you own your own networks, you should plan for everything to fail). Forcing your application developers to think about failure first and foremost in your design is going to help you build more robust, less fragile systems, because you don't just code the golden path; you actually have to think about what happens when you go off the golden path.
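A minimal sketch of that definition in code, assuming simple in-memory steps: each saga step pairs a sub-transaction with the compensating transaction that semantically undoes it, and on failure we run the compensations for whatever had completed (the backward recovery walked through below).

```python
def run_saga(steps):
    """steps: list of (transaction, compensation) pairs of callables.

    Runs each sub-transaction in order; if one raises, runs the
    compensations for everything that completed, in reverse order.
    """
    completed = []
    try:
        for transaction, compensation in steps:
            transaction()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()   # semantically undo the completed work
        return False         # saga failed, but state is consistent
    return True              # saga succeeded


def book_car():
    raise RuntimeError("no rental cars available")

# Toy travel saga: the car booking fails, so the hotel is released.
run_saga([
    (lambda: print("book hotel"),  lambda: print("cancel hotel")),
    (book_car,                     lambda: print("cancel car")),
    (lambda: print("book flight"), lambda: print("cancel flight")),
])
```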
OK, so how does this work on a single database? In our travel application, when you try to book a trip, instead of one large transaction that tries to reserve all of the things we need for the trip, we split it up into logical units of work: booking a hotel, booking a car, booking a flight, plus cancelling all those reservations in case the saga fails. On a single database you have a process called the saga execution coordinator, or SEC. The SEC lives in-process with your database, so it shares the same fate (we're not in a distributed world just yet). Its job is to go execute the sub-transactions and, in the case of failure, to start applying compensating transactions. You also have a saga log, which is just like your normal database log: you commit messages to it like "begin saga", "end saga", "abort saga", plus the begin and commit messages for all the transactions and compensating transactions.

To give you an idea as you go through it, a successful saga looks like this: I want to book a trip, so I start my saga in the log. I start booking the hotel; at that point I own the hotel resource, and the rest of the world can see it (we don't have atomicity). I continue: I begin booking the car, I end booking the car, and I now have the car resource. Finally, if I can get my flight, I can go on my trip, and we can say the saga completed successfully. It's also nice to note that in your application you can order things in a risk-centric way: sometimes there are penalties for cancelling a flight, so maybe that's the last thing we want to do, because I don't want to have to go roll that back. There is a reason I'm showing you this ordering.

So what happens when things fail and break? We have unsuccessful sagas. The paper actually discusses a couple of different failure modes, backward recovery, forward recovery, and a few others, but I'm going to talk about backward recovery, because I think it's the most common and the most useful. This is the idea that as soon as we have a failure, or can't complete a transaction, we just start rolling back, applying our compensating transactions for anything that ran. An unsuccessful saga looks like this in the log: we begin the saga; maybe we start booking the hotel, and that succeeds, so we own the reservation; then, for whatever reason, booking the car fails, because maybe there wasn't an available rental car that day. Now we need to roll back and start applying compensating transactions: we apply the car's compensating transaction, which frees any resource that happened to be taken, and then we free the hotel reservation as well. Now we're back in semantically the same state. I'm sad because I don't get to go on my trip, and I can try to rebook later, but the application is still in a consistent state, even though we held on to that hotel resource for a brief period of time; now someone else can go take it. It's just like a normal transaction on a single database: the transaction is ACID, so it either completes successfully or it doesn't (an abort on a single database is like someone cancelling it).
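The log contents the talk walks through might look like this; a hypothetical record shape, with one stream shown for the successful run and one for the aborted run:

```python
# One possible shape for saga-log records: (saga_id, message, detail).
successful_saga = [
    ("trip-1", "begin saga", None),
    ("trip-1", "begin", "book hotel"),
    ("trip-1", "end",   "book hotel"),
    ("trip-1", "begin", "book car"),
    ("trip-1", "end",   "book car"),
    ("trip-1", "begin", "book flight"),
    ("trip-1", "end",   "book flight"),
    ("trip-1", "end saga", None),
]

aborted_saga = [
    ("trip-2", "begin saga", None),
    ("trip-2", "begin", "book hotel"),
    ("trip-2", "end",   "book hotel"),
    ("trip-2", "begin", "book car"),    # the car booking fails here...
    ("trip-2", "abort saga", None),     # ...so we roll back
    ("trip-2", "begin", "cancel car"),  # compensate anything begun
    ("trip-2", "end",   "cancel car"),
    ("trip-2", "begin", "cancel hotel"),
    ("trip-2", "end",   "cancel hotel"),
]
```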
OK, so: sagas in a distributed system. The paper actually comments on this, and it does my favorite thing that early database literature does: it says that due to space limitations they only discuss it on a centralized system, "although clearly it can be implemented in a distributed database." I laughed, because this is my life. So thanks, Dr. Garcia-Molina. We're actually going to go through and implement this. There are blog posts and such written about this pattern, but I don't like any of them, because I don't think they give you enough detail to implement the full guarantee. So we'll talk about how you actually do that, why it's different in a distributed world, and some of the additional restrictions we have to apply in order to get the same guarantee.

We're back in this world: we no longer have one canonical database source of truth, and we're probably operating with a bunch of services that hold different data, using whatever data stores they choose and applying their own constraints. This translates really nicely: we can still break the unit of work into requests (book hotel, book car, book flight) plus the cancellation requests. So far so good. I'm going to do a little term redefinition here, because I don't like using the word "transaction" in the distributed sense. It's not really a transaction; it's generally going to be a request. "Transaction" implies ACID semantics to me, and these requests could be ACID if the service gives you that guarantee, but it's really up to each service to define what guarantee you have. Hopefully they give you consistency and durability; you're probably not going to get isolation or atomicity from any of these. Maybe; it depends on what API you're interacting with. You still have your compensating requests as well, which semantically undo the requests that happened.

A successful distributed saga looks exactly the same, and we still need a log, but now it's not co-located with a database, because we don't have a single database. We have to have a durable, distributed log that lives somewhere. That might be Kafka, Azure Service Bus, or whatever you want to use; it just needs to function as a log. You still need a saga execution coordinator. It is, once again, a process, and it isn't co-located with anything. It's not special, and I want to point that out: unlike the coordinator in two-phase commit, it can die; it has no state; it doesn't do anything special. All of our source of truth is still in that log. The SEC is the process that interprets and writes to the saga log, applies our saga sub-requests, and, in the case of failure, decides when it needs to start applying compensating requests.

So let's walk through what this looks like, because things are a little different now. A service commits a message to our saga log saying "I want to start a saga," a large unit of work, and it can commit a bunch of data along with it; these are not just start and commit messages anymore, they can carry everything needed to process the request. A saga execution coordinator is spun up (or is already there), sees that the saga needs processing, and starts reading the saga and figuring out what it needs to do.
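The talk doesn't prescribe a log API, but the role the log plays is small. A hypothetical minimal interface (here just an in-memory stand-in for something durable and distributed like Kafka or Azure Service Bus) could look like:

```python
import collections

class SagaLog:
    """Stand-in for a durable, distributed log (Kafka, Service Bus, ...).

    The only operations the SEC needs: append a record for a saga,
    and read back everything recorded for that saga so far.
    """
    def __init__(self):
        self._records = collections.defaultdict(list)

    def commit(self, saga_id, message, detail=None):
        # A real log would only return once the record is durable.
        self._records[saga_id].append((message, detail))

    def read(self, saga_id):
        return list(self._records[saga_id])

log = SagaLog()
log.commit("game-42", "begin saga", {"players": ["p1", "p2"]})
print(log.read("game-42"))
```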
It starts by committing a message to the log that says "I'm going to do the first request in the saga," and that has to commit successfully before it can do anything else. It then sends the request to the service responsible for handling it, and once that responds, it commits an "I finished this request" message to the log. I walked through this fairly slowly to show you that we now have several additional points of failure that we did not have on a single database, where everything was co-located and the SEC, the log, and all of the transactions essentially shared the same fate if anything crashed. Now we have a bunch of places where things can fail. The second request does much the same thing: commit the start message to the log, make the request, receive the response, commit the end message to the log. And the same with the third: commit the message, send the request, receive the response, and end the saga. This is a successful saga; it still all works the same, and life is good.

So what happens when things fail? When do we need to start applying these compensating requests? We still have the idea of an aborted saga. A service could respond "no, I'm not going to do this thing for you, because I just can't," or it could be unavailable, or it could say "you don't have access to do this"; the start requests can fail with whatever HTTP failure response, or the service simply isn't there. And in some cases, if the SEC crashes, we might have to start compensating actions, because we haven't applied any additional constraints on these requests: they don't have to be idempotent, there's nothing special about them, you can do whatever you want with them. We'll talk more about SEC crashes in a second.

Let's walk through a failure case. We've attained the first request successfully. I've now started the second request, and the service says "no, I'm not going to do this, I want you to abort the saga," or we get a failure message. The SEC then commits an "abort saga" message to the log, so now we know we're in rollback and need to start applying the compensating requests. It does the same thing as when applying a regular request: it commits the start message to the log, sends the compensating request to the service, hopefully that succeeds, and then it commits the end message to the log. And then the same for the first request, because the SEC is just reading the log to understand that it still has to apply the compensating request for service one as well. So we still have roughly the same kind of guarantee we had before.

But what if compensating requests fail? We're now in a world where they're not transactional. Because they can fail, we need to be able to retry them indefinitely until they succeed, and that imposes an extra constraint on our system that we didn't have in the database world: our compensating requests have to be idempotent. I have to be able to replay them until they succeed. That's a bit different from the normal single-database case.
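That "retry indefinitely" rule only works if replaying the compensation is harmless. A small sketch of the idea, with hypothetical names, where the compensation records the cancellations it has already applied so that retries become no-ops:

```python
import time

cancelled = set()  # durable in a real system; the service's own record

def cancel_reservation(reservation_id):
    """Idempotent compensating request: safe to replay any number of times."""
    if reservation_id in cancelled:
        return  # already undone; replaying is a no-op
    cancelled.add(reservation_id)
    print(f"cancelled {reservation_id}")

def apply_compensation(compensation, *args, delay=1.0):
    # The SEC may retry a compensating request indefinitely until it
    # succeeds; idempotency is what makes that safe.
    while True:
        try:
            compensation(*args)
            return
        except Exception:
            time.sleep(delay)  # back off, then try again

apply_compensation(cancel_reservation, "hotel-123")
apply_compensation(cancel_reservation, "hotel-123")  # replay: no effect
```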
And then, finally, what happens when our SEC fails? Like I've said, this little process is not special: we can just spin up another one to continue, and it doesn't even have to be on the same machine. We do have to determine whether the saga was in a safe state when the SEC crashed. A safe state is one where all started sub-requests are complete, with both the start and the end logged, because at that point we know what happened: we just left off somewhere, and we can pick the saga back up wherever it stopped. If you're in an aborted state, that's also safe: because compensating requests are idempotent, we know we can just keep replaying them; even if only the start of one was committed, I just send it again. Where we have to start applying rollback is the case of uncertainty: we logged the start of a request but don't have its end logged. I don't know if the SEC crashed before it even sent the request, or whether it got a response back and then crashed; I don't know whether the service ever saw it. And it's not safe to replay that request, because we haven't put any additional constraints on normal requests. If, in your system, you can also make your normal requests idempotent, you never have to worry about SEC recovery: you just bring it back up and start reprocessing. But I think making all requests idempotent is a tall order to impose on your normal method of execution. So what happens is: if the SEC comes back up in an unsafe state, it commits an "abort saga" message to the log and starts the compensating requests.

So essentially what we've done here is define messaging semantics on top of these requests: our saga sub-requests are at-most-once (they will be delivered zero or one times), and our compensating requests are at-least-once (they will be delivered one or more times). You need to know that when you're designing your saga, to make sure the systems you're making these requests against can handle it. But now we're back to the same saga guarantee we had on a single database, and that's really great: either my whole saga has completed and I'm in a semantically consistent state for my application, or my whole saga has not completed and I'm still in a semantically consistent state for my application. Being able to program within those bounds is a big win.

So, to recap distributed sagas: they're very similar to the single-database version, except that you now need a distributed, durable saga log (and you can use whatever you like for it); you need an SEC process, which isn't special, but you do need something that continually spins it up and makes sure it's running; and your compensating requests now have to be idempotent. That's different, but I don't think it's too tall an order for us as application developers: if you think about RESTful services, POST is not idempotent, but DELETE is idempotent, or at least semantically it's supposed to be.
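Putting the crash-recovery rule into code: given the records logged for one saga, a restarted SEC can classify where it stands. A hedged sketch, using the same hypothetical record shape as the log examples above:

```python
def recovery_action(records):
    """Decide what a restarted SEC should do for one saga.

    records: list of (message, detail) tuples read from the saga log.
    Returns "done", "compensate", or "continue".
    """
    messages = [m for m, _ in records]
    if "end saga" in messages:
        return "done"
    if "abort saga" in messages:
        # Compensations are idempotent (at-least-once), so just resume
        # replaying them until they all succeed.
        return "compensate"
    started = [d for m, d in records if m == "begin"]
    ended = {d for m, d in records if m == "end"}
    if all(d in ended for d in started):
        # Safe state: every started sub-request has start+end logged,
        # so pick the saga up wherever it left off.
        return "continue"
    # Unsafe: a start without an end. We can't safely replay a normal
    # (at-most-once) request, so log "abort saga" and roll back.
    return "compensate"

print(recovery_action([("begin saga", None), ("begin", "book hotel")]))
```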
OK, so let's go back to this guy. This is the Master Chief, the main character of the Halo series, in Halo 4. Like I said before, when we started looking at how to move from the single-database world into this multi-partitioned world of Azure NoSQL, we ran into some problems. So I'm going to talk about the statistics service, the main service in Halo that controls everything about your player. I wrote the majority of this service. We have very, very detailed analytics on what you're doing at essentially all times in the game, and we do this by getting a giant blob of data uploaded to us at the end of every game, and sometimes while you're playing. One of the problems with this system is that people care a lot about these statistics: they get really, really upset when you screw them up (it's always the best game they've ever had, shockingly). And I mean, gaming is a thing: people play games like Halo for money, competition, and prize money, and people like to talk trash with their friends. We also have this whole website where you can go deep-dive through your stats, and that's just a fraction of what we have per player.

Some things to know about the statistics service: we could have 1 to 32 players per game, and when we did this migration to fully cloud storage-as-a-service, all of our player data was in Azure Table storage, which is a key-value store, with each player in their own partition. So now you're talking about writing data to maybe 32 partitions, and it still sort of has to look like one unit of work that either all succeeds or all fails. We were, you know, a little stumped, but sagas came in and helped us do this.

If you've seen any of my Halo talks before, this is what I normally show: we built the bulk of the services using a Microsoft Research actor framework called Orleans. It's not super integral to this, but the little diamonds in the diagram are actors, or "grains," from Orleans. What happens is: the Xbox sends us a bunch of statistics; the game actor (you can think of it as a process, if you're not familiar with the actor model) aggregates those and writes to blob storage; and at the end it says "hey, all the players, please go update your statistics," and all of those players write to their corresponding partitions in Azure Table storage. And then I do a hand-wavy song and dance and say "it doesn't fail, there's a way to deal with failure."

So here's what it actually looks like, because it can fail. What actually happened is: the Xbox talks to our stateless front-end service, which we wrote in F#, and that logs a message to Azure Service Bus, which is a queueing service and message broker in Azure. It commits the payload of stats to the Service Bus, saying "hey, go start the saga for this game." That's what we used as our saga log. We had these router grains, stateless router workers, that would just notice "hey, something got committed, we need to spin someone up to go process this work," and we treated our game actors, or game processes, as our SEC: all the routers did was notice that new work was there and spin up the right game grain to handle it. The game grain then acted as the SEC, doing the whole thing of sending all the messages and committing back to the Azure Service Bus saga log.
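A toy sketch of that shape in Python (the real system was F#/C# with Orleans grains; all names here are hypothetical, and it reuses the `SagaLog` sketch from earlier): the game "grain" plays the SEC, reads the stats payload off the log, and fans the per-player writes out to their partitions, reporting which ones failed.

```python
def process_game(saga_log, game_id, write_player_stats):
    """Game-grain-as-SEC: fan one game's stats out to per-player partitions.

    write_player_stats(player, stats) writes to that player's own
    partition and raises on failure.
    """
    records = saga_log.read(game_id)
    (_, payload), = [r for r in records if r[0] == "begin saga"]
    failed = []
    for player, stats in payload["stats"].items():
        try:
            write_player_stats(player, stats)
            saga_log.commit(game_id, "end", player)
        except Exception:
            failed.append(player)  # handled by forward recovery, below
    if not failed:
        saga_log.commit(game_id, "end saga")
    return failed

log = SagaLog()
log.commit("game-42", "begin saga",
           {"stats": {"p1": {"kills": 10}, "p2": {"kills": 7}}})
print(process_game(log, "game-42", lambda player, stats: None))  # -> []
```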
So what happens in failure? You maybe don't want to just roll back statistics, because if I've committed your statistics and you can see them, even though your friends can't, and then I roll them back, that looks really jarring: "you have a hundred kills, oh, now you have ninety-eight." That's a really bad user experience. Also, we wanted to process this data eventually anyway; it's not like we were going to throw it away if we failed to write it for one player. So we implemented forward recovery, which is another pattern specified in the saga paper. It basically says: this saga always needs to succeed, so just recover forward; don't roll back. The basic premise is that you checkpoint: you ask "when am I at a safe state?", you checkpoint those safe states, and if you have a failure, you roll back to the safe state and then roll forward. Luckily for us, any player succeeding in writing was essentially a safe state; we didn't have to do any rollback. We would just retry the saga later if someone failed to write.

I'll walk you through what that looked like. We had our game grain, which is our SEC; it talked to all of the players, and the players talked to their own partitions. Say player 3 couldn't write to its partition, because we blew our IOPS budget on that storage account or something, I don't know. Three of the players can now see their statistics for that game on Halo Waypoint, and player 3 cannot, and that's OK. People knew that our statistics processing was somewhat asynchronous, and generally Halo players are a little narcissistic: they weren't looking at their friends' stats, they were looking at their own. We actually never had complaints about this, so that was fine. We would then have to go replay player 3 through the system: we would put the message back on and back off on processing it for a while to give the system a chance to catch up, because you don't want to just hammer your storage account again if it's already saturated.

If you're noticing, I'm not rolling back; I'm actually replaying the request to player 3 again. And because the game grain didn't know where it failed (it didn't know whether the write happened and it just failed to get the response back, or what had happened), if you're going to do forward recovery in a distributed system, the sub-requests also have to be idempotent. We were able to do some nice little tricks and rely on our database's consistency to ensure we weren't double-counting statistics; all of our requests were idempotent because they were essentially just set operations anyway. So when the saga got retried later under forward recovery, the game grain would send the request to player 3's stats grain, it would successfully store its statistics, and now we know we've processed all of the data in the game. Then we would just discard it; we didn't keep a persistent log of everything, because that wasn't useful to us. We stored the data we needed and threw the rest away. At this point we have the guarantee that all of the players in the game had processed their statistics, our system was in a consistent state, and everyone could see their statistics for the game. Cool.
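The idempotency trick is worth making concrete: if a player's stats for a game are written as a set/overwrite operation keyed by the game ID, rather than as an increment, replaying the write can't double count. A hedged sketch with a hypothetical schema:

```python
# Each player's partition maps game_id -> that game's stat line.
# Writing is a pure "set" keyed by game_id, so replays are harmless.
partitions = {"player3": {}}

def write_player_stats(player, game_id, stats):
    # Idempotent: setting the same key to the same value twice is a no-op,
    # unlike `totals["kills"] += stats["kills"]`, which would double count.
    partitions[player][game_id] = stats

write_player_stats("player3", "game-42", {"kills": 100, "deaths": 3})
write_player_stats("player3", "game-42", {"kills": 100, "deaths": 3})  # retry
assert partitions["player3"]["game-42"]["kills"] == 100
```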
Now, I'm lying a little bit about what we actually did, and this is where I like to come in and tune systems, because there's a trade-off here. If you're going to write for 32 players back to the Azure Service Bus log, which has to be consistent (you hopefully have a CP log, because you don't want to just drop data or have errors in there), it's kind of expensive to talk to that thing 32 times per game. So we optimized for our failure case. Writing to a single partition was not expensive, because players were only doing that about once every 20 minutes, when they had finished a game. So for us it was easier, and a better trade-off all things considered (network latency included), to just retry everything, because we knew we were idempotent. That's a trade-off you can make. When we had a failure, forward recovery would actually just re-blast everyone and say "process your statistics"; the ones that already had them wouldn't double count and re-store, they would just say "hey, I processed this correctly," and we were done. I like to point this out because I think as an industry we hide some of the shortcuts we take, but it's actually really valid to take these shortcuts when you're tuning a very bespoke instance of a system. This is a very one-off instance: you could go build a general distributed saga execution coordinator and system, but you can also just go build one that works for your system, and that's totally valid. This is a pattern for you to utilize in building your applications and making sure they are correct.

So, to recap: sagas are these long-lived, distributed "transactions," and you should use them in your systems because I think they're a really helpful way to think, design, and program in a model we're used to programming against. You are trading off atomicity for availability, so if you cannot tolerate seeing partial execution of the work in the saga, you can't do this; but in most cases we can, and then we can take corrective actions. My skew in the world is towards highly available systems (games, social media, stuff like that), where there's a very real business consequence to not being able to take actions and do things: if you can't play Halo, that affects the amount of money we make, and that's bad. So making this trade-off was fine for us, and if we had to take corrective actions, we were going to go do that. And finally, it's a failure management pattern: it helps you build more reliable, more robust, less fragile systems, and I like it as a pattern on the whole because of that.

Finally, I want to thank a bunch of people who helped me out with this talk: Peter Bailis, Ines Sombra, Thomson Tarot, Kyle Kingsbury, Jeff Hodges, and Clemens Vasters. Without them I could not have done this. And now, if you guys have questions, I'm happy to take them. [Applause]
Info
Channel: GOTO Conferences
Views: 161,485
Keywords: Caitie McCaffrey, GOTO, GOTOcon, GOTO Conference, Computer Science (Field Of Study), Programming Language (Software Genre), Mathematics (Field Of Study), Software (Industry), Distributed Systems
Id: xDuwrtwYHu8
Length: 34min 15sec (2055 seconds)
Published: Mon Jul 13 2015