Event Sourcing: Traceability, Consistency, Correctness - Thomas Bøgh Fangel - DDD Europe 2020

Captions
So, a bit about myself. My name is Thomas Bøgh Fangel. I work at a company called Lunar; I've been there for about four years, and before that I'd been implementing distributed systems in various companies since 2004. Event sourcing wise my experience isn't that long: we started doing event sourcing a bit more than a year ago, and our first event sourced service went into production around this time last year, so we're pretty new in this space.

A bit about the agenda. First I'm going to give you some context: what is Lunar, since I guess most of you don't know, and a bit of history, the evolution of our tech stack, which is important for understanding why we chose event sourcing. Then a bit about what it takes to build a bank from scratch, primarily concerning our tech vision, because it's this tech vision that has guided our implementation. Finally we arrive at event sourcing: why we chose it and how we're doing it, some patterns we've seen emerge from our implementation, and some challenges and learnings in case you're going to go this way yourself.

So what is Lunar? Outside the Nordics there aren't many who know us. We are a smartphone-only challenger bank, founded back in 2015, trying to disrupt the banking industry. From the start the business model was to build on top of an existing bank: we provide our experience using a partner bank, while hiding the partner bank as much as possible. From the beginning it has been in our DNA to deliver a best-in-class user experience and best-in-class support for our users. We went live in 2016, and currently we are present in Denmark, Norway and Sweden. A few facts: right now we are a bit more than a hundred employees, of which twenty-five percent (twenty-five, thirty, around that number) are developers. We have 150,000 users who together produce more than a million (probably more like one and a half million) transactions per month. Our platform is made up of roughly 100 microservices (we lost count), and we are running three Kubernetes clusters on AWS.

Here's a quick view of the evolution of our platform. Like so many other startups we started with a backend in Rails and a few Java components, with integrations to our partner bank at the time. In the beginning the backend ran on top of Postgres managed by AWS, and everything was deployed on AWS using static Docker Compose; I guess a very common setup for a startup that has to ship something quickly. Since then a lot has happened. We're now running a microservice platform, as I said; we've broken up the monolith, an adventure that started in the summer of 2016. As you can see up there in the corner, Rails is still there (it really takes time to kill your legacy), but it's much slimmer than it was before, and it's actually only acting as an adapter into the Danish clearing system. We have a lot of different types of services, roughly grouped into core services, feature services and adapters to various partners delivering parts of our product. We are still on Amazon, but now using Kubernetes in a GitOps style. We have a very active DevOps team who also do a lot of speaking about this (there are some links later on), and we have some open source projects around Kubernetes and DevOps. We've added RabbitMQ as a message broker, so all services communicate over Rabbit, and all business domain events are published on Rabbit.
We're still using Postgres, which is an example of what Nat talked about earlier: sticking with boring technology can be a very good idea when embarking on something new. Just use technology that you feel good with, and we do with Postgres; it's well-known technology, and all services primarily use it.

OK, a quick assessment of that platform. It has served us well and made our growth possible. On the good side, the microservice architecture really delivers on its promise: our teams get autonomy (we run a Spotify-like squad model with autonomous squads), and they can deliver new features fast without breaking all the other stuff. Furthermore, the standardization we've made across services and the DevOps setup give us a very maintainable platform. The messaging part is also working very well: it's easy for services to produce events, and it's very easy for consumers to start consuming and to implement new features on top of already existing events. On the bad side: messaging again, because we don't have 100% delivery guarantees, due to some of the choices we've made. Once in a while that causes consistency errors, which isn't good. Furthermore, nearly all of our services are CRUD services, so we update state in place, and once in a while that causes headaches when we have to figure out how a user's data ended up in a given state: because we've updated it, we simply don't have the history of the state.

So that's a bit of history, which is important for understanding what I'm going to talk about now. Let's look to the future. About half a year ago, in August 2019, we got our own banking license, so we're actually going to be our own bank now, a real bank. The reason for getting that banking license is pretty simple: it gives us the opportunity to provide a more complete banking experience to our users, but it also opens up a lot of new revenue streams. We are funded by investors, and at some point we should hopefully be profitable; getting the banking license is part of that journey. The goal for this year is to get our new banking platform up and running, get all existing users migrated, and offer this complete banking solution across the Nordics: Denmark, Sweden, Norway, and later on Finland. Right now we're in a closed beta, with about 10-15 employees running on the new bank, myself included. I've been out in Amsterdam testing my new card, which has been extremely fun.

But what does it really mean to be a bank? In the simplest drawing we're in the middle, to the right we have national clearing, and to the left a card processor. The card processor makes it possible for our users to pay with a card, and the national clearing makes it possible to get money into or out of your account. Pretty simple; how hard can it be? Well, of course it is a lot more complicated than that, primarily because we're dealing with people's money, and where there's money there's a lot of regulation and also a lot of expectations. Ask yourself: what do you expect from your bank? I certainly expect my bank to be trustworthy; I also expect it to be secure; and I expect it to be correct. When we started this, a bit more than a year ago, we knew we were applying for this banking license, so we started slowly thinking about what our platform should be like. Some of us developers sat down, and we could all agree on those expectations. Then we dug a little deeper, and our conclusion was that beneath these words there's something else, or at least
we think there's another property which will enable us to get those characteristics. That single property ended up as our tech vision, and it is traceability at all levels: nothing should happen without us knowing, and our system should never be in a state we cannot explain. That is what we have seen happening in the current platform, and we never want to be in a situation where we cannot explain what happened. At the end of the talk I'll come back to how this is actually going to deliver on the other characteristics, but first let me dive a bit deeper into traceability at all levels.

In the era of microservices, Kubernetes and the like, traceability is normally understood as distributed tracing: what tools like OpenTracing, Jaeger and Zipkin can provide in the observability space. Of course that's really important, and we want it too. We want to be able to trace across service boundaries and across asynchronous boundaries, and to have distributed logging which we can manage and search through. But we want our system to provide a bit more. To us, traceability does not equal distributed tracing. We don't want this traceability to depend on developer diligence: having the right log statements, providing the right spans for the tracing, and so on. We want it to be intrinsic, something built into the system by design.

Let me take a little detour. There's a code snippet, with a bit of editing, that I've seen a few times in the code bases I've worked on, and more than once a support case has ended there, trying to figure out: how did that actually happen? I thought it was impossible. Head-scratching, searching of logs, trying to figure out what happened. Let me see how many others have been in that situation: about half-ish, OK, great. No matter what your answer had been, I have proof that I'm not the only one, because if you do a search on GitHub for the phrase "this should never happen" (have you tried to do that? I guess not; I have), the result is more than 500,000 code results. That's quite fun, I think, but it proves to me that I'm not the only one having that learning. And what can you learn from it? My learning is that production is the place where the impossible happens. If you can think of some very unlikely situation that could occur but probably won't, at some point in production it is going to happen. Even worse: things you think are really impossible will also happen at some point.

This is precisely what we mean by traceability. What we want is an intrinsic property which can explain cases like this, explain the impossible. No guesswork, no head-scratching, no searching through logs; just something built into the system which will tell us what happened. That's my definition of traceability. In the face of errors we don't want to have to tell our support staff "we really don't know what happened, and that's what you have to say to our users". That's not a good situation if you are a bank. We want to be able to tell our supporters: we know we have an error here, but you can say this to the user instead: something happened to your money, but we know exactly where it is, and we're going to fix it. That's where we want to go, and this leads us, of course, to event sourcing, otherwise I wouldn't be standing here, because we think that by its very definition event sourcing has some of the properties which will lead us there.
The very idea built into event sourcing supports this. Often event sourcing is thought of as just a different way of persisting state, but there's more to it than that: there's a shift of focus from the state to the events. Instead of asking what state an entity is in, you ask what has happened historically, and the state is then a byproduct of all those events. It's this very shift of focus which is exactly what we want in order to achieve inherent traceability, and it's the main reason we chose event sourcing: we want to shift the focus onto what has happened in the system instead of the actual state of an account or a user.

So we decided to try out event sourcing. We've already heard earlier today that event sourcing right now is in a place where we're all trying to figure out what it is, find the right words for things and agree on the terminology, so just to make sure we're all on the same page, here is what an event sourced component, an event sourced service in the Lunar setup, consists of. For some of you this will be pretty basic, but maybe someone in here can still take something away from it.

First of all, the fundamental thing is the event stream. We store it in some storage, and it has one important property: it's an ordered sequence of events representing what has happened in the system. We structure and organize our event streams around the obvious entities in the domain, so we could have an event stream for a user or for an account, but we also use event streams to represent the act of doing a transfer or the act of paying a payment slip. Those streams are very short-lived, whereas users and accounts are long-lived event streams.

The next thing is of course the projection, and as we just saw (for those of you who were here before) it's just a fold left over the events in the stream.

Then we have the concept of an aggregate root. The aggregate root is a well-defined entity in the domain. Technically it's made up of an event stream, where we store all the events happening to this aggregate root, a projection holding the aggregate root's current state, and some business logic. The business logic is the write API of the aggregate root, and it protects the integrity of the entity: the only way to add events to the underlying event stream is through this business logic, the command handlers of the aggregate root. They are pure functions, just as Jeremy showed before: a command handler has two inputs, the command and the current state, and it can do no side effects. It can only do one of two things: fail the command if something is not valid (you don't have money on your account, for example), or publish one or more events. Nothing else, no side effects, which makes this logic very easy to test.

But of course we also want to be able to do side effects, and the side-effecting things all live in the handlers, the event handlers. A handler gets an event from the event stream when a new event is added, and that is where we can do side-effecting behavior: calling some external system, publishing a message, or looping back into the system itself and executing a command on a different aggregate. That's basically what our event sourced services are made up of.
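To make that component model concrete, here is a minimal sketch in Go of the projection as a left fold and the command handler as a pure function. The names are invented for illustration; this is the shape of the idea, not the actual Lunar library API:

    package es

    // Event is anything that has happened to an aggregate.
    type Event interface{}

    // State is rebuilt by folding events over an initial value;
    // Apply must be free of side effects.
    type State interface {
        Apply(e Event) State
    }

    // Replay is the "fold left" over an ordered event stream: the
    // current state is just a byproduct of everything that happened.
    func Replay(initial State, stream []Event) State {
        state := initial
        for _, e := range stream {
            state = state.Apply(e)
        }
        return state
    }

    // A command handler is a pure function of (command, current state):
    // it either fails the command or returns new events to append to
    // the stream. No side effects, which makes it trivial to test.
    type CommandHandler func(cmd interface{}, state State) ([]Event, error)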
So what about the implementation? I probably didn't say it, but we've ended up being primarily a Go shop: we started our microservices in TypeScript with Node, but now everything new is written in Go. We looked around for event sourcing libraries in Go, and there are a few, but none of them are maintained by more than one or two developers, so we actually ended up writing our own library. Actually, we did it twice, because the first version wasn't really what we wanted it to be, so we did it over again. We are planning on open sourcing it; it's not completely ready for the rest of the world yet, but hopefully later this year we'll open source it. We made another choice of boring technology and stuck with Postgres: we're simply using Postgres as JSON storage. The content of an event is just a JSON blob, together with a sequence number, a timestamp and an ID of the event; that's everything we need. We're using in-memory views for the aggregate roots; SQL-backed views will come later, hopefully this month. So this is all very new. Everything in the rest of the talk is fresh out of development, and in two months we may have changed to something different.

Now, to be able to actually show some code: I had hoped that we would already have open-sourced Apogee at this point. Well, that didn't happen, so the only way I can show you code is on slides. This is taken from an example which is in the library itself: a to-do list, a very simple example where we have a to-do list and we can add items to it, check items and archive items. A bit more complicated than the light bulb, but not much.

The top part is the state, defining the current state of the aggregate root. It has only two properties: a timestamp for when the list was created, and a map of the items and each item's state. It contains just enough to determine, when we get a command, whether it is a valid action or not. Then there's an example of an event, just a struct: an item has been checked, it happened at some time, and there's the ID of the item. Below that is the application of an event to the state, which simply looks the item up in the map of items, checks it, and puts it back. Go, if you don't know it, is not good at immutable data, so you have to get used to this being mutable. It's not like the F# we saw before, which is what we'd really like, but you have to code around it in Go.

And here's a command handler. All the command handlers are defined as methods on the state; the parameter we get is the command, and we have the current state. The Apogee library takes care of loading the state when you call this, so you get the state at this point in time, and the sequence number that state corresponds to is encoded in the unit of work up there; the unit of work functions as the transaction in the event sourcing system. We do some checks: if it's an empty item ID, we fail; we look the item up in the current list of items, and if we can't find it we also fail, with a specific failure code; if the item is already checked, there's nothing to do, so we just return; otherwise we publish a new event, very much like what we saw in the talk before. That gives you an impression of what the library is, and hopefully later this year you can actually use it, if anyone is interested. That was the code I was going to show today; condensed, it has roughly the shape sketched below.
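A self-contained version of that to-do example might look like the following. All names are mine, since the library isn't public yet; the publish callback stands in for what the speaker describes as the unit of work:

    package todo

    import (
        "errors"
        "time"
    )

    // State of the to-do aggregate root: just enough to decide
    // whether a command is a valid action.
    type State struct {
        CreatedAt time.Time
        Items     map[string]Item
    }

    type Item struct{ Checked bool }

    // ItemChecked is an event: an item was checked at some time.
    type ItemChecked struct {
        At     time.Time
        ItemID string
    }

    // Applying an event mutates the state. Go has no cheap
    // immutability, so we look the item up, check it, put it back.
    func (s *State) ApplyItemChecked(e ItemChecked) {
        item := s.Items[e.ItemID]
        item.Checked = true
        s.Items[e.ItemID] = item
    }

    // CheckItem is a command handler defined as a method on the state.
    // It is pure: it validates against the current state and either
    // fails, does nothing, or publishes an event via the unit of work.
    func (s *State) CheckItem(itemID string, now time.Time, publish func(e interface{})) error {
        if itemID == "" {
            return errors.New("empty item id") // fail: invalid command
        }
        item, ok := s.Items[itemID]
        if !ok {
            return errors.New("unknown item") // fail with a specific failure code
        }
        if item.Checked {
            return nil // already checked: nothing to do
        }
        publish(ItemChecked{At: now, ItemID: itemID}) // the only way events get in
        return nil
    }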
OK. In order to make an event sourced application work, we need one crucial property: guaranteed event handling. The handlers are crucial; that's where the side effects live, the thing that makes the application tick. So we need to ensure that all handlers get all events, in the proper order, and with as low latency as possible. How can we implement that? First we thought we could probably do it inside the library: something like keeping track of how far each event stream has come and how far each handler has come, and then making sure the handlers always get all the events. But that would add complexity to the library, so we looked around for different ways of doing it. You could use a more generic outbox pattern, where events are put in an outbox and something else takes care of publishing them, with an inbox on the other side making sure the events actually get propagated to the handlers. But we'd really like a piece of technology we can more or less pull off the shelf, so last month we did a proof of concept using Debezium with Kafka. If you don't know Debezium, it's a CDC (change data capture) tool which tails the database and publishes all changes to a table onto Kafka. Right now this is how we think we're going to do guaranteed event handling. The table is very simple, and since we only append, the only type of change there will ever be is an insert of a new event, so the model is really, really simple: subscribe to the Kafka topics and make sure the handlers get these events. There is the question of ordering, but you can control the partition key based on what's in the table, and inside the same consumer group you then get the proper ordering. So that's how we think we're going to do it; a rough sketch of the setup follows.
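A sketch of that CDC relay, under assumptions the talk doesn't confirm: the table layout is mine, the consumer uses the segmentio/kafka-go client purely as an illustration, and the topic name follows Debezium's server.schema.table convention:

    package relay

    import (
        "context"

        "github.com/segmentio/kafka-go"
    )

    // The append-only table Debezium tails. Since we only ever insert,
    // the only change type on the CDC topic is an insert of a new event.
    const eventsDDL = `
    CREATE TABLE events (
        stream_id TEXT        NOT NULL,
        seq       BIGINT      NOT NULL,
        event_id  UUID        NOT NULL,
        at        TIMESTAMPTZ NOT NULL,
        payload   JSONB       NOT NULL,
        PRIMARY KEY (stream_id, seq)
    );`

    // Relay subscribes to the Debezium topic and hands every new event
    // to a handler. Partitioning on stream_id keeps per-aggregate
    // ordering within the consumer group; committing only after the
    // handler succeeds gives at-least-once delivery, so handlers must
    // be idempotent.
    func Relay(ctx context.Context, handle func(payload []byte) error) error {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"kafka:9092"},
            GroupID: "event-handlers",
            Topic:   "bankcore.public.events", // Debezium server.schema.table naming
        })
        defer r.Close()

        for {
            msg, err := r.FetchMessage(ctx)
            if err != nil {
                return err
            }
            if err := handle(msg.Value); err != nil {
                // Don't commit: after a restart the event is redelivered,
                // preserving "all handlers get all events".
                return err
            }
            if err := r.CommitMessages(ctx, msg); err != nil {
                return err
            }
        }
    }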
OK, that leads us to the patterns: patterns we've seen come out of our implementation. It has been very much a trial-and-error process; we've been implementing various parts of the new banking platform, and these patterns have emerged along the way.

The first pattern concerns public APIs. Think back to the very simple drawing of what a bank is: us in the middle, national clearing to the right. In reality that part of the drawing looks like this: we have the banking core, and then there's an adapter in front of the API up to the national clearing. That's typically some old piece of technology. Here in Denmark there's a bank central sitting on a mainframe, with a REST API put on top: it's REST, but with a lot of weird behavior coming from the actual mainframe, and it's a synchronous API. The model in our new banking core is that for each of the different national clearing systems we'll have an adapter, and this adapter communicates with the banking core asynchronously; in fact, the banking core knows nothing about this part. The adapter is the one executing things against the clearing, and also receiving things from the clearing when money goes into an account.

So how can we expose a public API between the banking core and the adapter? On the inside, in the banking core, we have the internal event stream, which has some really nice properties: it's an ordered sequence of events. We would really like to have those properties on the outside as well, but of course we shouldn't expose the internal event stream; that's our internal state. As was stressed in one of the other talks: never expose the internal event stream, it's your dirty underwear, right? The way we solve this is by producing external event streams based on the internal event stream. It's essentially a handler writing to an event stream, and we keep track of the source events on the inside in order to guarantee traceability: for any external event we publish, we can trace it back to the internal event that led to it.

Here's a small drawing: we have the internal event stream, a handler (and maybe a projection, if we need to keep track of some state in order to publish the external events), and on the right two event streams on the outside, because we have the possibility of exposing several derived event streams. One has fewer events than the inside, so it's a more focused event stream, and the other has more events: in this case there could be a requirement to invent a new type of event based on the internal ones. We actually have that. When you withdraw or do something on the account, we would like to publish a "balance changed" event. On the inside we only have transactions on the account, but on the outside we'd like a "balance changed" event for someone else to listen to, so this handler simply invents a new event based on the internal event stream. This also gives us a way to version our public API: you can add a new external event stream, consumers can switch over to it, and at some point you can deprecate the old version. This pattern of using external event streams for communication is how the banking core and the adapters talk in both directions: the adapter listens to the banking core, and the banking core listens to the adapter, all asynchronously. A sketch of such a deriving handler follows.
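A minimal version of that deriving handler, with invented names; the "balance changed" invention mirrors the example from the talk, and the source event ID carries the traceability link:

    package external

    // Internal event: a transaction was created on an account.
    type TransactionCreated struct {
        EventID   string // internal event ID: the traceability anchor
        AccountID string
        Amount    int64 // minor units
    }

    // Derived external event: outside consumers only need to know the
    // balance moved. SourceEventID traces it back to the internal
    // event that caused it.
    type BalanceChanged struct {
        SourceEventID string
        AccountID     string
        NewBalance    int64
    }

    // Projector is the handler producing the external stream. It keeps
    // just enough state (the running balance) to be able to invent the
    // external event from the internal one.
    type Projector struct {
        balances map[string]int64
        publish  func(BalanceChanged) error // appends to the external stream
    }

    func NewProjector(publish func(BalanceChanged) error) *Projector {
        return &Projector{balances: map[string]int64{}, publish: publish}
    }

    // Handle is an ordinary event handler on the internal stream; its
    // only side effect is appending to the external stream. Versioning
    // the public API is then just running a second projector like this.
    func (p *Projector) Handle(e TransactionCreated) error {
        p.balances[e.AccountID] += e.Amount
        return p.publish(BalanceChanged{
            SourceEventID: e.EventID,
            AccountID:     e.AccountID,
            NewBalance:    p.balances[e.AccountID],
        })
    }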
The next pattern is distributed flows: distributed transactions, or sagas. We represent them as aggregate roots. Here the aggregate root is used to represent a state machine, where the state records how far we have come in the process of executing some distributed flow. The state of the aggregate root determines which action the handlers should take; all side effects are, of course, in the handlers; and as I said, we use the public event streams as the API between the parts.

Here's a quite elaborate example, and this is actually how we have implemented transfers. On the left we have the banking core, and on the right the adapter, in this case the Danish one. Inside the banking core we have the account, a long-lived aggregate root, and up here a transfer aggregate root, representing the act of doing this transfer. Likewise, in the adapter there's a transfer aggregate root representing its side of the distributed flow. A request comes in from the user and is validated on the account; let's say it's valid. That publishes two events on the account aggregate: "transfer initiated" and "funds reserved". We do reserve the money you want to pull out of the account, so from that point on you cannot use it for anything else. Up here we have a handler which gets this event and executes a command on the transfer aggregate to initiate the transfer, so now we get an "initiated" event on the transfer aggregate inside the banking core. Another handler grabs this event and publishes an external event on the external event stream. The external streams aren't included in the drawing, in order not to clutter it, but in between here there are two of them: one from the banking core and one from the adapter. So after step three we get no more events on the inside, but we have an event on the outside.

Then, in the adapter, a consumer gets this event and executes a command on the transfer aggregate inside the adapter, so now we have it "initiated" there as well. A handler gets this event, and based on the state of the transfer aggregate it does whatever it must do in that state, which is of course to attempt the transfer in the clearing. Let's say that times out. The request timed out, so we really don't know what happened, but we store that result as a "timed out" event on the transfer aggregate. That event reaches the handler again, and the behavior when the state is "timed out" is of course to retry, so the action is retried. Let's say it succeeds this time: a "succeeded" event is added to the event stream, and now we have to go all the way back. A handler publishes a "succeeded" event on the outside; the consumer in the banking core consumes it, causing a "succeeded" event on the transfer aggregate inside the banking core; and finally a handler handles that event and executes the "complete transfer" command on the account aggregate. This gives rise to a "transfer completed" event; the reservation is removed, because the money shouldn't be reserved anymore, and instead there's a "transaction created" event. This then propagates up to, outside of the picture, a transfer service actually handling the request from the app, which gets that event, and the action is then completed in the app. The last slide is only there for completeness: it has all the steps numbered and all the events.

Some key takeaways. Again, guaranteed event handling is extremely important: if that fails, the process stops, so down the line we have to have metrics on it and be alerted if something goes wrong. But we get maximum traceability: everything that goes on as part of the transfer is recorded as events in the various aggregates taking part in the flow. And of course errors are going to happen at some point (remember production), but we can see those errors directly in the aggregates.

That leads me to the last thing I'm going to talk about in our implementation: error handling. Failures are first-class domain citizens; failures are not exceptional. We strive to narrow down the class of errors which must be treated as exceptional. Just as in some languages an error is just another value, another type, nothing exceptional (in Go, for example, there are no exceptions; you just have an error as a value), we try to treat as many failures as possible as part of the domain. They're something you should expect to happen, so your domain should treat the errors you expect as first-class citizens.

Furthermore, idempotency is crucial, both in our own internal APIs and in the external ones. The transfer service I mentioned has to be able to retry a request towards the banking core if we hit a timeout inside our own platform: the only thing it can do is try again, resting assured that if the first attempt actually went through, it will just get the same result back and the transfer won't be done twice. But we also depend very much on the external clearing API being idempotent. If we didn't have that, we would be on really thin ice when we get timeouts or crashes, because we'd have no way of knowing whether the transfer actually occurred. Fortunately, the clearing system guarantees it for us; if it didn't, we would have to figure out how, after a crash or a timeout, to use some read endpoint to check whether the request actually succeeded, and then take action based on that.
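A sketch of what the adapter's "timed out" handler effectively relies on, assuming (as the talk says the clearing guarantees) that executing the same transfer ID twice is safe. The interface and names are illustrative, not the real clearing API:

    package adapter

    import (
        "context"
        "errors"
        "time"
    )

    // ErrTimeout is the result we genuinely cannot interpret: the
    // clearing may or may not have executed the transfer.
    var ErrTimeout = errors.New("clearing request timed out")

    // Clearing stands in for the national clearing API. Because the
    // real API is idempotent, retrying with the same transfer ID is
    // safe: a transfer that already went through returns the same
    // result instead of moving the money twice.
    type Clearing interface {
        ExecuteTransfer(ctx context.Context, transferID string, amountMinor int64) error
    }

    // executeWithRetry mirrors the saga behavior: a timeout becomes a
    // first-class domain outcome (a "timed out" event on the transfer
    // aggregate), and the handler's reaction to that state is to retry.
    func executeWithRetry(ctx context.Context, c Clearing, transferID string, amount int64) error {
        for {
            err := c.ExecuteTransfer(ctx, transferID, amount)
            if err == nil {
                return nil // record a "succeeded" event on the transfer aggregate
            }
            if !errors.Is(err, ErrTimeout) {
                return err // a real failure: record it as a domain event too
            }
            // record a "timed out" event, then retry after a backoff
            select {
            case <-time.After(2 * time.Second):
            case <-ctx.Done():
                return ctx.Err()
            }
        }
    }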
OK, let me go back to the tech vision and revisit the other properties: correctness and consistency.

Correctness: I hope you'll agree with me that 100% correctness is impossible. So what can we do? Our best: test our software as much as possible. But even though we test, we will make mistakes. What we think is this: if you can understand your errors, then you can also fix them; if you don't understand what has been going on, the chance of correcting it is a lot smaller. That's how we think this inherent traceability will enable us to eventually get to a correct system, one error at a time. If we can understand the errors when they happen, we can fix them in the system, fix the data, do whatever is needed, and then go back to our code, add a test, and fix the code. That's what we mean by traceability leading us to correctness. Furthermore, in event sourcing, error correction is no different from what you do all the time: the only way you can fix an error in an event sourced system is by putting another event on the event stream. Compare that with a CRUD service, where fixing something in the state means doing an update. In event sourcing we can always go back: if the correcting event you put on the stream turns out not to be completely correct, you can just add another one. That isn't the case in a CRUD service, where you might not be able to go the other way after you've done the update.

And consistency: this is where the ordering of the event streams is important. We think that by exposing external event streams which have the same properties as the internal ones, an ordered sequence of events, we empower the consumers, because now they can reason about the events they receive. If a consumer receives event number 10 and the last one it saw was number 8, then something is missing. And if that occurs, we could also empower the consumer to retrieve the missing events: if we put an API on top of the producer, or wherever we have the events lying, consumers could fetch whatever events they haven't seen yet. Of course, the only thing we can guarantee is eventual consistency, but that is certainly a lot better than what we have in the old platform, which is "maybe consistent". In the old platform the consumers have no means of reasoning about the messages they receive, because we don't have this ordering of events. A sketch of that consumer-side reasoning is below.
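A small illustration of gap detection on an ordered external stream; the fetch function stands in for the hypothetical read API on the producer side that the talk floats as a possibility:

    package consumer

    import "fmt"

    type ExternalEvent struct {
        Seq     uint64
        Payload []byte
    }

    // Consumer tracks the last sequence number it has seen. Because the
    // external stream is an ordered sequence, receiving event 10 when
    // the last one was 8 proves that 9 is missing: something a broker
    // without ordering can never tell you.
    type Consumer struct {
        lastSeq uint64
        fetch   func(from, to uint64) ([]ExternalEvent, error) // producer's read API
        apply   func(ExternalEvent) error
    }

    func (c *Consumer) On(e ExternalEvent) error {
        if e.Seq <= c.lastSeq {
            return nil // duplicate: already handled, idempotency saves us
        }
        if e.Seq > c.lastSeq+1 {
            // Gap detected: backfill the events we haven't seen yet.
            missing, err := c.fetch(c.lastSeq+1, e.Seq-1)
            if err != nil {
                return fmt.Errorf("gap detected but backfill failed: %w", err)
            }
            for _, m := range missing {
                if err := c.apply(m); err != nil {
                    return err
                }
                c.lastSeq = m.Seq
            }
        }
        if err := c.apply(e); err != nil {
            return err
        }
        c.lastSeq = e.Seq
        return nil
    }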
OK, I'm nearing the end of the talk. What's left are some of the challenges and learnings we've had during our implementation. First, and not very surprisingly: event sourcing is a perfect fit with DDD. Together with the event sourcing part we also started doing DDD in a more formalized way, and the two fit nicely together. But it's a different mindset when you're coming from CRUD services; you really have to wrap your head around this way of thinking. We didn't do DDD in the CRUD world in any formalized way; we thought about the domains made up by the microservices, but it wasn't really formalized. In the implementation of the banking core we've tried to also adopt some of the practices from DDD, and especially in the beginning this can cause some mental overload: you have to implement a certain thing, you have really no idea how to do it with the toolset you have available, and you have to do a lot of searching and figuring out how to actually implement it. So you certainly have to leave room for experiments and failures; that's how we learn: try things out, fail, and try again.

Furthermore, at least up until now, we really think that event sourcing delivers on this promise of traceability. We went live with the closed beta about a month ago, so since then we have actually been using the system for real, and it's such a relief to be able to look at the event stream and see: that happened, then that happened; ah, that's what went wrong. But (and there's a "but" on all the slides here) it's not necessarily mainstream tech, depending on your stack. In our case, in Go, it certainly isn't mainstream; in the .NET world it's a lot better. And the part about guaranteed event handling is a challenge you have to figure out how to achieve.

Finally, we've learned that there's tremendous power in these immutable event streams combined with CQRS, but immutable data can also be a challenge in itself. What about GDPR? Fortunately we're in the financial space, so we're actually allowed to keep a lot more data, for longer periods of time, than anyone else, which is nice, but we eventually have to figure out how to handle that when the streams are immutable. At some point we'll have to look at migrations, migrating event streams, and that could also be a way of dealing with GDPR. And we have already done compensating events; that's the only way you can do corrections if something went wrong, so you have to implement some tools or handles to be able to put correcting events on top of your event streams.

And with that I'm actually done. Here are a few links to our tech site: we have a blog where we try to write about what we do, and a collection of talks. I have some colleagues in the DevOps team who are really active in the community; they've been out speaking at KubeCon and other conferences, and we have some open source tools, DevOps-focused right now, but that's also where Apogee will eventually have a life of its own. So take a look. My slides will be up there, if not later today then at least tomorrow. Thank you.
Info
Channel: Domain-Driven Design Europe
Keywords: ddd, dddeu, ddd europe, domain-driven design, software, software architecture, cqrs, event sourcing, modelling, microservices, messaging, software design, design patterns, sociotechnical
Id: Q-RGrWTN5M4
Length: 55min 0sec (3300 seconds)
Published: Fri Oct 02 2020