ElixirConf 2019 - Elixir + CQRS - Architecting for Availability, Operability, and... - Jon Grieman

Captions
[Music] Okay folks, thanks for coming out. My name is Jon Grieman and I'm a senior software developer with PagerDuty. PagerDuty, for anyone who's not familiar, is the leading digital operations platform, letting people know when there's a digital operations incident. Sometimes that's the opportunity at 2:00 a.m. to learn that someone took the phrase "let it crash" a little too literally, or that your hotel's on fire, who knows. Quick show of hands: who is familiar with the product, as a user or past user, or has a bit of an idea of it already? Okay, a fair bit, fair bit, awesome. We page people, bring them into incidents that need resolution, reduce the time to recovery, help get them back to their systems, and help them learn from it to reduce the time to detect and to recover the next time it happens; we give people the tools to more effectively deal with incidents. For myself personally, I've been with PagerDuty about three and a half years, and I've been slinging Elixir there for just about that entire time. I was on the first team at PagerDuty to select Elixir for a production system, and I'm still finding new things to be impressed with in the language, the ecosystem, and the community, so thank you all. I've been coming to ElixirConf since 2016 and I still find new stuff.

Today's talk is going to be the story of a service, one we've had in production for about three years now. Mostly we'll be talking about high-level design; there aren't code slides in this presentation, this is more about the architecture. We'll talk about the service design, the ways in which Elixir was a very good fit for it, how we leveraged the architecture for operability and maintainability down the line, and we might even touch on some other benefits that aren't buzzwords ending in "ility" if there's time. This is the story of one particular service and the lessons we took away from it; I hope you can get something from it too.

So the service we're talking about powers log entries. It is called the log entry service; we are very creative people when it comes to our service naming. Log entries are a timeline of actions on an incident as people interact with it: everything from an incident triggering and getting opened, phoning someone, someone acknowledging, escalating because no one handled it, bringing in someone else, linking another incident, on and on. They also drive a lot of dashboard context about what's going on within your account, giving you some insight into what other people might be dealing with, and they power a lot of our post-incident analytics. Originally they were implemented in our Ruby on Rails monolith, which is kind of the core of the product. We've been pulling that apart for quite a few years now, starting with the original key bit of notifying users. Log entries became a focus for those efforts because they were growing very quickly: users were taking more and more actions within the system as we added new capabilities, and the pace of growth was very different from things like configuration data, phone numbers, schedules and the like, which are much slower to grow. So they became a focus for the efforts to move things out of the monolith. Getting people context is very important to us for what we do as a business, so these have to be very durable objects, even in the face of large-scale failures of entire regions and the like.

So, some design decisions for the new system. We decided that we were going to decompose the system into separate writing (or creation) components and reading (or querying) components. These are an audit log, so outside of a rare scenario, like a customer sending inappropriate private data that has to be redacted or something like that, the only operation that modifies log entries is creating one. You can't overwrite an existing one, you can't change it, you can't delete one: they are an immutable log. We decided we would make creation asynchronous: we would use publishing to Kafka from upstream systems as our creation interface, instead of a synchronous call into the system. The creator component reads from Kafka and writes into a relational database for querying. The querier component is a pretty simple Phoenix app that presents an HTTP endpoint to internal services and our public API, and talks to the same database.

Asynchronous writes can be complicated to introduce to a system that was previously synchronous, but it pays off in flexibility and isolation of failure domains. A failure in consumption and processing of a log entry no longer has a way to push back upstream and affect the action being taken. For a secondary system like a timeline, that's a very nice property: it decouples your risk from the availability of the system as a whole. Having log entries, having timelines, is important, and having them be consistent is important, but it's not as urgently important as the actual actions, like phoning someone to tell them their thing is down.

A couple of things we did to get that synchronous behavior out, to make it possible to go asynchronous. First, any code that's expecting things to be available immediately is going to have to change. This isn't too deep, but it's very easy to accidentally end up with this code, especially working in Rails and ActiveRecord; it's very easy to have that assumption subtly in there, and you're going to have to remove it. I'm not going to go too deep on that one; it's going to be very specific to your use case, but it's also going to be fairly clear. A technique we used was moving ID generation for the log entries upstream into the publishing systems, instead of at creation time when writing into the database. We found areas to use progressive degradation in the system to keep our availability high during the move, and we had to make some changes to our monitoring.

So, upstream ID generation. Originally log entry IDs were an auto-increment primary key: very straightforward, very simple, numerically increasing IDs. When we went asynchronous, we moved ID generation up into the publishing system instead, so when an upstream system writes a request to the queue it can give it an ID and hold on to a reference to it, knowing that the log entry will eventually exist even if it doesn't exist yet; by contract it will. What we use for IDs is what we call composite temporal IDs: the current timestamp, a precise timestamp, plus random entropy, a couple of random bits, to deal with collisions. The math here is called the birthday problem, or the birthday paradox: when you're choosing a certain number of random numbers, what is the likelihood of choosing the same number twice? The term comes from the likelihood of finding two people in a room who share a birthday as the number of people in the room increases. So with the math on the slide here, if we know we have a certain rate of creation within one time slice, and we choose an astronomically small acceptable likelihood of collision, we can solve for m and see how many bits of randomness we need to make ever seeing a collision statistically unlikely.
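The slide with the math isn't reproduced here, but as a rough sketch of the idea, an ID generator along these lines could pack a millisecond timestamp into the high bits and random entropy into the low bits. The module name, the 64-bit layout, and the field widths below are illustrative assumptions, not PagerDuty's actual scheme.

```elixir
defmodule CompositeTemporalId do
  @moduledoc """
  Illustrative sketch only: a composite temporal ID built from a precise
  timestamp in the high bits plus random entropy in the low bits. The layout
  and field widths are assumptions, not the scheme described in the talk.
  """
  import Bitwise

  # Millisecond timestamp in the high bits, 16 random bits in the low bits.
  @random_bits 16

  @doc "Generates an integer ID that sorts roughly by creation time."
  def generate do
    timestamp = System.system_time(:millisecond)
    entropy = :rand.uniform(1 <<< @random_bits) - 1
    (timestamp <<< @random_bits) ||| entropy
  end

  @doc """
  Birthday-problem approximation: with n IDs created in one time slice and a
  target collision probability p, P(collision) is roughly n^2 / 2^(m + 1),
  so you need about log2(n^2 / (2 * p)) bits of randomness.
  """
  def bits_needed(n, p) do
    :math.log2(n * n / (2 * p)) |> Float.ceil() |> trunc()
  end
end
```

With a scheme like this, IDs generated on different boxes in the same millisecond differ only in their random bits, which matches the talk's point that uniqueness matters more than exact ordering at that resolution.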
There are some useful properties to these IDs. First, they are still orderable: you're going to see reasonably the same order with these as you would with the original auto-increment primary keys. You're still, at least statistically, going to have uniqueness. In reality two log entries can come in at the same time, but the only way you end up with the same timestamp is by hitting different boxes, and I would not trust your timekeeping between boxes, especially in the cloud, beyond the precision you have; you have way more precision than you have accuracy. So it's more important that they be unique than that they be precisely ordered when the timing is virtually instantaneous. This also lets us do all of that while still not needing synchronization, or a write to a database to take an auto-increment key at creation time. So it lets us have that asynchronous system and its benefits without having to pay some of the costs. It also made the migration easier, because the downstream systems could keep their assumptions about what the keys would look like; they didn't have to change them as part of the move.

There's this paradoxical thing that can happen when you decouple a system and move the pieces apart: your availability as a whole can decline. Let's imagine you have one system that's managing both incidents and timeline entries. If the database that backs those goes down, you've lost them both, but the one that's urgent, the incidents, is down. If you move the two apart but still have a hard dependency on both systems, and the rate of failure of either system stays the same as before, your availability as a whole has declined. Now, hopefully, by isolating and separating them you're going to have an easier time improving them in the long run, but at the moment you separate them you can see your availability decline, and that's never good. So finding ways of breaking those hard linkages is important. Progressive degradation is an approach for doing this, where you take data that you consider secondary and decide that if you aren't able to get it, you still have reasonable behavior you can proceed with; you don't need to block processing waiting on it. We would prefer to show an incident without its timeline than not to show the incident at all. That let us keep availability up while we moved things apart, and then we were able to improve them.
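As a rough illustration of that progressive degradation idea, a read path might fall back to showing an incident without its timeline when the timeline lookup fails. The module and function names here are hypothetical, not PagerDuty's code.

```elixir
defmodule IncidentView do
  @moduledoc """
  Hypothetical sketch of progressive degradation: the incident is primary
  data, its timeline is secondary. `Incidents` and `LogEntries` are assumed
  client modules for illustration only.
  """

  def render(incident_id) do
    # Primary data: without the incident there is nothing useful to show.
    {:ok, incident} = Incidents.fetch(incident_id)

    # Secondary data: if the timeline lookup fails, degrade to an empty
    # timeline rather than blocking or failing the whole request.
    timeline =
      case LogEntries.fetch_timeline(incident_id) do
        {:ok, entries} -> entries
        {:error, _reason} -> []
      end

    %{incident: incident, timeline: timeline}
  end
end
```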
Finally, when you move to a queue-based system, it's important to start monitoring your queue very closely. Failure to keep up with processing is going to be visible as degradation or downtime to your users. You need to know if things start backing up and falling behind, and page someone; there are other paging tools, and I have some salespeople who would love to talk to you, come see me later. This also goes for being proactive about your capacity management and your ability to catch up. Your ability to recover from an incident that saw you stop processing is gated on processing all of the things that came in while you were down: you accepted those writes, you have to process them. Having better-than-real-time throughput capacity, or the ability to create additional throughput capacity on demand, is really valuable for dealing with this.

Okay, so we've pulled the upstream system apart enough that we're able to extract our asynchronous system. What does the new one look like? Well, we mentioned earlier that creating and querying were going to be distinct, so this is an example of CQRS: command query responsibility segregation, or separation, depending on who you talk to. Martin Fowler has a fairly concise definition of it that I like. The core idea is that we're trying to separate code that changes state from code that only evaluates state. It's very common in event sourcing designs. It simplifies the adoption of variable concurrency, and it can simplify your code, because you can use a different model for what your data looks like at creation time than at query time, and apply different models where they're appropriate, rather than forcing everything through a single model. We took this concept and ran with it: we deployed the two as entirely separate services.

CQRS turned out to be a very nice fit with Elixir for our application. We have a single codebase, a single umbrella app, that contains both shared components, like the Ecto repo, the Ecto query models and some additional shared business logic, and separate things, like the Kafka-consuming creator logic or the Phoenix app that serves the queries. Distillery makes it very straightforward to do separate releases with different configurations; the extra effort is pretty minimal. We haven't moved to 1.9-style releases yet, but we're not anticipating issues there. A cool side effect is that Elixir makes it easy enough for other teams to run both services as one coupled system, using GenServer calls to pass things into the creator instead of going via Kafka, as a lighter-weight development environment for people on other teams who just need to test the results of their own systems. Previously, with some of our microservices in heavier-weight languages and ecosystems, let's say some other VM, I don't know what it could possibly be, we had to develop in-memory stubs of our applications in order to let developers run enough of a suite to validate their behavior. With Elixir being lightweight enough, we didn't have to, and that was a big saving on implementation cost.

The separation of the two applications was useful for a few things. One of them was scaling. Like many, many things, log entries are read far more often than they are written: a given log entry is read multiple times. Decoupling into separate services let us scale the two very easily and very independently; we didn't have to add write capacity just because our query load had gone up or we were doing more complicated queries. It also simplifies a lot of your options for scaling out reads in a relational database environment. There's a very conventional approach of adding async read replicas of your database. If you've decoupled your services and this one will only ever query and read, you don't need code to handle "this write needs to go to the repo that manages the writer, whereas this one can go to the reader." You can still have the same code that has only one repo and knows about only one thing, and know that, because it is a querier node, it can be safely pointed at a read replica without any change in behavior.
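A minimal sketch of how a split like the one described above might be laid out and configured, under assumed app, module, and host names: the point is that the querier release only ever reads, so its single repo can be pointed straight at a read replica, while the creator release points the same repo at the writer.

```elixir
# Hypothetical umbrella layout (names are illustrative, not PagerDuty's):
#
#   apps/log_entry_core     - shared Ecto repo, schemas, business logic
#   apps/log_entry_creator  - Kafka consumer that creates log entries
#   apps/log_entry_querier  - Phoenix app exposing the HTTP query endpoint
#
# Each release gets its own config, so the querier's single repo can simply
# point at a read replica; no read/write routing code is needed because this
# release never writes.
import Config

config :log_entry_core, LogEntryCore.Repo,
  hostname: "log-entries-read-replica.internal",
  database: "log_entries",
  pool_size: 20

# The creator release configures the very same repo module, but pointed at
# the writer, e.g. hostname: "log-entries-writer.internal".
```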
Another benefit of the separation is monitoring the different types of nodes. The querier and the creator have very different load profiles: their work functions are going to be very different, and what that looks like in terms of bottlenecks and resources is going to be very different too. By segregating them we can put much tighter monitoring bounds on each, spot trends, and react quicker, before it impacts users, and you can couple that with extra information from your OS- and hardware-level metrics, because there's less noise around the signal.

It's probably apparent from what I've said so far, but to make it explicit: we don't run the two as part of the same Distributed Erlang/Elixir cluster; they're entirely separate in production. It would certainly be possible to implement something very similar where nodes take on their role at runtime within a cluster; I'd characterize that as the OTP way of solving this problem. In our case we leverage the container orchestration layer to handle handing out the roles and ensuring we have the right number running; it was just a better fit for us, and we'll get into why in a little bit. There are a lot of these OTP patterns that come with certain assumptions about the world they live in, and many of them are really great for improving your availability, but it's always worth asking whether it's the right option for your system and whether the costs it comes with are appropriate. They're tools: use them when they're the right tool for the job.

Okay, so here's a simplified diagram of the system as it looks so far. Let's get to the fun part: let's make it run in multiple regions. The first really good question is why you would want to run in multiple regions. The answer for us is fairly simple: availability. We're also multi-provider, for the same reason. We don't consider AWS-style availability-zone-level separation sufficient isolation; there have historically been too many instances where an incident spanned an entire region's availability zones for us to really count on that. So running across the WAN in multiple regions is just safer for us, for our customers, for our product. A major infrastructure incident can be a driver of very heavy load, so being able to handle that, being resilient to that, is very important to us.

I mentioned BEAM-level clustering not being a great fit for this app, and here's another reason: standard Distributed Erlang/Elixir doesn't handle clustering across high-latency, or wildly varying latency, WAN links very well. There's some super cool work in this area layering on top of the simple clustering. Lasp is a super cool project; I love their tagline, a suite of libraries aimed at providing a comprehensive programming system for planetary scale — whoever cued the lights there, thank you, that was amazing — planetary-scale Elixir and Erlang applications. I love these folks, they're doing super cool stuff. They have a component called Partisan that deals with cluster membership over the WAN. It wasn't really on our radar three years ago, and I haven't heard a lot from people running it in production at scale, but there are some things there. Maxim Fedorov from WhatsApp had a great talk a year or two ago, you can find the video, about their work scaling a cluster to ten thousand nodes and the issues they ran into; you'll see some of the same things pop up. Without these approaches, you've brought a lot of complexity with you by choosing to run as one clustered system; you've acquired a lot of extra ways for your system to fail, and that's not necessarily the choice you want to be making. So, back to the earlier point about one cluster not really being appropriate for us.

So let's look at options for doing the replication. Fundamentally, a given write is going to originate on one machine in one region; we haven't figured out how to put the same hardware in two places at once, at least not while it's working. Once the timeline entry is written, we want it to end up in multiple regions, and to survive even after one of those regions goes away. That gives us a few places where we could spread the data to multiple regions.

First, we could replicate things at the end, at the datastore level. I'd characterize this as the most traditional approach; disaster-recovery-style systems have been doing this for decades, and the tooling for asynchronous database replication is very mature: MySQL, Postgres, any relational datastore is going to have a way to do this. The data will survive, but we need to be able to read it too, so we put some querier nodes there as well. There's some difficulty with creation, though: if the first region goes down, we have to do manual work to point things at our second region. So that becomes a system we have to develop and a capability we have to maintain, and it also becomes a human reaction, because we don't have those systems running all the time. Also, if our first region doesn't fail outright and cleanly, but is having some sort of split-brain network issue where we can't cleanly move traffic off of it, you're going to have a lot of fun with database consistency and reconciling things later: some writes went to one side, who's a replica of whom, and oh god. So there are some downsides there.

What about the other end of things, the write side? Instead of writing to one queue, we could write to all of the queues. The clear downside is that the synchronous action now has to do more work, and some of those calls are going to be across the WAN with higher latency. But the real kicker is what happens if the write to the queue in one region fails while the others succeed. You don't have simple transactional rollback across multiple clusters in multiple regions, so you have to introduce two-phase commit schemes, something like what's on screen here, in order to roll back your writes when they fail in some places. That's additional complexity in your system, an additional thing to go wrong, additional work you have to do, and an additional place for your system to fail. Not ideal.

Okay, what about the middle? We'll go with the Goldilocks approach: the middle will be just right. For the log entry service we stole a page from the event sourcing folks' playbook. We realized that with two nicely segregated systems, our datastore is strictly a function of the command stream fed to it. Multiple entirely separate stacks, as long as we feed them the same input, will all reach a valid state. There might be some difference in speed, but if your capacity planning is reasonable, in practice you won't see it. This takes advantage of a neat property of Kafka: it's not strictly a queue — it's not just a queue, someone claimed it's not even a queue — it doesn't really do pop or those traditional queue-like things. It's a distributed log, and you can process that log multiple times with multiple consumers. So each region gets a full, entirely separate log entry service stack — its own creator, its own querier, its own databases serving local requests — and they're all fed the same commands, so they're all going to reach the same state.

What's really important to get right here is matching your progress between the datastore and the queue, in Kafka. What happens if you restart the service? You can't resume handling at the front of the log, or you'll miss things that went by while you were down. Kafka has internal support for tracking consumer progress, called consumer groups, that lets you resume where you left off. That handles restarting your service: you'll come back and start reading where you stopped. What it doesn't handle is having to restore your datastore from a backup. If our system fails — the one at the top right on the slide — and we have to roll it back to a backup made earlier in the day — the one at the top left, the stuff highlighted in blue — then if we were using consumer groups we would come back and resume consumption where we left off, and we would lose those entries, those creation requests. We deal with that in a pretty straightforward way: we track our own progress in the database. The queue is a feed of commands, but the datastore knows which commands were used to reach its current state, so if we ever restore and roll back, we're able to resume consumption from where that backup actually was. The clusters are independent, so this is always correct for the cluster the backup was made from: you can always resume a cluster where it left off.
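The talk doesn't show the mechanics, but one plausible way to track progress in the datastore itself is to persist the consumed Kafka offset in the same transaction as the log entry write, and read it back on startup. A sketch under assumed schema and module names:

```elixir
defmodule Creator.Consumer do
  @moduledoc """
  Sketch of tracking consumer progress in the datastore. `LogEntry`,
  `ConsumerOffset` and `Repo` are assumed schema/repo modules for
  illustration; this is not PagerDuty's implementation.
  """
  alias Ecto.Multi

  # Persist the log entry and the Kafka offset it came from in a single
  # transaction, so a restored backup is always consistent with the offsets
  # recorded inside it.
  def handle_message(partition, offset, attrs) do
    Multi.new()
    |> Multi.insert(:log_entry, LogEntry.changeset(%LogEntry{}, attrs))
    |> Multi.insert_or_update(
      :progress,
      ConsumerOffset.changeset(get_or_new(partition), %{last_offset: offset})
    )
    |> Repo.transaction()
  end

  # On startup, resume from the offset recorded in the database rather than
  # from Kafka's consumer-group position.
  def resume_offset(partition) do
    case Repo.get_by(ConsumerOffset, partition: partition) do
      nil -> :earliest
      %ConsumerOffset{last_offset: last} -> last + 1
    end
  end

  defp get_or_new(partition) do
    Repo.get_by(ConsumerOffset, partition: partition) ||
      %ConsumerOffset{partition: partition}
  end
end
```

Because the offset and the row are committed together, restoring a database backup automatically rewinds the recorded offsets to match, which is the property the talk relies on.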
So that was the original design of the log entry service: two region-independent clusters sharing an upstream published queue of creation requests, reaching the same end state. It went to production close to three years ago, and since then it's worked out pretty well, serving many times the data it originally had on a mostly unchanged architecture. For the rest of the talk we're going to get into how we leveraged that architecture for maintenance and operations as we went on, and how it helped us out one of the times when things didn't go so smoothly.

Okay, an incident, from about two years ago — Christmas Eve, actually; Murphy's Law knows our number for incident timing, as it turns out. We had an issue with log entry creation, something about Kafka message set sizes. I don't really want to get into the weeds of what happened and turn this into a Kafka talk, but to give a high-level view: if a single read from Kafka had too many messages, we'd get more than we expected, we wouldn't be able to process them all as a batch before it timed out, and we wouldn't make forward progress. Our system got far enough behind that it wouldn't come back up cleanly. For this incident, an upstream system had published a corrupted request — oops. By the time we were paged, and had manually dealt with and massaged that creation request enough to get it through, a backlog had built up, and we hit this problem in production for the first time. We made some temporary workarounds to get our system back up, brought things up in stages, and got past the problem, but ultimately we had to come up with some way to reproduce it properly and prove out a fix.

This is the kind of issue that's fiendishly hard to discover in your test environments without real traffic. Simulating full-scale production load profiles, accurately, as your load changes and under unusual situations like start and stop conditions, is not simple; it's virtually impossible to categorically rule out issues. We always assume that our systems are going to behave linearly, and that's what we had tested before. We were responsible developers: we had tested that we could recover, and it behaved the way we expected — double the amount of things that build up and it takes twice as long to clear them. So we assumed that as we kept going it would just keep getting a little bit slower, and it did, right up until the point where it didn't. That's the nonlinear point: systems can exhibit nonlinear behavior, so our processing appeared fine until it didn't. Ultimately there's only one test environment that matters — we've all got that one — so you might as well test in it too. This is the bait slide, by the way, for anyone who wants to take pictures and tempt people into watching the talk on YouTube.

So how do we test the fix for this? Here came the moment of realization: because we had these nicely isolated, segregated systems, we could run an extra full stack of the log entry service, using the new code, in parallel with the existing ones —
a second stack in one region, with no interference; you choose a region and that region gets two stacks. To support this we had some existing tooling for feature rollout that we could leverage to direct traffic between the two very quickly, which let us react fast if there was any user-facing issue: we could direct traffic away from the new stack, or direct all traffic to the old cluster, bring the new one down entirely, start it back up, and validate that we saw the behavior we wanted. And we did: we tested it under load in our staging environment, no problems, it came back up; we were able to reproduce the bug and see that our fix worked.

Unfortunately, when we tried it in production, we saw this. The axis labels have fallen off — I think someone in marketing cut them off and keeps them on their wall as trophies — but I think you can still get the idea. We reproduced the issue, all traffic had stopped, we deployed our fix, and we didn't see the speedy recovery we wanted. The traffic in production was a little bit different, structurally just a little distinct from what we had expected and created in staging. Had we tried this for real, or deployed the fix without trying it in production, the next time this happened there would have been another outage, another incident. But because we had this isolated, safe system — production data, but not user impacting — we could see that we needed to tune it further, and we did. You can see on the right that the rate picks up quite a lot, because we tuned our batch sizes a bit more and found the right spot. So, going on after benchmarking, we fed data in and could be confident that we had meaningfully reproduced the scenario, in case it ever happens again in production.

This became an approach we use operationally, repeatedly, to reduce the risk of doing maintenance operations on our systems. It lets us make changes without risk to the customer. Tim, our SVP of engineering, has a favorite saying about maintenance windows, or anything else that would require you to take your systems offline. For us, being there for our users when they might be down is crucially important — this is our business — so finding ways to do operations on our systems safely, while still doing them quickly, is a huge improvement.

This approach is excellent for database migrations: with an entirely separate, distinct cluster, we can spin up a new one, shut down its processing, run an offline migration on it, catch it back up, switch traffic over to it, and delete the old system. Compare that to the number of hoops we used to have to jump through with Rails and online schema changes and the specialized tooling — Percona online schema change and the like — for anyone who's ever gone down that route, this is so much easier and so much lower risk. We use normal Ecto migrations, so what we run in tests is much more similar to what happens in actual production environments, and it's quicker, with no risk to our users, no risk to our customers, because we can separate them and easily reproduce the whole stack again.

We went one step further than migrating: we replaced some of our databases entirely. Originally the log entry service ran on Percona XtraDB, which is MySQL with Galera and some clustering fun. We decided to move the log entry databases to RDS Aurora; it was a good fit for a couple of reasons, they have some neat things there. We did it this way: brought up a new cluster on Aurora, backfilled it, caught it up, checked it with half the traffic, ramped it up, and then deleted the old system entirely — just gone.
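The feature-rollout tooling isn't described in detail in the talk, but the traffic-direction idea used both for testing the fix and for ramping the Aurora cluster can be sketched roughly like this, with a hypothetical FeatureFlags module deciding per account which stack serves reads:

```elixir
defmodule LogEntryQuery.StackRouter do
  @moduledoc """
  Hypothetical sketch of directing read traffic between two parallel log
  entry stacks. The flag name, `FeatureFlags` module, and URLs are
  illustrative assumptions, not PagerDuty's rollout tooling.
  """

  @old_stack "http://log-entries-query.blue.internal"
  @new_stack "http://log-entries-query.green.internal"

  # Send an account's reads to the new stack only when the rollout flag is
  # enabled for it; flipping the flag back instantly drains traffic away
  # from the new stack if anything looks wrong.
  def base_url(account_id) do
    if FeatureFlags.enabled?(:log_entries_new_stack, account_id) do
      @new_stack
    else
      @old_stack
    end
  end
end
```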
At the point we did it, the log entry service was the largest production database at PagerDuty. We changed the engine out from under it, no one noticed, no downtime, no maintenance windows, no negative customer impact at all. A huge and, I think, underappreciated benefit of CQRS and event-sourced systems is this ability to run them in parallel — the operational flexibility, the operability improvements it brings, being able to replay your commands and reproduce your systems in their entirety. The discussion tends to focus on the code and complexity advantages, but there's a lot to be gained operationally as well. Not all PagerDuty systems look like this, but it is a pattern suitable for certain requirements, and we have continued to use similar patterns where we can, including in more complex situations where a traditional relational database isn't appropriate. We've leveraged more complex event-sourcing-style approaches, like snapshotting, materialized views and the like — other ways of building highly performant systems while keeping them highly available, maintainable, and operable. I hope you can too. We'll take it to Q&A. [Applause]
Info
Channel: ElixirConf
Views: 3,579
Rating: 4.8823528 out of 5
Id: -d2NPc8cEnw
Length: 32min 39sec (1959 seconds)
Published: Thu Aug 29 2019