AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)

Video Statistics and Information

Captions
Good afternoon, or good evening; I'm not sure, I think the sun has set. My name is Jaso Sorensen. Technically that's not my name; my name is James Christopher Sorensen the third, and Jaso is my Amazon given name, so if you want to email me, email jaso@amazon.com. I'm a Senior Principal Engineer at Amazon. I've been with AWS for about 14 years: I was one of the original developers on S3 way back when, I launched our IoT product, and for the last couple of years I've been working on DynamoDB. Today I want to give a talk about how Dynamo works under the hood, and the way I plan on approaching this is how we onboard engineers to the Dynamo team. The first thing an engineer has to do when they join Dynamo is interact with the API. They take the role that you as customers have: they probably haven't used Dynamo, especially the folks we hire out of college, so we have them make a table, create indexes, set up streams, a lot of the public-facing API of DynamoDB. Once they're done with that, we move on to what this talk is going to be, the onboarding.

So what are the goals of this talk? Hopefully you'll learn about some of the features of Dynamo and how they work, but I think the real goal, and what I hope you take away, is understanding how this tool works and how it can work better for you. An analogy is appropriate here. If Dynamo is a tool, the analogy I think of is a car. A car is a tool you use to get to work, and if you start driving your car knowing nothing about cars, and you're driving down the road with the engine screaming really loudly while you keep up with traffic, you think everything is fine and dandy. The reality is you've probably shifted into first gear, and that car is going to last you maybe 5,000 miles total, because you're going to wear it out. Now, you're not going to wear out DynamoDB, but clearly using a tool the way it's intended to be used gives you much better results, and understanding how Dynamo works on the inside will let you use it more effectively.

The way we're going to do this is to walk through five different features of Dynamo, starting with the simplest and moving on until we get to global tables. We'll talk about how you get an item and put an item, talk about auto scaling and provisioning, how backup and restore works in Dynamo, touch a little bit on streams, and wrap up with global tables.

When you get an item from Dynamo, you make a call and come in through the network. Dynamo doesn't care whether you're coming from the public internet, from the EC2 network, or from a VPC, a virtual private cloud; we don't care how you get to DynamoDB. When you get there, you land on a process we call the request router, and this is the public-facing API for Dynamo. The first thing the request router does is authenticate the request; it makes sure the caller is who they say they are, and this authentication system is common among all AWS components; we use the same subsystem for that. The other thing the request router does is make sure you are authorized to do whatever you're asking to do. In this case you're trying to get an item, and there would be some policy attached. I'm just showing a sample policy here that gives me permission to get an item from Dynamo; I threw it up as an example, and the details aren't really important for this.
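To make the public-facing API the request router fronts concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the table name and key attribute are made up for illustration.

    import boto3

    # The request router is the process that receives these calls; the SDK signs
    # each request so the router can authenticate it and check it against your
    # IAM policy before touching any storage node.
    dynamodb = boto3.client("dynamodb", region_name="us-west-2")

    # "Customers" and "CustomerID" are hypothetical, just to make the calls concrete.
    dynamodb.put_item(
        TableName="Customers",
        Item={"CustomerID": {"S": "042"}, "Name": {"S": "Bob"}},
    )

    response = dynamodb.get_item(
        TableName="Customers",
        Key={"CustomerID": {"S": "042"}},
        ConsistentRead=False,  # an eventually consistent read, discussed later in the talk
    )
    print(response.get("Item"))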
After we've authenticated and authorized, the next thing the request router does is make a call out to a storage node, which is where your actual data is stored. The storage node looks up the piece of data you're asking for by its key, gets the data associated with it, sends it back to the request router, and then back to you. Pretty straightforward.

A put gets a little more complicated. The request router talks to a storage node and tells it to put this data. The storage node stores it locally, but because we need to durably store your data, putting it on one server clearly puts the data at risk. So what Dynamo does is tell two other storage nodes to store that data as well. In reality, Dynamo waits for one other node to acknowledge it, to reduce latency; the third node is usually really close behind, but we just have to get the data onto two nodes. Then the storage node acknowledges to the request router, and we send you back the result for the put, saying that we were successful.

Now, I bet a lot of you have read the Dynamo paper. We published that paper ten or eleven years ago, and the Dynamo explained in that paper is not the same Dynamo we use today; DynamoDB has evolved from Dynamo. In the Dynamo paper we talked about quorum, and that's how Dynamo guaranteed correctness: we did quorum puts and quorum reads. DynamoDB doesn't do that. Instead, we use something called Paxos. Paxos is an algorithm from a paper Leslie Lamport wrote called "The Part-Time Parliament" a long time ago, in 1989 I believe, and that paper didn't get a lot of recognition or notice at the time. In 2001 he wrote a follow-up called "Paxos Made Simple," and I guess it was at that point that people really understood the power of what he had proposed in the original paper. Paxos is a way of getting a bunch of distributed machines to all agree on a certain value, whatever that value is, and in Dynamo's case the thing we're trying to agree on is a leader. Dynamo runs Paxos among the storage nodes to elect a leader for that partition of your table.

What happens when you do a put is that the put is sent to the leader storage node, and the leader is, definitionally, always up to date. To even become leader you have to know that you have all the mutations up until that point; if you don't, you have to go to one of your peers and get yourself caught up, because when you're doing a conditional put, for example, you have to know that you have the correct value to compare against. So we elect the leader, and then we propagate that data out to the peer storage nodes.

The leader is also periodically heartbeating, I think right now once every 1.5 seconds, out to the peer storage nodes. If those peers miss some heartbeats, two heartbeats, three heartbeats, I should know that number exactly, what happens is a peer will say, whoa, the leader's gone, and it will initiate a new election round and say, I would like to be leader now, I think the other guy is gone. If its peers agree, it becomes the leader and takes over leadership of the partition, modulo it having caught up with the most current writes.
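Here is a conceptual sketch of the write path just described: the put goes to the leader, which persists it locally, forwards it to its two peers, and acknowledges as soon as one peer confirms, so two of the three copies are durable before you get your response. The `Replica` objects and their `store` method are hypothetical; the real system runs Paxos and handles failures, this is just the shape of the acknowledgment rule.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def leader_put(leader, peers, key, value):
        """Conceptual sketch: acknowledge a put once two of three replicas have it.

        `leader` and `peers` are hypothetical replica objects with a blocking
        store(key, value) method; this is not DynamoDB's actual code.
        """
        leader.store(key, value)                      # copy 1: the leader itself
        pool = ThreadPoolExecutor(max_workers=len(peers))
        futures = [pool.submit(p.store, key, value) for p in peers]
        # Copy 2: wait for the first peer to confirm; the third copy usually
        # lands moments later but is off the latency path.
        wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)                     # let the remaining write finish in the background
        return "OK"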
As you might guess, Dynamo doesn't have one request router and three storage nodes; we have thousands of them, many thousands. And like any well-architected AWS application, just as we give guidance to our customers that you need to be in multiple Availability Zones to have high availability, Dynamo is no different: we have storage nodes in different Availability Zones and request routers in different Availability Zones. When you make a request, you'll hit some arbitrary request router. The request routers themselves are stateless, so it doesn't matter which one of those green boxes you land on. That request router then makes a call to the storage node that is the leader for the partition you need, and the leader in turn talks to the other storage nodes in the other Availability Zones.

This points out an extra piece of complexity in the system that we didn't talk about initially, and that is partition metadata. Somehow that request router, even though it's stateless, has to know which one of all those storage nodes is actually the leader for that partition. So we have another subsystem called the partition metadata service, and we'll talk a little more about that later.

The next thing I want to talk about is how Dynamo sets up tables. This is pretty basic stuff for Dynamo, but I want to make sure we're all on the same page. When you make a table in Dynamo, you have to tell us a primary key, a primary hash key as a matter of fact. In this example we have some data about some customers, where they live and who they are; if you care, those are all the members of my family, and my daughter is studying in London. We have a customer ID, and we're going to choose that customer ID to be the primary hash key for this table. What Dynamo then does in the background is compute a hash of that primary key. We don't publish what that hash function is, it's arbitrary, but it is always the same hash function. We then take these hash values and essentially sort the data by them. This is really happening as the table is being built, but we sort them, and as you can see we've decided to partition this table into three partitions. The first partition goes from hex value 0 to something like 0x7-something, the second partition goes from that 7 up to maybe something like C, and then we have our third partition. These partitions then have to get mapped onto storage nodes, so we pick a storage node in each of the three Availability Zones, and then it's up to those storage nodes to choose who's going to be the leader via the Paxos algorithm.
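Here is a rough sketch of that key-to-partition mapping, with a stand-in hash function and made-up hash-range boundaries; DynamoDB's real hash function and range assignments are internal and not published.

    import hashlib
    from bisect import bisect_left

    # Hypothetical upper bounds of each partition's slice of the hash space.
    PARTITION_UPPER_BOUNDS = [0x55555555, 0xAAAAAAAA, 0xFFFFFFFF]

    def partition_for_key(primary_key: str) -> int:
        """Map a primary hash key to a partition index, conceptually."""
        # Stand-in for Dynamo's internal (unpublished) hash function.
        digest = hashlib.md5(primary_key.encode("utf-8")).digest()
        hash_value = int.from_bytes(digest[:4], "big")
        # Each partition owns a contiguous range of the hash space.
        return bisect_left(PARTITION_UPPER_BOUNDS, hash_value)

    # e.g. partition_for_key("customer-042") -> 0, 1, or 2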
This brings up an opportunity to explain what eventual consistency means inside of Dynamo. Puts, as I said, have to be consistent; we have to talk to the leader. But in Dynamo you can request an eventually consistent read, and the way that happens, and why it's eventually consistent, is that we let the request router randomly choose any one of the three storage nodes hosting that partition. So if, for whatever reason, this lower storage node is falling a little bit behind, you may not get the most recent put to your data. Now, the odds of this are actually pretty low, for some definition of low, because we know the leader is always up to date, and we know that one other storage node had to be up to date for us to have acknowledged the put. So the odds that you're talking to a storage node that doesn't have the latest data are at worst one in three; most of the time, even with an eventually consistent read, two-thirds of the time you will still get a consistent result, you just might not get the most recent write. The question then arises, how inconsistent is it? That is part of the problem with eventual consistency: eventually that node will get caught up, but I don't have an exact answer. It can depend a lot on network traffic, how far behind it fell, whether it was just recently rebooted and has a whole bunch of log to replay. There are times when it can be quite a ways behind, but almost all the time it's within milliseconds of the leader.

Next I'd like to dig into what happens inside these storage nodes. The storage nodes have two data structures in them: a B-tree and the replication log. The B-tree is where we do all the query and user interactions: we put your item into the B-tree, when we do a get we use the B-tree's index to find the item you're looking for, and scans and queries run against the B-tree as well. Then we have the replication log, which records every mutation that happens against that particular partition. Those are really the two internal data structures of a storage node.

I mentioned our partition metadata service. Dynamo has a component we call auto admin, and auto admin has many, many roles in Dynamo. One of its roles is to make sure the partition metadata system is up to date, always being updated with the location and the leader for every partition. Auto admin has another role, and that is partition repair. Auto admin is monitoring all the storage nodes in Dynamo, and if it detects a failure, it's auto admin's job to figure out how to repair it, and the way it does that is by enlisting another storage node to take over that partition. What we do is copy the B-tree from one of the existing storage nodes; clearly we can't go to the grey one, because for whatever reason it's down, so we go to one of the other storage nodes, start copying that B-tree to the new destination, copy the replication log, and then make sure the replication log is applied to the B-tree. Once that process is done, the new node is essentially caught up with the leader, and it's even eligible to become the leader of the partition. That's just an example of one of the things auto admin does.
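A conceptual sketch of that repair flow, using hypothetical helper methods (`copy_btree`, `copy_replication_log`, `apply`); the point is only the order of operations: copy the B-tree, copy the log, then replay the log so the new replica catches up.

    def repair_partition(healthy_peer, new_node):
        """Sketch of auto admin's partition repair, not real DynamoDB code.

        Rebuild a replica from a healthy peer: bulk-copy the B-tree, copy the
        replication log, then replay the log on top of the copy so the new
        node catches up and becomes eligible for leader election.
        """
        snapshot = healthy_peer.copy_btree()           # hypothetical helper
        log = healthy_peer.copy_replication_log()      # hypothetical helper
        new_node.install_btree(snapshot)
        for mutation in log:
            new_node.apply(mutation)                   # replay mutations in order
        new_node.mark_eligible_for_leadership()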
I wanted to talk a little bit about secondary indexes. Let's go back to the same table we were talking about a little while ago and build a secondary index on the name attribute of our items. The process is very similar to a regular base table: we take the attribute we're going to build the secondary index on, hash it, sort it, and send it to different partitions. So again, the top, purple partition maybe splits at hex 0x08-something. One key point here is that these secondary indexes don't have the same partitioning scheme or the same number of partitions; they're actually independent of what happened with the base table, which is sometimes a little confusing.

Now, what happens when you update that secondary index? The process for the base table is exactly what we described before, but now we have to get the index partitions updated, and we do that with an independent process called the log propagator. The log propagator watches the replication log on the storage nodes; it's aware of the schema for your table and which attributes are in the secondary index, and it executes the put, essentially like the request router would for the base table, and sends that update to the index partition.

It can get a little more complicated than that. If you update an item, say our customer Bob wants to be known as Robert, we do an update to the base table to change that attribute from Bob to Robert. What happens in the index is that we remove Bob from someplace in the old index and rewrite that index entry someplace else. This points out that you can get amplification of your writes to your secondary indexes, even though you're only doing one put to the table. And it can get worse than that: right now we allow you up to five global secondary indexes, so a single put to a base table can end up hitting eleven different partitions, and ultimately 33 different storage nodes can be involved in that single put.

I mentioned auto admin. "Magic" isn't quite the right word; it's the heart of DynamoDB, the piece we call the DBA for Dynamo. Its job is doing things like the repair we talked about, creating tables, table provisioning; it's involved in splitting partitions when they need to be split, rebooting servers, lots of other things. It's essentially our DBA, and the design principle we have is that if a human had to do something to Dynamo, we need to get auto admin to do it, because humans cannot manage a system at the scale of Dynamo.

Cool, let's move on and talk a little bit about provisioning table capacity. This is probably all of your favorite part of Dynamo. When we build a table, there are really only two things we ever require you to give us: one is the table name, and two is the primary key for that table. But you can also specify the read capacity units and write capacity units for your table, which is essentially how many reads and writes you're allowed to do per second. A read capacity unit is not just a single item read; it's based on the size of your item. One RCU lets you read an object of up to 4 KB; if your object is 20 KB, then to read that item you will need 5 RCUs in that second. The same is true for writes.
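As a worked example of that sizing rule: the number of RCUs a read consumes is the item size divided by 4 KB, rounded up. (This sketch ignores the strongly versus eventually consistent distinction the pricing documentation makes; it only illustrates the size math from the talk.)

    import math

    def rcus_for_read(item_size_bytes: int) -> int:
        """RCUs consumed by one read: item size divided by 4 KB, rounded up."""
        return max(1, math.ceil(item_size_bytes / 4096))

    print(rcus_for_read(3_000))    # 1 RCU  (fits in a single 4 KB unit)
    print(rcus_for_read(20_000))   # 5 RCUs (the 20 KB example from the talk)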
The thing about provisioning, though, is that it's a really hard problem for you, and we know that. Your desire is to pay as little for Dynamo as possible, so you want that provisioning number as low as possible; but if you set it too low and your traffic changes in any way at all, you're going to get throttled. There are applications that are fine with being throttled, they don't care, but there aren't many of them; for the most part people really do care, they want to get to the data sitting in DynamoDB, so they end up over-provisioning their tables. Like I said, this is a hard trade-off, and what I wanted to do today is walk through an example of some of the improvements we've made to provisioning in Dynamo. Because WCUs are completely analogous to RCUs, we're just going to look at the read side of the equation; everything we talk about would be identical on the write side, so we can ignore it for this example.

We go back to our table, and we have this table provisioned for 300 RCUs. What Dynamo does is split that up among the partitions, so each partition gets 100 RCUs to read the data. The way we implement these RCUs is with a fairly classic algorithm called a token bucket. A token bucket has a fill rate, and in Dynamo the fill rate is your RCUs, so in our example we're adding 100 tokens per second to the bucket, and we take one token out for each read operation you do, modulo the size of the object I was talking about a moment ago. If there are no tokens in the bucket when your read comes in, we respond with a throttle.

This token bucket has a capacity of 300 times your RCUs, and the reason we choose 300 is that it gives you five minutes of tokens. If you don't do a read against your table for five minutes, you will bank 300 times your RCUs in tokens, and once the bucket is full, additional tokens just fall on the floor; in our example it maxes out at 30,000 tokens, but again, it's 300 times whatever your provisioned RCUs are. What's cool about this is that we allow you to burst into this capacity, so if a spike in traffic comes in, you have essentially five minutes' worth of tokens to execute those gets against. Now, a storage node can't actually support 30,000 requests within a single second, we wouldn't be able to handle that, but if you spread those 30,000 requests over about 20 or 30 seconds, we will honor that, and we will let your traffic burst to accommodate that temporary spike.
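Here is a minimal token bucket along the lines just described. The fill rate and the 300-seconds-of-capacity rule come from the talk; the clock handling and method names are just an illustrative sketch.

    import time

    class TokenBucket:
        """Sketch of per-partition read throttling: fill at `rate` tokens/sec,
        cap the bucket at 300 seconds' worth (five minutes of banked burst)."""

        def __init__(self, rate_per_sec: float):
            self.rate = rate_per_sec
            self.capacity = 300 * rate_per_sec      # five minutes of tokens
            self.tokens = self.capacity
            self.last_refill = time.monotonic()

        def _refill(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now

        def try_consume(self, tokens_needed: int) -> bool:
            """Return True if the read is admitted, False if it is throttled."""
            self._refill()
            if self.tokens >= tokens_needed:
                self.tokens -= tokens_needed
                return True
            return False

    bucket = TokenBucket(rate_per_sec=100)          # 100 RCUs on this partition
    admitted = bucket.try_consume(5)                # a 20 KB read costs 5 tokens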
But if you have sustained traffic, let's say Bob here is some busy customer and we're doing lots of lookups against Bob, our RCUs become unbalanced on that particular partition. If that load is sustained, bursting eventually stops helping, because the token bucket is only getting refilled at 100 tokens per second. If the load stays at 150 per second, 50 of those reads are going to get throttled. Again, not such a cool thing to have happen, especially given that you actually still have 75 RCUs on the table in aggregate that you could be using, but we're throttling you because you have a hot partition.

So what do we do here? A couple of years ago we introduced something called adaptive capacity, and what adaptive capacity does is change how quickly your token bucket gets filled. If we could fill the token bucket at 150 tokens per second, we would be able to sustain the load we were just seeing and have no throttling. The way we do this is with something we call the adaptive capacity multiplier, which is just a number we multiply by your RCUs to change the fill rate of the token bucket. Here we're showing the token bucket now getting filled at 150 tokens per second. This multiplier is applied to every partition in your table, so going back to our picture, even though each partition is provisioned at 100 RCUs, we will let that traffic go through at 150 RCUs.

How do we do this? We use what you would usually think of as a PID controller. PID controllers are super common in industrial design for controlling lots of different processes; PID stands for proportional, integral, derivative. A PID controller tries to control a process by setting a value such that the feedback loop brings the system into some kind of equilibrium. We give this PID controller several inputs: your consumed capacity, how much the table is using in aggregate; your provisioned capacity on the table; the throttle rate; and, because it's a PID controller, the current multiplier, which it needs to compute the new one. Its output is a new multiplier for the table, which we apply to all the partitions in your table, and that adaptively changes the effective capacity of your table.

You're probably thinking, hey, I can game this, right? That partition is now getting 150 RCUs, which is 150 over what I'm paying for at 300 RCUs total. This condition will last for a little while, but what will happen is that our PID controller will notice that your consumed capacity is above your provisioned capacity, and the multiplier will go back down to one. In a little while, if your load stays identical, each partition will get 100 RCUs and each will get throttled by 50; we're back to throttling. So this isn't so cool, but at least we're giving you what you're paying for.
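Here is a rough sketch of a feedback controller in that spirit. The talk only tells us the inputs (consumed capacity, provisioned capacity, throttle rate, current multiplier) and the output (a new multiplier applied to every partition); the gains, the error definition, and the clamping below are invented for the sketch and are not DynamoDB's actual controller.

    class AdaptiveCapacityController:
        """Illustrative PID-style controller for the adaptive capacity multiplier."""

        def __init__(self, kp=0.5, ki=0.1, kd=0.05):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = 0.0

        def update(self, consumed, provisioned, throttle_rate, current_multiplier):
            if consumed >= provisioned:
                # The table is already using at least what it pays for in
                # aggregate: drive the multiplier back toward 1.0.
                error = -(current_multiplier - 1.0)
            else:
                # Spare aggregate capacity but throttles on a hot partition:
                # push the fill rate up in proportion to the throttling.
                error = throttle_rate / provisioned

            self.integral += error
            derivative = error - self.prev_error
            self.prev_error = error

            adjustment = (self.kp * error
                          + self.ki * self.integral
                          + self.kd * derivative)
            # Never fill slower than the provisioned rate.
            return max(1.0, current_multiplier + adjustment)

    controller = AdaptiveCapacityController()
    # 150 consumed, 300 provisioned, 50 reads/sec throttled on the hot partition:
    multiplier = controller.update(150, 300, 50, current_multiplier=1.0)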
So what do we do here? Well, last year we launched auto scaling, and this is the next solution to the problem. When you set up your table you can set up auto scaling, and we'll ask you some questions about how you want it to work. You can set a lower bound on your capacity, an upper bound, and a target utilization; 70 percent is usually a pretty good number. If you know your load is really flat and doesn't have a lot of variance, you can set your target utilization much higher; if you know your load is really spiky, you might want to set it a little lower. The reason we have to ask is that now we're changing the provisioning of your table, and we're going to change the amount of money Dynamo charges you, so you obviously have to opt in to this, and we really do encourage you to opt in. I'll show you a graph in a minute of what it did for one of Amazon's internal tables.

So if we're sitting in that situation where we were at 150 RCUs per partition, auto scaling will take the provisioning of this table, in this example, to 640. The reason it's 640 is that 70 percent of 640 is roughly our consumed capacity of 450, so it ends up setting our provisioned capacity to 640.

How does this auto scaling thing work? Hopefully most of you have looked at the CloudWatch graphs for your tables, because Dynamo publishes these metrics to CloudWatch. The metrics are sent to CloudWatch, and AWS has a service called AWS Auto Scaling, which sets alarms on certain values sitting in CloudWatch. Auto scaling actually sets two alarms per table, technically four, but again we're just talking about the read side and writes are symmetrical: one for your provisioned value and one for the consumed value, and if you go and look in the console you can actually see these alarms. The reason it sets two is that the consumed alarm is clearly there so it can change the value; the alarm on the provisioned value is there because if you go through the API or the console and change the provisioning yourself, that's how auto scaling learns you changed it; otherwise it wouldn't be able to adjust the ratios correctly.

The other thing auto scaling will do is scale down. If your load goes down, maybe it's the middle of the night and now we're running at 10 RCUs per partition, auto scaling will scale your table down: in this case we're actually consuming 30, so we need about 43 provisioned to keep our utilization at 70 percent.

Here's the graph I was talking about. This is an Amazon internal table, and you can see what auto scaling has done and how nicely it follows the curve. At one point, around 9:30 or so, we got really quite close, and had the load been a lot spikier maybe we would have suffered a little throttling, but I'm really amazed and pleased at how well auto scaling tracks the consumed capacity of the table.

A quick recap on provisioning. When Dynamo launched, we expected you to have a very balanced workload, each partition getting essentially the same amount of work over the same period of time. But we heard from our customers and understood that there's a problem with imbalance of load over time; maybe it's the top of the hour and you run some report and make a bunch of queries against Dynamo, so your request rate is imbalanced in time, and we built bursting to help solve that. You can also have imbalance in your key space; in my example Bob was a really hot customer and that partition was getting a lot of load because of the imbalance in our key space, so we built adaptive capacity to solve that. And our workloads change over time; hopefully our systems grow, get bigger, and get more adoption, and as that happens, we built auto scaling to solve it. I think we're going to have to continue to iterate on this, because like I said, we understand this is one of the pain points of using Dynamo.
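Before moving on, the target-tracking arithmetic behind those auto scaling numbers is simple: provisioned capacity is set so that current consumption sits at the target utilization, i.e. provisioned ≈ consumed / target. A quick check of the two examples from the talk (the small differences are just rounding in the slides):

    def target_provisioned(consumed_rcus: float, target_utilization: float) -> float:
        """Provisioned capacity needed so consumption equals the target utilization."""
        return consumed_rcus / target_utilization

    print(target_provisioned(450, 0.70))   # ~643, shown as 640 in the talk's example
    print(target_provisioned(30, 0.70))    # ~43, the scale-down example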
Next I want to talk about backup and restore and how we've implemented it. Dynamo has two kinds of backup and restore: point-in-time recovery, where you can specify any point within the last 35 days to restore your table to, and on-demand backups, which you can restore at some point in the future.

Where would you durably store backup data? This is a little bit of bureaucracy here: I'm not allowed to just say S3. I told you I worked on S3 when I started out; the first time I use it I have to say Amazon Simple Storage Service. I don't think anybody knows it's called that; it's S3, right? We store our data in S3. What we do is move those replication logs to S3: we aggregate some amount of data, essentially as a file, and upload it to S3, and before we delete it off the storage node we make sure it has been uploaded. Now, we don't actually upload three copies of the data; the storage nodes look at what their peers have done, and if a peer has already uploaded it, we won't upload it again.

These boxes on the right represent the logs for each partition, and you'll notice there is no coordination about when these logs are uploaded; the storage nodes make independent choices about when to upload their data. If their disks are starting to fill up, or whatever, they can say, I can make disk space, I'll upload some logs so I can delete them. Each storage node also periodically uploads a scan of the B-tree. I say "snapshot" of the B-tree, but it's not a snapshot in the classic sense: we're scanning that B-tree to get all the data out of it, and that scan takes time, so the uploaded B-tree is not a consistent view of your Dynamo partition. Over time, Dynamo periodically decides to take a new snapshot of a partition, and again these are not coordinated among the partitions.

Later on, if you want to restore, you pick a time to restore your table to. The first thing Dynamo does is look at all the artifacts it has stored in S3 for that table and decide which pieces it needs: it finds some snapshot in the past, and then all the logs from when that snapshot started, whatever log was active at that point, through any logs up to and past the time of the restore point. It then creates partitions, in this case three partitions, in the new table for you, restores the snapshots, and applies the logs to those snapshots. If you choose a point-in-time restore at a moment when a snapshot was being taken, not that you would even know a snapshot was in progress, we still can't use that snapshot, because it doesn't give a consistent view; but if you choose a point after it completes, we can use that snapshot, and then we don't need the older artifacts.

That's a little bit about how we do point-in-time recovery, but it gets more complicated. Dynamo will split your partitions: if the data in a partition gets too large, auto admin decides that partition needs to be split. In this case we're showing the green partition getting split, and now if you choose to restore to some later point in time, we have a slightly more complicated problem to solve: how do we restore these light green partitions? They're both rooted in that dark green snapshot, but now they have separate key spaces. So when we do the restore, we take whatever key range the partition split at: the lower part of the range from the snapshot gets restored into the first partition, the keys from the higher part of the range go to the other restored partition, and then we apply the logs from there.
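A conceptual sketch of picking the restore artifacts, assuming simple records with start and end times for snapshots and log segments; the real system also deals with partition splits and per-partition key ranges, which this ignores.

    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        started_at: float        # the B-tree scan is not instantaneous,
        completed_at: float      # so the snapshot alone isn't a consistent view

    @dataclass
    class LogSegment:
        start: float
        end: float

    def plan_restore(snapshots, logs, restore_time):
        """Pick the newest snapshot that completed before the restore point,
        then every log segment overlapping [snapshot start, restore time];
        replaying those logs makes the inconsistent scan consistent again.
        (Conceptual sketch; the real system also handles partition splits.)"""
        base = max((s for s in snapshots if s.completed_at <= restore_time),
                   key=lambda s: s.completed_at)
        replay = [seg for seg in logs
                  if seg.end >= base.started_at and seg.start <= restore_time]
        return base, sorted(replay, key=lambda seg: seg.start)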
On-demand backups are kind of similar to point-in-time restores for us; clearly, functionally they're very different for you. An on-demand backup is essentially a point-in-time recovery at right now, at the instant you ask for the backup. The problem is that we won't necessarily have all the logs we need in S3 yet; we'll have almost everything up to that point. So one of the very first things we do, as soon as you tell us you want an on-demand backup, is tell the storage nodes to upload their logs through the current point in time, and then all the data we need is sitting in S3. In this backup of the massive 59-byte table I made for this demo, you can see that my backup became available, and it becomes available once those logs are in S3; now we could restore it to a new table if we chose.

Some of you might be surprised to find that we're moving all this data to S3 all the time, and that we don't charge for it, except that if you turn PITR on we start charging. Why is that? Because these snapshots and logs are our insurance policy, and we get to control when we take snapshots and when we delete logs; we're always optimizing to be able to do an on-demand backup right now. Without PITR enabled, as soon as a snapshot completes we can look at all the older artifacts sitting in S3 and delete them, because we know we're never going to need them; we have a current snapshot. But when you turn PITR on, from that moment we start a clock and say, for this table we have to keep all these artifacts far enough into the past that we can restore to any point in time you request. So when we take that snapshot, we have to maintain the data for at least the 35-day PITR window we offer, and the reality is that these snapshots are sometimes substantially older than that. We keep them around when it's cost-effective to do so: if you're not making a lot of changes to your table, the logs don't grow very fast, so we might keep a snapshot around much longer and have more time's worth of logs to apply, even though data-wise it might not be a lot. That explains why and how we do the billing for PITR.

I wanted to move on to DynamoDB Streams; we'll just touch on this a little bit. DynamoDB Streams is a way of getting all the mutations against your table: every put, update, and delete, anything that changes the table. One of the cool things about Dynamo Streams is that there are no duplicates in the stream, and I'll tell you in a moment why I think that's important. The records are also in order; I put a parenthesis around that, because they're in order by key. Every mutation in the stream for a particular key is guaranteed to appear in the order you executed those operations against that key, but mutations may not be in order across multiple keys, because the partitions aren't operating in sync. A key, though, will always be in the same partition, and the partition ordering will always be maintained.
The other thing I think is cool is that we give you the new and the old item image in these records. The way DynamoDB Streams works is that it rides on essentially the same technology Amazon Kinesis is built on, so if you're familiar with Kinesis, a lot of the concepts are the same as they are for DynamoDB Streams. We have the concept of a shard, they have the concept of a shard; you actually use the Kinesis Client Library to talk to DynamoDB Streams; I'll have an architecture picture in a moment. We have things like records and checkpointing and so on, because this rides on that same technology. What we do differently is that you need the DynamoDB Streams Kinesis adapter: the KCL talks to that intermediate layer, which then talks down to Streams. So the API on the read side looks identical to a Kinesis stream, but clearly we don't let you put into your DynamoDB stream.

The stream gets written from the storage node. Each stream is composed of many different shards, and a shard is an in-order list of mutations; it's the storage node that writes those mutations to the shard. Again, this is asynchronous: if that subsystem is having issues, the leader storage node for the partition knows what was committed into the Dynamo stream and what wasn't, and it catches up a little later. Our typical latencies here are on the order of tens of milliseconds for the data to get into the stream.
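If you want to poke at a stream directly rather than through the KCL and the adapter, the low-level read API is the usual Kinesis-style shard iterator flow. A minimal sketch with boto3; the stream ARN is a placeholder.

    import boto3

    streams = boto3.client("dynamodbstreams", region_name="us-west-2")
    stream_arn = "arn:aws:dynamodb:...:table/Customers/stream/..."   # placeholder

    # Each stream is made up of shards; each shard is an in-order list of mutations.
    shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]

    for shard in shards:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        records = streams.get_records(ShardIterator=iterator)["Records"]
        for record in records:
            # Each record carries the mutation type plus, depending on the
            # stream view type, the new and/or old item image.
            print(record["eventName"], record["dynamodb"].get("NewImage"))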
Finally, the last thing I wanted to talk about is global tables, which we launched last year, I think at re:Invent. The idea is that you can get a system of tables in multiple regions to all work together and have the same data in them. I've set up a small global table here across three different regions, but I want to point out one thing first. One of the interesting things about global tables is that there's an IAM role associated with them. Global tables for the most part operates as an external service to Dynamo, so we have to have permission to write into your table; it's not like the request router, which has direct access to a storage node. Global tables goes through request routers, and it needs to pass the same authentication checks as any other user would, and that's why there's a service role for DynamoDB replication.

When we propagate data from one region to another, we're essentially running a stream reader in the source region, and that reader consumes all the mutations and ships them over to DynamoDB in the second region. But global tables is multi-master, so we have to go in the other direction as well: in the other region we have a stream reader looking at its mutations. Now, this is kind of like the snake eating its own tail, right? I mutate here, it goes into the stream, it goes to the other side, that makes a mutation there, that goes into that stream, and it comes back around. Well, it doesn't quite happen that way, and I'll show you in a bit why we don't get into a circular loop. Global tables is also multi-region, with multi-region replication, so when a stream reader reads data it has to ship it not just to one region but to every region that belongs to that global table. With three regions you have all these stream readers, and you quickly get a complicated connected mesh where every stream reader is talking to all the other regions. That's the high-level architecture of what's going on.

Like I said, you could build global tables yourselves relatively easily if you guarantee that your table only has one partition, so there's only one shard, and your regions are fixed, and so on. But to really build global tables it gets a lot more complicated, and this is the actual architecture for how global tables works. We don't really have a single stream reader; we have a thing we call rep-out, the replication-out engine, and it consumes from the Streams API just like any other application would. The complexity for global tables comes in when your partitions split: the stream will have more shards in it, and we have to guarantee that a rep-out process is reading the data from every shard of your stream. The way that happens is with some metadata from our control plane: the repl-admin watches what's happening with the partitions, and when it sees that a new partition has come along, we enqueue a piece of work into an SQS queue. That piece of work simply says, hey, there's a shard over here and you need to start replicating it. One of the processes in rep-out reads that off and says, I need to do some more work, I'll pick up this shard, and this feedback loop continues so that we have a rep-out reading from every shard of every global table stream. And because the Streams API is a batch API, rep-out doesn't talk directly to the request routers in the other region; it talks to a process we call rep-in in the destination region, and rep-in is the one that drives the request router locally. When the batch is done, it tells rep-out, and rep-out knows it can checkpoint the stream, because that data has been properly replicated.

Here's a little example where I've made a really small table and put one value in it: the key is "key" and the value is "value." If you look at this item in the console on the global table, this is what you'll see, but as soon as I hit refresh I get these other three attributes. This is the very same table, and if you refresh within less than a second these three attributes show up. What has happened is that rep-out has read the item from the stream and said, here's a piece of data the customer has added to the table or mutated, and it doesn't have any of this replication metadata on it yet, so that's the signal to the rep-out process that this data needs to be sent to the other regions. It adds the source region that it came from, I was doing this from us-west-2, and a timestamp of when the item was mutated. It looks like we're recording this as a timestamp with six digits after the decimal, but in reality it's keeping the time accurate to the millisecond, and those last three digits are a counter of how many events happened in that millisecond. That allows us to guarantee that the timestamp for a partition is always unique, because we reset the counter every time the millisecond ticks over; we start from one again. If you could do more than a thousand operations per second on a partition, which is the limit for a partition, that counter would roll over; but it can't, so it doesn't.
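Here is a sketch of how a millisecond timestamp plus a per-millisecond counter yields a unique, strictly increasing version for writes on a partition. The class and field names are made up; only the millisecond-plus-three-digit-counter idea comes from the talk.

    import time

    class PartitionVersionClock:
        """Sketch of the (millisecond, counter) versioning described above.

        The counter resets every time the millisecond ticks over, so as long
        as fewer than 1,000 writes land in any single millisecond, every write
        on the partition gets a distinct, strictly increasing version.
        """

        def __init__(self):
            self.last_ms = 0
            self.counter = 0

        def next_version(self):
            now_ms = int(time.time() * 1000)
            if now_ms <= self.last_ms:
                self.counter += 1            # same millisecond: bump the counter
            else:
                self.last_ms = now_ms
                self.counter = 1             # new millisecond: restart at one
            # Compared as a pair; the console renders it as seconds with the
            # three counter digits tacked on after the milliseconds.
            return (self.last_ms, self.counter)

    clock = PartitionVersionClock()
    print(clock.next_version())
    print(clock.next_version())   # same or later millisecond, strictly greater pair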
So we set this update time. You'll also see that this deleting flag is set to false. It's not really important why it's false; actually, that's a misstatement, it is hugely important why it's false, but you'll never see it set to true, because it's only true during the time that rep-out is trying to delete the item, so it's a very transient state; most of the time it will be false.

The way global tables does conflict resolution, if you mutate the same item in two different regions at the same time, is last writer wins, and again we have this down to millisecond timing with that additional counter at the end. It is conceivable that you could make two of those timestamps come out identical; I believe the region is then used to disambiguate, and one of them is guaranteed to win over the other. But essentially it's last writer wins; that's what gives us conflict resolution.

So today I've covered a bunch of things. Hopefully these are things that help you understand how Dynamo works and maybe leverage Dynamo better in the future. We've talked about get and put, auto scaling and provisioning, which I hope helps a lot, and how global tables works. There's a whole bunch of things we didn't cover: fleet management, metering, monitoring, capacity planning. Like I said, this is a talk we would typically give to engineers onboarding to Dynamo; they would clearly need to learn how that stuff works, especially in the area they're going to work in, but I don't think it would be as helpful for you in understanding how to use Dynamo. If you were developers on Dynamo, the next step right now would be to start digging into code; we'd say, hey, if you're going to work on auto admin, here's the code base, start reading it. And with that, thank you for your time. Thank you. [Applause]
Info
Channel: Amazon Web Services
Views: 59,662
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Databases, DAT321, Amazon DynamoDB
Id: yvBR71D0nAQ
Length: 53min 42sec (3222 seconds)
Published: Tue Nov 27 2018