AWS re:Invent 2019: [REPEAT] Amazon Aurora storage demystified: How it all works (DAT309-R)

Captions
Hello everyone, thank you for joining. My name is Tobias Ternström and I run product management for Amazon Aurora. Hello everyone, I'm Murali, I run the engineering. Thanks for coming, let's start.

Okay, so we'll start with a quick recap of database internals, nothing too deep, just to motivate why we built Aurora the way we did: the cloud-native database architecture, how we achieve durability at scale, some performance results, and some of the features built on top, like global database, fast database cloning, and backtrack. Tobias?

Sure. So what is Amazon Aurora? It's a database that gives you the speed and availability of commercial databases but with the simplicity and cost-effectiveness of an open source database. It's compatible with MySQL as well as PostgreSQL, and it has an easy pay-as-you-go pricing model across both compute and storage.

Okay, thank you. So, a quick recap. I chose B+ trees because they're widely used in databases, just to explain a few concepts. The data within the B+ tree is organized in a fixed set of pages; for example, in Aurora MySQL we use 16-kilobyte pages. Those pages are kept in memory, in what we refer to as the buffer pool, and this is how the database maintains its in-memory state. Periodically that state is serialized into data pages on durable storage, and that is called a checkpoint. Essentially each page in memory is persisted durably, so that when you have a cold restart, or you want to restore from a backup, the data is there.

When you think about how a database operates on these data pages, the data is actually modified in place: you have a page in memory and it is modified in place, following a standard protocol which we refer to as the do-redo-undo protocol, and the log records generated along the way, with before and after images, are stored in a write-ahead log. Let's take an example. You have the old state of a page, I do an operation, and I get a new state; the before and after images are stored in log records in the write-ahead log. When you want to redo the operation, you take the old state, you apply the log record which contains the operation that was done, and you get the new state. There may be cases where you want to undo, in which case you take the new state, you apply the log record, which also carries the previous state, and you get back the old state.
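To make the do-redo-undo idea concrete, here is a minimal sketch in Python. The record layout and function names are illustrative assumptions for this walkthrough, not the format any real engine uses; the point is only that a log record carrying before and after images lets you move a page forward (redo) or backward (undo).

```python
# Minimal sketch of the do/redo/undo protocol with before/after images.
# Illustrative only; not any engine's actual log format.
from dataclasses import dataclass

@dataclass
class LogRecord:
    page_id: int
    before: bytes   # page image before the change
    after: bytes    # page image after the change

def do_change(pages: dict, page_id: int, new_image: bytes, wal: list) -> None:
    """Modify the page in place and append a before/after log record to the WAL."""
    wal.append(LogRecord(page_id, pages.get(page_id, b""), new_image))
    pages[page_id] = new_image

def redo(pages: dict, rec: LogRecord) -> None:
    """Old state + log record -> new state."""
    pages[rec.page_id] = rec.after

def undo(pages: dict, rec: LogRecord) -> None:
    """New state + log record -> old state."""
    pages[rec.page_id] = rec.before

pages, wal = {}, []
do_change(pages, 1, b"new state", wal)
undo(pages, wal[-1])     # back to the old (empty) state
redo(pages, wal[-1])     # forward again to the new state
```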
So let's see how this plays out in crash recovery. Let's take an example. In the middle column you can follow pages that are written to disk, on the right-hand side you see the records written to the transaction log, and time moves down. Transaction T1 runs and commits: we've made the change in the buffer pool in memory and also written it to the transaction log, and we'll get to why we need the undo/redo protocol. Next we have a transaction T2 that started while T1 was running; as soon as it ran, it updated the buffer pool, generated its log record, and wrote that durably to storage. Now, while T2 was running, the database performed the checkpoint procedure that Murali mentioned. A checkpoint basically means: anything that has been updated or changed in memory since the last checkpoint, I'll write to disk. The idea is that when you restart, you have to get the database back to a consistent state, and you only need to worry about things that happened after the checkpoint. So in this case the checkpoint happened, and the blue page we wrote in transaction T1 has been written durably to the data files of the database, whereas T2's change hasn't, because T2 hadn't committed yet; its change exists only in the transaction log. Then transaction T3 runs; it's written to the log and updated in the buffer pool. And then transaction T4 starts but never gets to commit, because we have a system failure. Whatever T4 managed to do while it was running is in the transaction log, but there is no commit record for it.

Now the system restarts. What does it need to do? Transaction T1 it doesn't need to worry about, because that was written to the data files when the checkpoint occurred, so it can be ignored. Transaction T2 committed, but its change was not captured by the checkpoint, so it needs to be redone based on what was written durably in the log, so that everything is consistent. Transaction T3 needs to be redone entirely, because nothing of it was written before the checkpoint. And transaction T4 needs to be undone, because it never committed. So this is where you can see the redo/undo protocol in action.

Now a recap on the I/Os required to actually manage persistence. Again we have the same kind of columns, with durable storage for the data files as well as the transaction log. When you write, there is going to be one I/O writing to the transaction log. If many things happen at the same time, these I/Os can sometimes be combined across transactions, but you need that one I/O, even if it's only a few bytes, because otherwise you can't guarantee durability; you can't redo. Then you need a page-protection write, what I call a suspenders-and-belt write: because we're writing a 16 K block, a crash could leave only part of it written, so to make sure the page is always fully written somewhere, you write it once into a scratch-pad space and then write it again to the data file when the checkpoint happens. So now we're at three I/Os, and that's roughly the minimum for a write to a traditional relational database. As you can see, I/Os matter quite a bit for performance, especially when it comes to writes.

Murali did this slide; it says databases are all about I/O. There might be some other stuff databases do too, but it's mostly about I/O. What about the SQL on top of it? Yes, there is SQL on top of the I/O, query optimization, transaction processing, things like that. Right, thank you.
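As a back-of-the-envelope illustration of that write path, here is the count in code, assuming the 16 KB page size mentioned above. The breakdown (one log write, one double-write of the page, one page write at checkpoint) is just the minimum described in the talk, not a benchmark of any particular engine.

```python
# Minimum durable writes for one small change in a traditional engine
# with a doublewrite-style page protection scheme, per the description above.
PAGE_SIZE = 16 * 1024   # 16 KB page

ios = {
    "wal_record": 1,        # log write, even if the change is only a few bytes
    "doublewrite_page": 1,  # full page to a scratch area ("suspenders and belt")
    "checkpoint_page": 1,   # full page to the data file at checkpoint time
}

total_ios = sum(ios.values())   # 3 I/Os
page_bytes = 2 * PAGE_SIZE      # ~32 KB of page writes for a tiny update
print(total_ios, page_bytes)
```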
Let's talk about the key motivation. As Tobias mentioned, a database does a lot of I/O. This is what the traditional database architecture looks like: you have a compute instance which processes the SQL statements and the transactions, does the caching, what we showed as the buffer pool, and produces logging which flows into an attached storage system. And because it does a lot of I/O, the best ideas we'd had so far were to increase the I/O bandwidth or decrease the number of I/Os; that's about all we could do.

In Aurora we took a slightly different approach: we look at the log as the database. Assume there is a log stream from the beginning of database time, from T0 to T-now; you have all the log records, and any version of a database page can be constructed from this log stream. Let's take an example: you have a blue page at t5, and we can create it using the log records that were written at t1 and t5. We also refer to this as coalescing: essentially, bring all the log records for a particular page together and create a new page image; that's what we call coalescing.

Now the first problem: if we rely only on the log stream for page reads, it's not practical, it's going to be too slow. Why? Because if you have the log stream from time zero, say a database that has been running for a year, then to do any page read you'd need to replay a year's worth of logs. That's not practical. The solution is periodic checkpoints, which is what a traditional database also does; they're like mini replays every once in a while. One approach we could have taken is to let the database do this, which is what the traditional database was doing, and as Tobias mentioned, that's where the write inflation happens: for a small column update you end up writing 32 kilobytes at a minimum if 16 kilobytes is the page size. The solution we came up with is: why not offload that work to a distributed storage fleet which does continuous checkpointing? So the Aurora method is to just write log records to the distributed storage, and the distributed storage does the continuous checkpointing; when the database does a page read, the latest version of the page is already coalesced and ready to be served.
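Here is a minimal sketch of that coalescing step, assuming for illustration that each redo record knows how to transform its page; real Aurora records are physical redo records keyed by LSN, so treat the types here as stand-ins.

```python
# Sketch of coalescing: fold a page's redo records, in LSN order, onto its
# last materialized image to produce the current page version. Illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RedoRecord:
    lsn: int
    page_id: int
    apply: Callable[[bytes], bytes]   # how this record transforms the page image

def coalesce(base_image: bytes, records: List[RedoRecord], page_id: int) -> bytes:
    page = base_image
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.page_id == page_id:
            page = rec.apply(page)
    return page

# e.g. the "blue page at t5" built from the records logged at t1 and t5
records = [
    RedoRecord(lsn=5, page_id=7, apply=lambda img: img + b"+t5"),
    RedoRecord(lsn=1, page_id=7, apply=lambda img: img + b"+t1"),
]
print(coalesce(b"base", records, page_id=7))   # b'base+t1+t5'
```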
So how did we achieve that? We looked at the overall stack: there is compute and there is storage, and there is a clear difference between the compute lifetime and the storage lifetime. Compute instances can fail, they can be replaced, they can be shut down to save cost, they can be scaled up, down, or out based on load. But in the case of storage we cannot lose the data; it has to be long-lived. So the lifetime requirements themselves give us the idea of separating compute and storage, and by decoupling them we also get scalability, availability, and durability; we'll go through how the Aurora stack achieves them.

In Aurora we built a log-structured distributed storage system which is multi-tenant, meaning the storage serves the data of multiple databases on the same storage layer; multi-attach, meaning multiple database instances from the same database cluster can attach to the same distributed storage; and purpose-built for databases, by which we mean the storage intimately understands log records and pages — in Aurora's case, the storage system understands the MySQL page format, the PostgreSQL page format, all the data structures and so forth. And we leveraged several AWS services to build this: we did not build our own metadata store, we use DynamoDB for that; we use Route 53 for naming; we use EC2 for our instances, we don't have custom hardware, we use instances with locally attached SSDs, I2s and I3s; and we use Amazon S3 for storing our backups.

So let's dive into how I/O flows in a storage node. The red square on your left is an example of a database instance, and the green box is a storage node; we've zoomed into one of them. As you may have heard, Aurora writes six copies of everything, to six separate storage nodes, two in each availability zone, and we've zoomed into one of those six. The other green boxes underneath are its buddies, the other friendly nodes, its peer storage nodes.

Let's follow an actual I/O. The first thing that's a little bit magical with Aurora is when you send the log records. In a traditional database, when do you send them? You would think you send a record as soon as you've finished creating it, but that's not practical, because as we mentioned, every write uses an I/O and you can only do so many I/Os per second depending on your storage subsystem. So you want to group them: you want multiple log records to come together somewhere, and then you write them in batches. You typically have some sort of log buffer, and when it has buffered up enough, or there's been enough delay in the system, it says okay, fine, I'll write this to the log now. And obviously once you have a commit, it has to wait until that has been written durably before the commit is done. With Aurora we don't do those I/Os to disk anymore, we do network I/O: we send the records from the database instance over to the storage node, and that's a lot cheaper. We still do some buffering, but you can have a much smaller network buffer before it gets sent over to the storage node, which means you can move faster.

So we send the records over to the storage node — in fact we send them to all six storage nodes at the same time, but we've zoomed into one of them. They come into memory, into a buffer on the storage node. The first thing the storage node needs to do is write them to disk, so we know we didn't lose the log records; it writes them to what we call the hot log, and then it can say "I've got it" back to the database instance. The database instance then only needs to track what's going on with its six writes, and it can keep moving forward; the only time it actually has to wait is when it issues a commit, and then it needs to wait for at least four of these six nodes to have written all of the log records. This is the only synchronous part of the communication: as soon as the record has been written to disk on the storage node, the acknowledgement goes back and the database continues. The next steps are all asynchronous; the database instance doesn't have to wait for them.

The next step is creating the pages, the optimization Murali mentioned earlier, so that we don't have to construct every page from the beginning of time. The node does this using its queue of incoming log records, and obviously it can't create a page until it knows it has all log records generated up to a certain point in time; there can't be any holes in the chain, because then the redo protocol doesn't work and you'd get the wrong page. So it waits until it has the full sequence of log records. This is where the gossiping comes in: what happens if I'm missing a log record? Say I have one, two, three and five, but I'm missing four. Then I can go ask my buddies, my peer nodes: hey, this log record four, do you have it? (And in fact the storage nodes speak Spanish when they gossip; we're working on Swedish, it's way more complex.) That way the node gets its missing log records from its peers.
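A sketch of that gossip step is below. It assumes, purely for illustration, that log records are identified by dense integer LSNs and that peers can be queried like dictionaries; the real protocol is of course richer than this.

```python
# Sketch of gossip: a storage node finds holes in its local log sequence
# and asks peer nodes for the missing records. Illustrative only.
def missing_lsns(have: set, up_to: int, start: int = 1) -> list:
    """LSNs still needed before we can coalesce up to `up_to`."""
    return [lsn for lsn in range(start, up_to + 1) if lsn not in have]

def gossip_fill(have: dict, up_to: int, peers: list) -> None:
    """Ask each peer for the records we are missing."""
    for lsn in missing_lsns(set(have), up_to):
        for peer in peers:
            if lsn in peer:
                have[lsn] = peer[lsn]
                break

# Example: we have LSNs 1, 2, 3 and 5; a peer has 4.
local = {1: b"r1", 2: b"r2", 3: b"r3", 5: b"r5"}
gossip_fill(local, up_to=5, peers=[{4: b"r4"}])
assert sorted(local) == [1, 2, 3, 4, 5]   # full sequence, ready to coalesce
```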
Once it has the full sequence, it coalesces the records, generates the page, and writes it to disk. So now we have the pages, and this is kind of our internal checkpoint, if you will. As time moves on we ship those data pages over to S3, as well as the log records, so we can use S3 for point-in-time recovery to any point in time. And since we generate these pages all the time: not having a page is a problem for performance, because you'd have to redo everything, but keeping too many versions of a page is obviously a waste, so once we know page versions aren't needed anymore we can garbage collect them. We never garbage collect log records, though; those we have to keep, because you may need them for point-in-time recovery. We also periodically do a checksum validation to make sure there isn't a problem with a page, and if we find a problem, the node can go gossip again with its friendly peers to get that page from them. So again, only the first two steps are synchronous, something the database instance actually waits for; the rest are asynchronous. And the records go to six storage nodes, so all of this happens in parallel in six different locations — more than six across the volume, but for each page, if you will, it's six locations.

Okay, thank you, Tobias. So we looked at a single storage node; now let's look at how we achieve scale. At scale there is going to be some failure in the fleet at any given point: failing nodes, failing disks, failing switches, and so forth. How do we handle that? The strawman solution is replication: a typical idea is to use three availability zones, put one copy per availability zone, and do a write and read quorum of two out of three. It's a fair idea; let's see if it works for all the failure cases. What about an AZ failure? There are still two out of three copies, because each AZ contains one copy and two availability zones are still up, so you can establish quorum and there's no data loss. What about an AZ+1 failure? What we mean by AZ+1 is that losing a node in any given AZ is possible at any point in time, so what happens if an AZ is down and, at the same time, a node is down in some other AZ? Essentially you've lost two of the three copies, which means you lose the quorum, and you can lose data.

So how does Aurora deal with that? We replicate six ways, as Tobias mentioned briefly: we use three availability zones for any database cluster, we put two copies per availability zone, and we use a write quorum of four out of six. Let's go through the same cases. What happens if an AZ fails? You still have four out of six copies: two availability zones are still up, each with two copies, so we maintain availability. What happens if there is an AZ+1 failure? You still have three copies. The volume is not writable, but there's no data loss. Why can't we write? Because we need four out of six. And why do we need four out of six? It's a long answer. The first part is why we need two copies per AZ, and we addressed that already: there is always some node failing. And four out of six also works well to hide tail latencies: when you have a large fleet of storage nodes there's always some SSD that's misbehaving, busy with wear leveling and so forth, and you can hide those tail latencies if you only wait for four out of six.
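Here is the quorum arithmetic as a quick check. The write quorum of four out of six and the two-copies-per-AZ layout are from the talk; the read/repair quorum of three out of six is not spelled out here, so treat that constant as an assumption.

```python
# Quorum arithmetic for the layout described above: 3 AZs, 2 copies per AZ.
COPIES, WRITE_QUORUM, READ_QUORUM = 6, 4, 3   # READ_QUORUM assumed
COPIES_PER_AZ = 2

def after_failures(lost_azs: int, extra_node_failures: int) -> dict:
    remaining = COPIES - lost_azs * COPIES_PER_AZ - extra_node_failures
    return {
        "copies_left": remaining,
        "writable": remaining >= WRITE_QUORUM,   # can still commit
        "readable": remaining >= READ_QUORUM,    # can still rebuild, no data loss
    }

# Note WRITE_QUORUM + READ_QUORUM > COPIES, so the two quorums always overlap.
print(after_failures(1, 0))  # AZ failure:   4 copies left, still writable
print(after_failures(1, 1))  # AZ+1 failure: 3 copies left, not writable, no data loss
```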
In that AZ+1 case we have lost three copies, but we can reconstruct the lost copies from the remaining peers, so there's no data loss and we recover write availability.

The next question is: if you have six copies, how do we get to the largest database size possible? We support 64 terabytes today, and you obviously cannot put 64 terabytes on a single SSD, so we use segmented storage: we partition the volume into n fixed-size segments. Could we support more than 64 terabytes? Architecturally it's possible, and we are working on increasing the size. Once you partition into these fixed segments, each segment is replicated six ways. So how big should a segment be? If it's too small, you're going to have too many segments and failures become more likely; if it's too big, repairs take too long. We experimented quite a bit and settled on 10 GB, and we can replicate a 10 GB segment within a minute.

So how do we do repairs? We use quorum sets and epochs. Here's an example I can walk you through. Assume there is a six-way replicated segment on machines A through F, all nodes are healthy, and this A-through-F quorum is known to the database node; the database manages this metadata as well, in terms of knowing that this protection group has six copies and where they are located. When log records need to be sent for a particular page, the database figures out which segment the page belongs to, it knows all of the segment's peers, which are given by this quorum, so it sends to A through F and runs the four-out-of-six quorum protocol, and everything is good.

We monitor all these storage nodes with pings, instance health checks, and so forth, and we might find that machine F is in a suspect state; it may or may not be working. In that case we create another quorum set for the same protection group by adding a machine G. So now there are two quorum sets: the current one, A through F, and a future quorum set with F replaced by G. In this state there are seven nodes in the protection group, and the database instance writes to both quorum sets and waits for four out of six from both of them. Essentially, if F is up, from the database's point of view it is writing to seven nodes instead of six. This is good because we can abandon either of the quorums at any point without losing any data. After some time the system can determine that F is actually unhealthy and should be removed, in which case we drop the first quorum, and we have cleanly moved a new node into the replication set. So we're being proactive, to make sure we don't have to rebuild a node when the situation is already acute. Correct.

So Aurora is continuously self-healing, and this is the procedure we use for hard failures — a machine is down, an SSD is broken, there's a network problem; all of that goes through the same single process. It's also used for continuous heat management: when you place segments across thousands of SSDs, depending on the workload some databases are heavily used, and because of the multi-tenancy you might see impact, in which case we move segments out to provide more bandwidth, more space, and so forth.
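The membership-change trick above (adding G while F is suspect) can be sketched as a predicate over acknowledgements: while both quorum sets are active, a write only counts as durable if it reaches four nodes in each of them. The node names and the set-based ack model are illustrative.

```python
# Sketch of writing against two quorum sets during a repair / epoch change.
WRITE_QUORUM = 4

def write_ok(acks: set, quorum_sets: list) -> bool:
    """A write is durable only if every active quorum set has >= 4 acks."""
    return all(len(acks & qs) >= WRITE_QUORUM for qs in quorum_sets)

current = {"A", "B", "C", "D", "E", "F"}
future  = {"A", "B", "C", "D", "E", "G"}          # F replaced by G

acks = {"A", "B", "C", "D", "G"}                  # F never answered
print(write_ok(acks, [current, future]))          # True: 4 of 6 in both sets

# Once F is confirmed unhealthy, drop the old set; only `future` remains.
print(write_ok({"A", "B", "C", "G"}, [future]))   # True
```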
So whether it's a hard failure, a soft failure, or a maintenance operation, everything goes through the same process. Tobias, do you want to talk about performance?

Yes. Let's take a look at the I/O profile and contrast what we've talked about, Aurora versus traditional MySQL. Most databases work roughly like this: here you have MySQL with a replica, and all of these arrows pointing in different directions have the colors you can see at the bottom. The log writes are the blue arrows, the binlog traditionally used for replication is the red arrows, actual data writes are the grey ones, yellow is the suspenders-and-belt writes, or double writes as we call them, and then there are the metadata file writes. Essentially, from the primary we're sending all of this I/O over to the secondary, on both sides we're writing to the log and the data files on the primary storage unit, and then we send it all again to a secondary storage unit in case the first storage unit fails. So there are quite a lot of I/Os here. In the I/O profile for a thirty-minute SysBench run, there were 780,000 transactions at an average of 7.4 I/Os per transaction, so the number of I/Os is 780,000 times 7.4.

Now if you look at the Aurora side, the colors differ quite a bit. The only thing that goes from the writable instance to the read replicas, as well as to the storage nodes, is log records. We send the log records, as we said before, to the six storage nodes so they can run the protocol we just saw, and we also need to send them to the read replicas, because the read replicas have their own buffer pools, and if something changes, all they need to update their buffer pools is the log records. So the only thing being written is log records; and then, if you look closely on the right, to Amazon S3 we write both the log records, the blue arrow, and the data pages, the grey arrow, and those are the checkpoints that let us recover quickly for point-in-time recovery. Here you can see the same 30-minute SysBench run on the same database instance size, and you instead get 27 million transactions at 0.95 I/Os per transaction, which makes sense: on average each transaction needs to do one I/O, for the log record. You can imagine this makes quite a difference for performance, especially on the write side.

We say Aurora MySQL is about five times faster than a traditional MySQL database. If you split this into writes and reads: on the left you see the graph for writes, where the blue bars are the throughput of Aurora and the other bars are the throughput of traditional MySQL, and it's actually way more than five times faster, which makes total sense, because the other one is doing roughly seven to eight times more write I/O, so you'd expect roughly seven to eight times more work, seven to eight times more latency, and seven to eight times less throughput. On the read side the difference isn't as great; we're still quite a bit faster, because we've optimized parts of MySQL in Aurora to handle the read throughput.
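The arithmetic behind those two 30-minute runs, using the figures exactly as quoted in the talk:

```python
# The multiplication behind the SysBench comparison above.
mysql_txns,  mysql_ios_per_txn  = 780_000,    7.4
aurora_txns, aurora_ios_per_txn = 27_000_000, 0.95

mysql_total_ios  = mysql_txns  * mysql_ios_per_txn    # ~5.8 million I/Os
aurora_total_ios = aurora_txns * aurora_ios_per_txn   # ~25.7 million I/Os

# Per transaction, Aurora issues roughly 7.4 / 0.95 ~= 7.8x fewer I/Os,
# which lines up with the "seven to eight times more work" comment above.
print(mysql_ios_per_txn / aurora_ios_per_txn)
```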
One thing I'll add is that when the storage system scales and performs like this, the database also has to keep up with it. What we had to do for Aurora was make quite invasive changes to the database engine: we built a new lock manager, we changed how we do threading and how we manage the work queues, and so forth, so that we could take advantage of the changes we made in the distributed storage. The storage is too good. Yes. Okay, let's talk about global databases.

For global databases we built this using physical replication: essentially we take the log records and ship them to a different region. Let me explain. The primary database instance sends log records to the storage nodes, to achieve the four-out-of-six quorum, and to its peers, the read replicas within the database cluster; it also sends them to a replication server, which takes them and sends them to a different region, to the replication agent there. The replication agent then behaves as a writer: it sends the records to the replica instances and also to the storage in that region. So you get a pipeline of log records flowing between regions, and we achieve durability in both regions: two different Aurora DB clusters, both behaving as if the writes were happening locally.

Do we send the log records in order over to the other side? Yes. And do you send them serially? We use multiple connections; WAN communication is very different from LAN communication, so we make sure the log records stay in sequence, but we send them in parallel and reassemble them. All the goodness you'd expect from an enterprise database, we had to build. And, as the small arrow shows, in case there is a break in communication — the database is down, or restarting, or the network is down and we can't send the log records immediately — the storage has captured all the log records, so the replication server can pull them from storage and send them.

So how does this perform? Because the replication server behaves just like another read replica within the database cluster, it's able to keep up; in this example the workload runs at 150,000 writes per second without much performance impact. Replica lag stays under a second, and you get fast recovery: less than a minute to accept read-write workload in the other region if there is a failure. That part is the same as existing Aurora logic, we didn't have to change anything: when you restart an Aurora database cluster, recovery happens in parallel because the storage helps with recovery; instead of the database pulling all the log records and doing redo and undo, we distribute that work over the fleet of storage nodes. So when you have to fail over to a new region, we recover quickly, because Aurora recovers quickly regardless of whether it's a global database or not. And it's a fully managed solution: you don't have to manage the replication server or the replication agent, that's hidden from you; all you say is "I want to replicate from this region to that region."
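Below is a conceptual sketch of the "in sequence, but sent in parallel and reassembled" part: records arriving out of order over parallel connections are buffered and applied strictly in LSN order. It illustrates the ordering idea only, not Aurora's wire protocol; the class and field names are assumptions.

```python
# Sketch: a receiving agent applies log records strictly in LSN order even
# though they arrive out of order over parallel connections.
import heapq

class ReplicationAgent:
    def __init__(self):
        self.pending = []          # min-heap of (lsn, record)
        self.next_lsn = 1
        self.applied = []

    def receive(self, lsn: int, record: bytes) -> None:
        heapq.heappush(self.pending, (lsn, record))
        # Apply every record for which we now have an unbroken prefix.
        while self.pending and self.pending[0][0] == self.next_lsn:
            _, rec = heapq.heappop(self.pending)
            self.applied.append(rec)        # forward to local storage + replicas
            self.next_lsn += 1

agent = ReplicationAgent()
for lsn in [2, 1, 4, 3]:                    # arrives out of order over parallel links
    agent.receive(lsn, f"rec{lsn}".encode())
assert agent.applied == [b"rec1", b"rec2", b"rec3", b"rec4"]
```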
So let's look at how it performs. In these graphs the left axis is queries per second and the right axis is the replication lag; the left graph is logical replication and the right graph is physical replication. For logical replication we used MySQL binlog replication in this case, and you can see that once you reach around 30,000 QPS the lag shoots up; it can't keep up. This is something we see in non-synthetic workloads as well: the binlog protocol is very heavy, so it doesn't scale. With global database, the QPS can go to 200,000 and the lag continues to remain low, below a second. Recently we also launched the ability to add multiple secondary regions, up to five, where previously you could add only one, and we launched the ability to promote any existing Aurora MySQL cluster into a global database, so if you have an existing database you can make it a global database.

Let's do a quick walkthrough. Here we have the primary region, us-west, where the primary instance is continuously inserting data, and us-east, which we use for reads. Data is flowing between these two regions continuously; let's see what the lag looks like. The insert is happening on the left, and you can see the timestamp, and at the same time we do a read on the other side. This is measured at the database level, not the CloudWatch metric the system shows: from within the database you can see that the replica lag is 110 milliseconds. Typically, across many regions, we see lag of less than a second. This feature is quite useful if you have a distributed application where you want read scalability. You can also use it for disaster recovery, because once the data is flowing to a different region you get continuous backup and restore there as well, so you can take snapshots, you can recover, and so forth — even if you don't use the global database for failover, you can use it for disaster recovery — and you can fan out read tasks that you might have outside the region.

Let's look at database cloning. Tobias? Yes, and I might add one thing first: the big difference in Aurora is that we can do all of these things in parallel. If you remember all of those steps we went through, with the gossiping: after the first two, once the record has been written durably, the database instance can continue, and you have this smaller network buffer, which means we can send things faster. The whole system works like this: each of the nodes just keeps receiving log records, keeps coalescing, keeps looking for holes, and as soon as there are no holes it can create the pages. Because the system works this way, you can do it at massive scale and in parallel, and that's really the biggest difference: there is low latency between a log record being created on the database instance and being sent to storage, and everything downstream can move in parallel. This applies both to the local replication and to the global replication, and that's why you can run at massive scale with low replica lag.

So, database cloning, what's that? Well, it's roughly what it sounds like: you have a database that's readable and writable, and you say "please give me a clone of this database"; pretty fast, you have a clone available that's both readable and writable.
There are two benefits here. One is that it's fast to create the clone; the other is that the clone shares storage with the source for as long as that storage hasn't diverged. If I have, say, a hundred-gigabyte or a 64-terabyte database and I clone it, the clone has access to all of that data, but it just points to the same pages in storage that the original database did. Because of this, the only storage you pay for on the clone side is whatever has diverged: if you start writing on the clone, we need to keep track of what changed, and you pay for those pages, but as long as the two haven't diverged, you only pay for what you actually use. You don't create another 64 terabytes and pay for 64 terabytes twice, even though you have two fully readable, writable clusters, and there's minimal to no performance impact.

You can use this for many things. It might be that you want to do something risky and you want to practice first: create the clone, run through it, see what happens. You may want to run tests, so you create a clone. Or you may have a template database with a bunch of gigabytes or terabytes in it that you want to use as the basis for a new database: you create a clone of the template and run off the clone. All of these are valid use cases, and you can do it across accounts as well: if you have a production account and a test account, and you don't want to use the production account for testing, you can clone across accounts. But then do you pay on both sides? No, the storage still references the shared pages, so you still only pay once.

Let's see how this works. We have a source database, and you can see the blue pages 1 through 4; we just created a clone, and it has the same pages, because nothing has changed. On the storage side we have two protection groups; again, a protection group is a set of six storage nodes, that is, six segments written to six different storage nodes. Pages 1 and 3 are on the first protection group, and pages 2 and 4 on the second. Now let's see what happens when it diverges. We make some changes on the source database: we modified page 2, so there's now another version of that page, and we also created page 5; you can see the new page 5 and the new version of page 2, while the other versions are still there. The clone database can also diverge, and again, you only pay for the union of the two: the source, plus whatever divergence happened on the source and on the clone.
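A minimal copy-on-write sketch of that sharing model, assuming page-level granularity purely for illustration; the real system tracks versions per segment and protection group.

```python
# Copy-on-write cloning sketch: the clone starts out pointing at the parent's
# pages and only stores its own copy of a page once one side changes it.
from typing import Dict, Optional

class Volume:
    def __init__(self, pages: Optional[Dict[int, bytes]] = None):
        self.pages = dict(pages or {})      # page_id -> reference to a page image

    def clone(self) -> "Volume":
        return Volume(self.pages)           # share the page references, copy nothing

    def write(self, page_id: int, data: bytes) -> None:
        self.pages[page_id] = data          # this volume now diverges on that page

    def diverged_bytes(self, parent: "Volume") -> int:
        """Storage paid for on the clone: pages that no longer match the parent's."""
        return sum(len(v) for pid, v in self.pages.items()
                   if parent.pages.get(pid) is not v)

source = Volume({1: b"p1", 2: b"p2", 3: b"p3", 4: b"p4"})
clone = source.clone()
clone.write(2, b"p2-changed")               # only page 2 diverges
print(clone.diverged_bytes(source))         # pages 1, 3, 4 are still shared
```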
Let's walk through an example. Here we have a database that we loaded with a bunch of transactions, so we have about 40 million rows — I still don't know why it's "about": there are four rows missing from 40 million. (Next time we should get it to exactly 40 million.) It used about 25 gigabytes of space; you can see the CloudWatch metric went up to 25 gigabytes. So I go to the console, pick the database, and say I'd like to create a clone. Now I get a clone, and at the top you can see we're querying the clone to check that it has almost 40 million rows — when we worked on this we made the font small enough that you can't see it, but to be honest it's not exactly 40 million, it's 39,999,996. When I look at the CloudWatch metrics for the clone, though, the only space being used on the clone side is about a hundred megabytes. Then let's go and update about 10 million rows — again not exactly 10 million, I think it was slightly above this time, which kind of compensates for the four rows that were missing before. Now, because we diverged, the clone needs to store a bit more information, so it's at 1.2 gigabytes instead of 100 megabytes. Murali, tell us a bit about backtrack.

Right. Let me first add a couple of things about clones. When we create a clone, we reference the segments of the parent database, and when a new segment is created for the clone we make it resident on the same SSD. Looking at the database cluster, you can create an unlimited number of clones: you can take a database and create clone after clone after clone; there is no limit. Technically, on the back end, we cannot place all of those segments on the same SSD, and we handle that transparently: we look at the utilization, the number of segments, the space used and so forth, and we move segments to different SSDs to spread the load. So from the DB cluster perspective you can create unlimited clones, and the storage system handles it regardless of how many you create.

Now let's look at database backtrack. Backtrack is a feature we shipped that is heavily storage-dependent: you can quickly bring the database to a particular point in time without having to restore from backups. This matters when, for example, you unintentionally drop a table or delete a bunch of rows in your production database. The previous approach people typically used was point-in-time recovery; with backtrack you can simply rewind the database, you can do it multiple times, and every time you rewind, the database is still writable, so you can continue writing. In this example the database moved along to time t1 and eventually reached t2, and you decide that t2 is not the state you really want, so you want to rewind to t1. What has happened is that log records and page versions have been generated up to t2, and the storage knows the log sequence numbers and the page versions that have been created. When the customer asks to rewind back to t1, we make the log records and the page versions between t1 and t2 invisible; essentially we create a tree of LSN space. Then time moves forward again, we reach t3, and at t4 the customer asks to rewind the database again, this time to t3; the same thing happens. So essentially we hide log records: you end up with a tree of log records, and you can always jump back into an invisible zone, which means jumping onto a branch of that LSN tree, and we can show the database as of that point. All of this happens in parallel: if you have a 64-terabyte database, there are thousands of segments that are backtracked in parallel, the database is brought to a consistent state, and you can start writing.
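Here is a deliberately simplified sketch of that bookkeeping: rewinding flips visibility on the records after the target point rather than deleting them, and new writes continue from there. The real system keeps a genuine tree of LSN ranges per segment; this flattens it to a single visibility flag just to show the idea.

```python
# Simplified backtrack bookkeeping: hide, don't delete, the records after
# the rewind point; the database stays writable afterwards. Illustrative only.
class BacktrackLog:
    def __init__(self):
        self.records = []                 # list of [lsn, visible]

    def append(self, lsn: int) -> None:
        self.records.append([lsn, True])

    def backtrack_to(self, target_lsn: int) -> None:
        """Make everything after target_lsn invisible instead of deleting it."""
        for rec in self.records:
            if rec[0] > target_lsn:
                rec[1] = False

    def visible(self) -> list:
        return [lsn for lsn, vis in self.records if vis]

log = BacktrackLog()
for lsn in range(1, 7):                   # writes up to "t2"
    log.append(lsn)
log.backtrack_to(3)                       # rewind to "t1": LSNs 4-6 become invisible
log.append(7)                             # still writable; a new branch begins
print(log.visible())                      # [1, 2, 3, 7]
```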
How far back can I backtrack? Good question. The customer has the option to set the time interval: you configure the backtrack window, and today we support up to 72 hours, and you can backtrack to any point within that period. When would you use this? You can use it in a disaster case, and there are also cases where you want to iterate on a schema change: you want to make a big change to your database, and in production you don't want to do it right away, so the typical approach is to take a clone of your production database, make the schema change there, and see whether it works well for you; if you don't like it, you just go back in time, make another change, and so forth. Backtrack is very quick, it takes less than 10 seconds regardless of the size of the database, so you really can go back and forth. So I can clone, and then keep backtracking back and forth? Yes, you can go back and forth.

Let's do an example. In this case there is a table with approximately 10,000 rows — and in this case it is exactly 10,000 rows. (Just because you were talking: I promise you, if it had been me, it would have been something like 10,002.) We're looking at the top five rows in descending order, and the timestamp is 7:05; this is the current state. Then we modified the schema: we added a new column, c1. We're not showing that column in the output, we only show column c, but the statement is right there, and we added two rows, with 100,001 and 100,002 as values for that column. It would have been too easy if we'd shown the column. Yes, this is a geeky demo. Then you realize this is not the state you want to be in. So what do we do? We go to the console — there are also APIs for this, but for simplicity I'm showing the console — select the database cluster, choose backtrack, put in the time, 7:06, and in a few seconds you get back the good old state. So I can backtrack schema changes, drop tables, creates, inserts? Yes, as long as the backtrack window is configured correctly, you can go back and forth. Is backtrack on by default, or do I have to turn it on? Today you have to turn it on, and you configure how many hours you want.

The same technology we've been talking about — how we store the log records, how we store the different page versions and so forth — is exactly the technology we use for backtrack; it's just that we don't garbage collect the log records and page versions, we retain them for as long as the backtrack window requires. So backtrack impacts your bill a little bit? Right, based on your backtrack window you are charged for the storage needed to provide that functionality. Internally, we talked about copy-on-write clones; backtrack uses the same copy-on-write technique, where page versions are maintained: when a new version is produced, we decrement the reference count of the previous version, create the new version, and so forth. And if a page doesn't change on the storage node, multiple copies are not retained across the LSN tree; it's just referenced.

For references, we wrote a couple of SIGMOD papers, and I highly recommend looking into them; much of this material is explained there in greater depth — how we do persistence, how we do commits, distributed recovery, and so forth. That's it, thank you. [Applause]
Info
Channel: AWS Events
Views: 7,403
Keywords: re:Invent 2019, Amazon, AWS re:Invent, DAT309-R, Databases, Amazon Web Services, Amazon Aurora, Amazon RDS
Id: uaQEGLKtw54
Length: 49min 51sec (2991 seconds)
Published: Wed Dec 04 2019