AWS re:Invent 2019: [REPEAT 1] What's new in Amazon Aurora (DAT202-R1)

Video Statistics and Information

Captions
All right, let's get started. First of all, thank you for coming to a Friday session. I've been looking at the scores from the various sessions, and it turns out the repeat sessions always do better — I guess the speakers get a little practice and read the comments and try to improve — so I'll try to do the same, but I can't promise too much.

My name is Tony Petrossian, and I'm the GM for Amazon Aurora. Let me tell you a little about the organization so you get an idea of how things work at AWS. The Aurora team has several parts: an engineering leader for Aurora PostgreSQL, an engineering leader for Aurora MySQL, a leader for the Aurora control plane — all the workflows and coordination that keep things up and running, upgrades, clusters, and so on — and a leader for release management, because as you can imagine there's a complex set of versions and releases out there. We also have a director of engineering, Murali, who did the deep-dive session on storage; he's the director of engineering for all of Aurora and the person who brings it all together. I'm just the person who cracks the whip — I don't do a lot of the engineering work.

In this session we'll do a very quick intro and then dive a little into the Aurora architecture, because I want to tie the specific improvements and new features to that architecture and explain why they're interesting. Then we'll talk about the new stuff. How many people use Aurora? Good group. Aurora PostgreSQL? Aurora MySQL? How many people are new to RDS in general? See, there's a reason to do a little bit of introduction.

Aurora is our database for MySQL and PostgreSQL. It's compatible with PostgreSQL and MySQL, but it's really intended to simplify running very large, high-throughput workloads, with availability, durability, and security built into the infrastructure itself. We try to make it really easy to use MySQL and PostgreSQL for high-throughput, large workloads, with a simple pay-as-you-go model. That's the general tagline for Aurora.

Now, how is it different from just running PostgreSQL or MySQL? Let me say a few things about databases in general. A database engine generally has four layers — this is a gross simplification. At the top is the SQL layer, which takes the connection, parses the SQL, and optimizes how the query is going to be executed. Below that is a transactional layer responsible for the ACID characteristics of a relational database: atomicity, consistency, isolation, and durability. Below that is a caching layer, which takes data pages from disk and caches them in memory, because databases tend to reuse data a lot and memory helps reduce I/O — a model that's been around for decades. And at the bottom is the layer that actually deals with the storage and logging of the database.
If you have a database with terabytes of data, you can imagine there's real processing involved in getting data on and off disk and managing that whole affair. Traditionally you'd run a database on a machine, maybe with a SAN to separate out the storage, but in general those four layers are stuffed into a single engine — a single executable running on some combination of compute and disk. That's how databases have worked.

With Aurora we took those components and split them up. The storage went into a distributed, shared storage service built for Aurora, and the compute is what you provision as a MySQL or PostgreSQL instance that you interact with; the caching is separated out as well, because caching isn't tightly coupled to the storage. The distributed storage presents a database volume, which the instance running in compute simply sees as its storage. That distributed storage layer gives you all kinds of interesting characteristics: it manages change records and the application of change records to data pages; it manages replication, detection of failures in disks, and healing of whatever parts of the storage are broken; it maintains six copies of the data across three Availability Zones, so the data is always available and always durable and nothing is lost; and it manages the entire database as distributed 10 GB chunks spread across hundreds or thousands of machines.

We picked 10 GB because it's a nice size — small enough that we can do things fast. Moving boxes is a lot easier than moving a grand piano. These 10 GB chunks give us faster replication, faster healing, and faster recovery from failures, and you can keep adding them forever, which is how you get scalability up to 64 TB, with 128 TB coming soon. We just add 10 GB chunks as your database grows, so you don't have to allocate storage or do anything as it grows, and the chunks get distributed across lots of machines to give you the throughput your database needs. We also take all of the change records and backups and put them in S3, because S3 is a very good place to keep things forever at a reasonable cost — when you do point-in-time recovery, or you have backups with a 30-day retention, that's usually where they end up.

Whenever you talk to Aurora people they get very excited about this architecture, and a lot of customers ask, "why do I care?" So let's talk about why we actually did this — not for its own sake, but because of customers. We've had RDS for a long time — we just had the ten-year anniversary of RDS and got some nice t-shirts for it — and over the years we've seen customers demanding more and more from their databases: financial services, commerce, entertainment, media, gaming, healthcare, automotive, manufacturing, logistics, medicine — pretty much every sector and every type of business we encounter. They're building really complicated applications to serve their users, and their user bases are growing.
Regardless of sector, our customers tell us the same thing: we're building applications that run 24/7, and we need them to scale because we have 500 million customers. I don't know if you saw some of the demos in other sessions, like Mario Kart — hundreds of millions of people play Mario Kart all the time. When is a good time to take the database down? Never. It's similar with education: people go to school at night, they self-study, every university has online courses. When is a good time to tell a professional, "sorry, the website is down because we're doing maintenance on the database"? Never. Across all of these sectors, the requirement is a database that's up 24/7, uninterrupted and unhindered by administrative actions.

So what does this model do for us? In the Aurora storage layer we do real storage processing — not just "put some stuff on disk and get it back." The data processing for the relational database engine is pushed into this distributed layer. Here are some examples of what happens at the storage layer. Instant crash recovery: the log records are continuously applied to pages in the distributed storage, so crash recovery is fast because you don't spend five or ten minutes doing redo the way a traditional database does. Fault tolerance: when we detect that something has failed — say one of those 10 GB segments has gone away — we create a new one and bring it back. Creating database snapshots, backups, point-in-time recovery, and clones. Read and write scalability. A cache that survives a restart of the database engine, because we've separated the cache layer out. And low-latency replication — all the replication work. All of this happens in the distributed layer, which is not part of the instance you use to execute your transactions and SQL.

That's why, when we clone your database, or you clone it yourself, or you take snapshots, or you perform any administrative functions, you don't see an impact on your transactional workload. Gone are the days when you had to figure out how much capacity you needed so you could fit a backup window between 2:00 a.m. and 4:00 a.m.
on Saturdays just to get backups finished without impacting the production workload. Who has ever heard of doing backups in the middle of the day, during the busiest times? Nobody did that — but you can do that now. You can create clones that have zero impact on your production workload.

When you look at how people use databases, it's not just about writing an application. You write the application and then you operate it for the next five years, and you have to think through how upgrades work, how backups work, how recovery works, and what your disaster recovery strategy is. We built the Aurora distributed storage model so we can push all of that down to a layer where we can throw as much compute and storage as necessary at those functions without impacting the instance that sits on top. That brings us to the overall Aurora promise: you get performance and scalability, availability and durability, the security functions you need to build your most critical applications and workloads, and it's fully managed for you. That's why we built Aurora the way we did, and that's the end of my introduction. Keep this picture in your head, because the newer features relate back to this architecture.

So, what's new? I'm going to talk about things we've done over the last year. Some of you may have heard of them or been to sessions over the last few days — how many people have been to Aurora sessions already? A handful, so I apologize for some repetition; this being Friday and the last session, you can't really avoid some overlap.

Let's dive into global database. What's the purpose of global database? Aurora runs in an AWS Region, and within the Region there are three Availability Zones, with all the storage distributed across them. But we have customers who say, "I need 800 miles between my database replicas," because of regulatory requirements, or corporate governance rules, or simply as part of their risk mitigation — customers have different risk tolerances. So we built global database, which lets you replicate an Aurora database across Regions. It came out some time ago, and originally you could make one replica. What's new is that you can now create up to five Regions that host your database and are replicated, and you can create a global database from an existing database, whereas before you had to create one from scratch — if you already had a database, you couldn't add Regions to it, and now you can. Before the end of the year — hopefully within a few days — this will be available for Aurora PostgreSQL as well as Aurora MySQL.

So what does the picture look like with respect to the architecture we talked about? Replication moves the data from wherever your primary is — where you do your reads and writes; in this picture, Ohio — where you have a primary, shown in red, and a couple
of read replicas. Whatever changes are made get pushed down to the storage layer — the storage layer manages everything — and the replication between Regions is also managed by the storage layer. One benefit is that you don't bear the compute, memory, and space cost of holding the backlog of changes that need to be replicated, or burn the networking and other resources needed for replication; all of that is pushed down, so replication has very little impact on your workload beyond the fact that you now have copies of your database in other Regions.

Those replicas are read-only: all changes get propagated, but you can take one of the read-only Regions and turn it into your primary, read/write Region. That gives you fast failover — under a minute to fail over between, say, Ohio and Oregon, or Ohio and Northern Virginia. And as you add replicas, going from one to two or three or four, you see no impact on your workload, because again those tasks are pushed down.

The other interesting thing: if you're doing replication, physical or logical, it's probably because you're building for disaster recovery or some kind of durability — you're assuming there might be a failure at some point; otherwise why have it at all? So what actually happens when something fails? What if a Region is down, or the network is down for five hours — what happens to all the changes? In a traditional setup, you hold those changes on your primary instance and hope the network comes back so you can catch up. With Aurora, all of that lives in the storage layer. You could be disconnected for a day without impacting operations in your primary production Region, because the changes are stored — and the changes we store for replication are the same changes we use for backups and point-in-time recovery, where you can have up to a 35-day retention. We already store all the changes anyway, so this infrastructure maps easily onto new features like global database: it's just another use for the change records we keep in storage.

The other scenario customers use global database for, besides disaster recovery, is that those replicas in other Regions are readable. You can run queries against them and use them for any kind of read activity, so you can localize your reads. Say you have a big product catalog that's maintained in Ohio and replicated everywhere else: in the other Regions you can read the product catalog really fast. Maintenance of the catalog is centralized in one place where you do the writes, but the reads are distributed everywhere, so you get fast local reads. That's another scenario for global database.

All right, on to another new thing. Cloning has been around for some time, but we recently added cross-account cloning. Cross-account cloning matters because a lot of customers told us they don't clone a database for developers and testers to play with in the same account as production — because that would be insane.
So, they asked, please let us do clones across accounts. With this, you can take a database and clone it every morning into a different account used for your analytics workloads, so people can run big queries there, or clone it for dev and test work. These clones have no impact on your production workload, because we don't even copy the data, and they don't cost anything until the clone starts to change. Cloning is like a fork: you fork into two copies, but nothing happens unless something changes. All the clones point at the same data until the data changes — that's when they diverge. For analytics, if you're not changing anything, you're just creating a clone that gets mounted on a different instance, and you can hammer it all day long. You can also make a clone on a different instance, drop the data you don't want visible in analytics, and make that clone available for query processing. Being able to do this across accounts is what makes it really useful for scenarios where the clone serves a different purpose than the original production workload. People have used clones for testing upgrades and the like for a long time; we think this helps people use clones in other production scenarios as well.

Another interesting thing that's been out for a while, but which we've now added PostgreSQL support for, is Aurora Serverless. Aurora Serverless is really cool: you have an Aurora database, you use it for an hour or a few hours, and when you stop using it the instance effectively goes to sleep and goes away. But because the storage is managed completely separately, the storage layer stays active. Say you have a serverless instance, you use it, you make a bunch of changes, and then you stop and the instance goes away — backups still happen, and if something goes wrong with the storage it still gets fixed, because the storage layer takes care of it. You never have to bring the instance back for storage-related maintenance. The next day, when the instance wakes up because you're doing work again, it doesn't have to go catch up on backups and everything else — that's all been handled.

The way the serverless piece works is that we keep a pool of warm instances of various sizes. As soon as a connection hits our request router, we take one of those warm instances, attach it to the volume in the storage layer, and off you go. It's a nice model, because lots of databases aren't used 24/7 — maybe you do data processing for some business activity five times a week, or once a week, or three hours a day — so it's a very cost-effective way of doing that work without paying for instances that sit around most of the day doing nothing because you only have four hours of processing to do. It's also really useful for development and testing. I can't count the number of times I've personally spun up an instance and then forgotten about it because I ran some
tests, went to lunch, came back, and forgot to drop it — and then at the end of the month I get my bill and go, hmm. With serverless, those instances just go away automatically, so you don't end up paying: if you work for eight hours a day, you pay for eight hours a day and then it goes away. It's also great for workloads that are very unpredictable — sometimes big, sometimes small — where the auto-scaling capability of serverless really helps. Like I said, it's now available for PostgreSQL as well as MySQL, so if you haven't tried it, give it a try. It's had incredible adoption, by the way — the number of Aurora instances that are serverless is a little more than we expected — and it's really cool.

All right, let's talk about something that's really new — we just announced it on Tuesday: integration of machine learning with databases. If you think about AI and ML, the entire thing revolves around data; everything that enables machine learning is about the data, so integrating AI/ML with the database is a natural thing. Today you build a model that predicts something and train it on some data, and then you write an application that connects to the database, pulls out a record — say a credit application for Tony — runs it through the model, and the model says Tony is not creditworthy and predicts he may not pay it back, so you decide to reject. The execution of that model happens outside the database, which is fine, but as an application developer you have to go to the database, get the data, figure out how to format it for the call to the ML model, make the call, get the results, and format them again for your output. There's a lot of machinery involved, which makes it inconvenient. Running an ML model on a row of data isn't as simple as calling SELECT MIN, MAX, or AVG — functions you just call.

So we thought: what if it were that simple? We have SageMaker and we have Comprehend, and if we could integrate Aurora with SageMaker and Comprehend in a way that lets you access ML functions from SQL, a lot of database developers would be quite excited. These are the problems we were trying to solve: allow the use of AI/ML from the database in the language of the database developer, not the data scientist — you have a conversation with data scientists and they start talking about their Jupyter notebooks and you lose track of what they're saying — so how do we do it in the database language, which happens to be SQL, and how do we do it with high performance?

Here's what it looks like. We now have a function in Aurora, aws_comprehend_detect_sentiment, and that function works like MIN or MAX or AVG or any other UDF you might have. So you can do a SELECT from your user feedback table WHERE the sentiment function, applied to the feedback text column with English as the language, says the comment is positive. Imagine we collect all your session feedback and store it in a database, and I go see how many of you made positive comments and how many made negative comments about this session — and I'm sure you'll all make very positive comments, because this is a great session, right?
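Roughly, the query described above might look like this with Aurora MySQL's ML integration (aws_comprehend_detect_sentiment is the built-in function named in the talk; the user_feedback table and feedback column are illustrative):

    -- Find positive feedback using the built-in Comprehend sentiment function.
    -- Table and column names are hypothetical.
    SELECT feedback,
           aws_comprehend_detect_sentiment(feedback, 'en') AS sentiment
    FROM user_feedback
    WHERE aws_comprehend_detect_sentiment(feedback, 'en') = 'POSITIVE';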
The way it works in Aurora: your application runs that SELECT statement, the Aurora engine runs the query needed to produce the data — the user feedback column from that table — takes batches of rows, formats them appropriately for Amazon Comprehend's sentiment analysis, gets the results back, and repeats until all the rows are satisfied. Then you get the response in your application the same way you'd get the result of a SUM or a MIN or a MAX: a column in the results saying positive, or a count, or whatever you asked for.

Something like Comprehend's sentiment analysis is built in, so there's no setup for that function, but we also have the APIs for SageMaker. You can build your own model — any model you want — train it, run it in SageMaker, and create a UDF in Aurora that points to it. Again, Aurora does the work of formatting, making the call, using the batch APIs, getting the results back, and returning them. One of the cool things about this is the separation of compute for the ML workload from the database workload. You might have a model — for credit analysis or something else complex — running on a hundred-node cluster with GPUs, and you don't want to put that kind of compute in your database just to do ML. So a small database can call a giant ML cluster to execute a model and get the results back. You can scale these things separately: the compute for the database, the storage for the database, and now the ML infrastructure for model execution.

This is available in Aurora MySQL, and it will be available for Aurora PostgreSQL before the end of this year — the next version of Aurora PostgreSQL that comes out will have this capability. I'll take one question — yes, about pricing. The way it works is that there's no additional fee from Aurora, because we're not doing anything extra; we're just making the call. Whatever the pricing is for Comprehend and Amazon SageMaker is what you pay. Whether you choose a giant SageMaker cluster or a small one, we don't look at that; it's priced separately, and each service does its own thing. I think this ML integration is really interesting, because the way it's integrated simplifies things for SQL developers.
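For the SageMaker side, a minimal sketch of what wiring a SQL function to your own endpoint might look like, along the lines of Aurora MySQL's SageMaker integration (the function, columns, and endpoint name are all made up for illustration):

    -- Hypothetical UDF that invokes a SageMaker endpoint; Aurora batches the rows.
    CREATE FUNCTION assess_credit_risk (amount DOUBLE, income DOUBLE, history_months INT)
    RETURNS DOUBLE
    ALIAS AWS_SAGEMAKER_INVOKE_ENDPOINT
    ENDPOINT NAME 'credit-risk-endpoint';

    -- Called like any other SQL function:
    SELECT application_id,
           assess_credit_risk(amount, income, history_months) AS risk_score
    FROM credit_applications;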
Another thing we've announced is Amazon RDS Proxy. It's in preview, and the PostgreSQL version will come out for preview shortly, probably early next year. The idea behind the proxy: you have applications, maybe application servers, that create lots and lots of connections to the database — perhaps because you have lots of users — but those connections aren't all active at the same time. You might have a thousand users connected to an application that's connected to the database, and while a user is off having a coffee you're still paying for that connection sitting in your database. The proxy multiplexes those connections so that fewer active connections reach the database, and the database isn't impacted by opening and closing lots and lots of connections. It does the connection pooling, and it auto-scales the number of connections to the database to match the actual work being done rather than the raw number of client connections.

It also lets you open connections from the application and keep them open, so the application doesn't go through connection setup and authentication all the time. Those connections are open to the proxy and aren't really impacting the database, so you can have, say, 50,000 open connections to a MySQL database while the MySQL database only sees a thousand connections actively doing work.

Another benefit is that we've built some security mechanisms into the proxy. The credentials needed to connect to the database are managed at the proxy: you store the database credentials in Secrets Manager, the proxy uses Secrets Manager for those credentials, and you create IAM roles for the application to connect to the proxy. You no longer have to share database credentials with the application or put passwords in app config files. That separation makes it a lot easier to manage security and build more secure applications, because you've centralized the management of database credentials independently of granting an application access to the proxy. You say, "I grant this application access to the proxy"; the application gets a token from IAM, connects to the proxy, everybody's happy, and the application has no knowledge of the database credentials.

One of the reasons we build things like this is that we see a tremendous increase in the number of serverless applications people are building. A few years ago, if you built an application with app servers, you tightly controlled how many app servers there were, how many connections each made, and the connection pooling from the app servers to the database. All of that goes away if you throw a bunch of functions into Lambda: something triggers, and suddenly you have 4,000 Lambda jobs running and wanting to connect to the database. You get this burst of activity from Lambda that hammers the database. In this case the proxy takes the brunt of it — connections coming and going, that dynamic open-close, open-close nature happening all at once.

The proxy is fully managed. You don't spin up instances or choose instance types and sizes; we look at the traffic coming from your application and at the database you have — if you have a tiny database, we assume you're probably not going to try to push a billion connections through it — so the proxy is sized appropriately for your database, auto-scales, and manages all of that for you. You enable it, and off you go.
It launched on Tuesday, I believe, so you can go give it a shot, and like I said the PostgreSQL version will come. This is the beginning of a journey — it's in preview and we're going to add a bunch of additional features — so it's important to share your feedback with us. We build this stuff for you, not for us, and knowing what you need, what works, and what doesn't is always appreciated.

How many people use Performance Insights? Not enough. Performance Insights is our mechanism for giving you information about your databases. The mechanism is there, we now have recommendations that tie into Performance Insights, and this is how we'll continue to add more and more of the information you need to build your applications, manage them, and know what's going on with your databases. We're investing heavily in Performance Insights, and it's our path forward: every time you think "I wish I knew what was going on with this database," whatever metric you want to see, we're going to enable it through Performance Insights. We've made some new improvements — you can now get SQL statistics for queries and see what's going on. A bunch of these improvements landed in PostgreSQL, and we'll add them for MySQL; the work gets staggered a bit — sometimes something shows up in MySQL first and then PostgreSQL, sometimes the other way around — because it's not always possible to do everything at once, so instead of waiting we ship opportunistically. But I want to make sure everybody knows about Performance Insights and that this is our path going forward for surfacing actionable insight about your databases. If you use it and like it, or don't like it, share your feedback; it makes it easier to iterate.

We also have database activity streams. This is a really important piece for organizations that need to know who touched the database, what they did, and when they did it — your audit log. Traditionally you go to your database and turn on audit logging, and if you have a high-throughput transactional database the audit log fills up instantly, because it generates a record for everything that's done. It's usually a massive amount of data that ends up in a file, and then you have to figure out how to process it, so enabling audit logging has been complicated. We wanted to make it a lot easier. The way it works is that all the activity on the database — Bob did a SELECT on Tuesday, Mary did an INSERT on Wednesday — is sent as a stream into Kinesis, the stream-processing service, where it's persisted and you can keep it for a long time. Then you can do all kinds of interesting things with it. For example, you might say, "I want all the audit records, but I'm not going to look at them unless somebody makes me," and just point Kinesis at S3 and dump everything there. Or you might run an Athena query over what you dumped into S3, or spin up a Redshift cluster, create an external table pointing at S3, and do some analysis of your logs. Or maybe you take
that Kinesis stream and point it at a third-party database activity monitoring system, to get consistent reporting of database activity across your Aurora databases and any on-premises databases that use the same system. The idea was to simplify things so you don't have to manage the audit data yourself, and so that auditing doesn't impact the database so much. Obviously, generating the audit records takes some resources on the database itself, but we get those records off the node as fast as possible, so the database only bears the generation, not the storage, processing, and propagation. This works for PostgreSQL today, and we'll have it working for MySQL early next year — like I said, I go around beating people up to get this stuff done, so hopefully it will be soon. That's database activity streams.

On to other new things. We've had read replicas for a while — everybody's familiar with Aurora read replicas; you can have up to 15 and run your read workloads against them. Nothing new there. The interesting thing is that the number of customers who run just two instances — one read/write and one read-only — is huge. So we said, for this large group of customers with just two instances, why don't we let them turn both into read/write? That's what we did. If you're running two instances, you can try making both of them read/write, so you no longer have to make the distinction between the reader and the writer. You can still use them exactly the way you do now — send the read workload to one and the read/write workload to the other — but the benefit is continuous availability for writes: if one goes away, you just keep writing to the other, which makes failover even faster. Today, when you have a read/write replica and a read replica and you fail over, the read replica has to do some work to become a writer; in this model the other replica is already read/write and things just move on.

You can also do some interesting scaling. Say you have ten tables split between tenant A and tenant B: you can do all the writes for tenant A's tables on the left instance and all the writes for tenant B's tables on the right one. Those are non-conflicting writes, so they proceed at normal speed, because there's no conflict to coordinate when you're writing to different tables. If there is a conflict — you happen to write the same row from two different instances — the Aurora storage layer accepts the one that succeeds and rejects the other, which gets an error saying the write couldn't be serialized. It's optimistic concurrency: there's a winner and a loser, and the loser retries. That's basically how multi-master works.
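A rough sketch of that winner/loser behavior, with made-up table names and values (the exact error message depends on the engine):

    -- Session A, connected to writer instance 1:
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;
    COMMIT;   -- commits first, so it wins

    -- Session B, connected to writer instance 2, touching the same row at the same time:
    UPDATE accounts SET balance = balance + 50 WHERE account_id = 42;
    COMMIT;   -- rejected with a write-conflict (serialization) error; the application retries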
Again, this is something we started doing not too long ago, and the plan is to keep investing in this path and add more and more writers — we'll add the capability as we go. If you've tried it and have feedback, let us know; if you haven't, give it a try. The basic scenarios are: if you have a read/write replica and a read replica, you can switch them both to read/write and keep working the way you do today, or you can split read/write across different workloads in the same cluster and get better write availability. If you have eight replicas and one writer and you want eight masters, you'll have to wait a while until we make sure it all works and performs.

What else have we done recently? Another interesting thing is federated query for Athena. A lot of our customers are building applications where, in the past, the application had different kinds of data that all went into the same database for convenience. As those applications move to the cloud, the developers look at them and say, "I'm selecting a blob object from a database in my app, and that's not the best use of money — if that object were in S3, I'd get better price-performance for large objects, and writing a get on an object is just as easy as writing a SELECT from a table." Pretty soon, the large binary objects that used to be stored in the database start migrating to S3, because that's where you get the best price-performance for large object storage, plus a really good set of APIs for put and get — probably better than any relational database could ever give you. Some data might move to a database like DocumentDB, because you have giant documents you want to index differently. What ends up staying in the relational database is the business logic and the core transactional workload that really needs relational capabilities. As data gets diversified across different backend engines to give you the best price-performance, you end up with data in multiple places — so how do you query something that happens to live in three places?

That's where this federation comes in: we're integrating with systems that are capable of hitting multiple backends. With Athena federated query, you can literally run a single query that touches DynamoDB as well as Redshift as well as Aurora — MySQL or PostgreSQL — do the joins in Athena, and get the results you want. You aren't actually moving the data, so you don't have to do the continuous ETL that some do to make sure copies of the data exist everywhere. There are lots of scenarios you can enable once you have federation.
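As a hedged sketch, assuming data source connectors have been registered as Athena catalogs (the catalog, schema, table, and column names below are all illustrative), a federated join might look like:

    -- Join DynamoDB order data with customer data in Aurora PostgreSQL, entirely from Athena.
    SELECT o.order_id, o.order_total, c.customer_name
    FROM   ddb_catalog.default.orders o
    JOIN   apg_catalog.sales.customers c
           ON c.customer_id = o.customer_id
    WHERE  o.order_date >= DATE '2019-01-01';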
I'm going to talk about another one of these: Amazon Redshift federated query with Aurora. This one just launched; right now it works for Aurora PostgreSQL, and the MySQL version is coming. Lots of people use Redshift for data warehousing and analytics: you have data in relational databases, you ETL it out and put it into Redshift, partly because you don't want historical data sitting in your relational database causing bloat for your production application. You might also take really old data and put it in S3 — data that's ten years old that you don't want to throw away, because somebody might need to query it someday, but that you also don't want to pay to keep in an active database. So maybe the ten-year-old data moves to S3, the last three years — or seven or eight, depending on the requirements of the business — end up in Redshift, and the active, up-to-the-minute data lives in, say, Aurora PostgreSQL.

Now, what if you want to run a query like "give me the revenue for last year, last month, or the last ten years by quarter," and then at some point say "I want the revenue up to the last second"? That last part requires going to the active system. With this feature, you can create external schemas in Redshift that reference tables in Aurora, so a table in Aurora is now accessible from Redshift, and you can create a view in Redshift that unions the tables in S3, in Redshift, and in Aurora PostgreSQL. Maybe most of your analytics workload just hits the tables in Redshift, but every once in a while somebody needs to run an audit query over data from ten years ago, and they run the same query against the view that includes S3. And every once in a while somebody wants the up-to-the-second data, and they hit the view that includes the PostgreSQL table — or reference the PostgreSQL table directly from Redshift. Whatever infrastructure you have in your analytics system is available for joins and aggregations with the active data living in Aurora PostgreSQL.

These federation mechanisms are really useful when your data ends up in different places and you don't want to copy and ETL it back and forth all the time — you just leave the data where it is and do the work. The other interesting thing is that your active production data in Aurora might have different compliance and security requirements than the historical data. Because the data's security and compliance are managed by Aurora, whatever policies you apply there still govern it even when you query it from Redshift — it's still governed by the source's compliance and security rules. The same is true for the AI/ML integrations, by the way: the data never actually leaves, only people with access to run the query in Aurora get to call an ML function, and the security of the data is managed in the same place. That's a difference between federation and copying data around: when you copy data around, you also have to carry all the compliance, security, and governance rules along with it, whereas with federation and the ML integration, the source is where everything is governed.
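A sketch of how a table in Aurora PostgreSQL becomes visible to Redshift (the endpoint, role, and secret ARNs are placeholders; the schema and table names line up with the demo that follows):

    -- Expose the Aurora PostgreSQL schema to Redshift as the external schema "apg".
    CREATE EXTERNAL SCHEMA apg
    FROM POSTGRES
    DATABASE 'tpch' SCHEMA 'public'
    URI 'my-aurora-cluster.cluster-xxxxxxxx.us-east-2.rds.amazonaws.com' PORT 5432
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-federated-query'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-2:123456789012:secret:apg-credentials';

    -- The Aurora table is now queryable from Redshift:
    SELECT COUNT(*), MIN(l_shipdate), MAX(l_shipdate) FROM apg.lineitem;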
Let me show you a couple of things — this will require glasses. This is the Redshift query editor: you go to Redshift and run an ad-hoc query against a bunch of tables. I'll let this run for a second and then pause it so we can talk about what's actually going on. I'm selecting COUNT(*), MIN(shipdate), MAX(shipdate) from apg.lineitem — apg.lineitem being the lineitem table in Aurora PostgreSQL. This references a table that is actively managed in Aurora PostgreSQL, and I'm just doing a SELECT on it from the Redshift query console, with the query itself running in Redshift. You get an answer: thirty-some thousand rows, with a minimum date of January 2019 and a maximum of March, so about three months of data. That's your active data, queried from the Redshift query editor.

Then we run the same query again, but instead of Aurora PostgreSQL we run it against the schema that exists in Redshift; in this case there are a few million rows, with dates ranging from 2015 to 2018. Then we do the same thing referencing the table in S3 and run the same query there. You'll notice this one takes a little longer, because we're not hitting a fully managed, indexed table in Aurora or Redshift — it's just data sitting in S3 — and the results come back: six billion rows, with dates ranging from 1992 to 1998, so about six years' worth of data. So we're sitting in the Redshift query editor, querying tables that happen to live in three different places.

Now we have a view we've created, so let's look at its definition. The view is called lineitem_all, and you can imagine what it is: a UNION ALL of the table from Aurora, the table from Redshift, and the table in S3. You can run queries against this view, and you're querying a table whose historical data sits in S3, whose warehouse data sits in Redshift, and whose up-to-the-second data sits in Aurora PostgreSQL — and you never moved the data, and all of the security and compliance machinery for S3, Redshift, and Aurora is applied and managed by those sources.

Here we run a query against the view and look at the plan, so you can see the actual execution plan on Redshift — how Redshift reaches out to the different sources and pushes queries and filters down to them, bringing back only the data necessary to satisfy the query. The significant part: this query looks at orders with a quantity of 10, and you don't want to bring all the data from Aurora to Redshift just to filter most of it out and throw it away, so the filter is pushed down and Aurora returns only the rows with a quantity of 10. That result then gets aggregated in Redshift. It's also worth noting that you might have a very large Redshift cluster because you have a very large volume of data, or a small one — the compute is sized appropriately and independently of the sources, so again you can cost-optimize in the best way for the workload. So that's the Redshift federation.
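A sketch of the lineitem_all view and the pushdown query from the demo (schema names are assumed; the quantity filter is the one pushed down to Aurora):

    CREATE VIEW lineitem_all AS
        SELECT l_orderkey, l_quantity, l_shipdate FROM s3ext.lineitem    -- historical data in S3
        UNION ALL
        SELECT l_orderkey, l_quantity, l_shipdate FROM public.lineitem   -- warehouse data in Redshift
        UNION ALL
        SELECT l_orderkey, l_quantity, l_shipdate FROM apg.lineitem      -- live data in Aurora PostgreSQL
    WITH NO SCHEMA BINDING;

    -- The filter is pushed down to each source, so Aurora returns only matching rows.
    SELECT COUNT(*) FROM lineitem_all WHERE l_quantity = 10;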
The reason I wanted to cover this at some length is that I want you to understand the direction we're taking with Aurora. We're not trying to make Aurora everything to everyone for every workload. Within AWS we have analytics systems and machine learning systems, and instead of stuffing all of that into one database, we're integrating the database with those services so you can get the best of each service without having to do the integration work yourself. You can run an ML model on a giant SageMaker cluster from Aurora without doing any of the integration work; you can run a Redshift cluster for analytics that touches Aurora without doing the work. This is an area we'll continue to invest in directionally — integrating the database with the other services — because the applications and microservices you're building use a variety of services, and we want it to be as easy as possible for your data to be used across the application and all the microservices you build. Hopefully that gives you a better idea of where we're going, beyond just the individual features we happened to talk about here.

I know this is probably the last session, so maybe it's not so relevant, but the recordings of all of these are out there. If you really want to get into the weeds on how our storage works, including how and when we apply change records to pages, there's the DAT309 session demystifying Aurora storage; there were deep-dive sessions for Aurora MySQL and for Aurora PostgreSQL covering each of those; a bunch of sessions on migrations; and a session all about RDS that you may want to check out. We also have a new training site — this probably showed up in all of the sessions, but visit aws.training to see what's out there; we're going to keep putting a lot of training material up because we're building a lot of new services.

I'll stop here. Thanks a lot — my name is Tony Petrossian; you can find me on LinkedIn or Twitter. [Applause]
Info
Channel: AWS Events
Views: 2,311
Rating: 5 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, DAT202-R1, Databases, Amazon Aurora, Amazon RDS
Id: QJB1vUlkmWQ
Length: 59min 29sec (3569 seconds)
Published: Tue Dec 10 2019