AWS Nordics Office Hours - Databases on AWS with Gunnar Grosch and Tim Gustafson

Video Statistics and Information

Captions
Hello everyone, and welcome to the AWS Nordics Office Hours. My name is Gunnar Grosch and I am a Developer Advocate at AWS. Every week on this show I have AWS experts on, from the Nordics and elsewhere within AWS, to talk about a specific topic, and you, our dear viewers, have a chance to listen in, hopefully learn something new, and ask any questions you might have on the topic. We'll do our best to answer those questions, and this week is no different. I'm happy to welcome onto the show Specialist Solutions Architect for databases, Tim. Hello and welcome, Tim.

Thank you, I'm very happy to be here.

Tell the viewers a bit about yourself, Tim.

Sure. As you say, I'm a Specialist Solutions Architect dealing with databases, primarily open source databases running on AWS, and also the Aurora offering, which is tied to those open source databases. I actually started on databases around 1994, so that will date me a little bit. I started on mSQL, which is actually a kind of predecessor to MySQL; I don't know if everybody knows about mSQL or not. I grew up from that over the last 25-ish years, always being kind of an unofficial DBA but never really having it in my title. I spent a lot of time doing app development and system administration and that kind of thing. Almost three years ago I started working for AWS, I moved into a specialist role about a year ago, and I've really been enjoying it quite a bit. It's where I feel at home, to a certain extent.

So I guess I have to ask: your name, Tim Gustafson, sounds really Swedish, and you're located in Sweden, but you are not Swedish.

Right. My father's side of the family came from Kalmar, and his great-grandfather emigrated to the US in 1871, I think it was. About three years ago my wife and I had a kid, and we decided that we wanted to live overseas, and we wanted to do that before he graduated high school, so we chose to go when he was still a baby. We've lived here for three years now and we really love it. Sweden is actually a really awesome place and we're very happy to be here.

So on to the topic then, Tim: databases on AWS. That's quite a big topic to cover. First off, when we talk about databases on AWS now, there's a lot of talk about moving applications out of relational databases and into NoSQL and different serverless databases. So should we move everything to NoSQL?

We probably shouldn't move everything to NoSQL. I think that's one of the most common misunderstandings that I hear when talking to folks. There are still a lot of perfectly valid workloads for relational databases, and it does us a disservice to think about it in black and white terms like that. There are definitely database engines that are better suited for certain types of workloads, but it's very rare, when you have even a moderately complex workload, that everything fits nicely into one paradigm. As I mentioned, I started on databases what I consider a pretty long time ago, so I spent almost 18 or 20 years thinking about everything in terms of how I would fit it into a SQL database. The NoSQL landscape has changed that dramatically, and for a lot of really good reasons: NoSQL offers advantages that you simply can't have in an ACID-compliant database. But that doesn't mean that it's the right place for everything.
If you're presenting data to users and you've got a web application or a mobile application, NoSQL really helps with a lot of those kinds of workloads. When you start talking about things that require really high levels of transaction isolation, or really high levels of relationships between different tables, SQL is still a really good way to go. One of my standard cautions to folks is: don't think about it in terms of "should we move to NoSQL or should we stay in SQL". The answer is probably a little bit of both; some of your workload might be in SQL and some of it might be in NoSQL, or in one of the other purpose-built databases that serves specific needs, like time series data or cryptographically secure ledgers.

You touched there on purpose-built databases, and that's a term we hear quite often, and we at AWS use it quite often as well. So what is a purpose-built database?

I think the easiest way to describe it is to give you a concrete example. Probably one of the more popular purpose-built databases that I currently work with customers on is the Timestream database on AWS, which is built specifically to ingest time series data. Think about things like IoT sensors, clickstream data, or maybe data about people entering and leaving buildings; that's really well suited for time series. Time series data has some really specific properties. Everything that happened in the last hour, week, or 90 days is really interesting to the application; depending on what the application is, that window can be a little different, but the data tends to be a lot less useful as it gets older, while still being important to keep around. Timestream handles this problem really nicely by keeping the most recent chunks of data in memory and making them highly available and very low latency to the applications that need to consume them, and also to ingest them: because we're ingesting the transactions into memory instead of onto disk, we're able to do it much more quickly than you could if you were committing to disk. The system then automatically takes care of taking that older, perhaps slightly less interesting data and moving it off to magnetic storage, but from the perspective of the application it's a single view of the data. When you query your data you're not having to query two different data sources; you query the one data source and the engine takes care of bringing the two together as needed. So for the vast majority of your queries everything happens in memory and is very fast, but when, let's say, a customer wants to go in and see their historical activity, they're still able to access that historical data without needing to break out into some other system altogether. That's an example of a purpose-built database for ingesting and consuming time series data.
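As a rough illustration of that ingest-and-query pattern, here is a minimal Python sketch using boto3 against Timestream. The database name, table name, and sensor dimensions are hypothetical placeholders, not anything from the conversation.

```python
import time
import boto3

# Hypothetical database/table names; assumes they already exist in Timestream.
DATABASE = "iot_demo"
TABLE = "sensor_readings"

write_client = boto3.client("timestream-write")
query_client = boto3.client("timestream-query")

# Ingest one temperature reading; recent data is served from Timestream's memory store.
write_client.write_records(
    DatabaseName=DATABASE,
    TableName=TABLE,
    Records=[{
        "Dimensions": [{"Name": "sensor_id", "Value": "sensor-42"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.7",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),  # milliseconds since epoch
    }],
)

# Query the last hour of readings; the engine transparently combines the
# memory store and the magnetic store, so the application sees a single view.
result = query_client.query(
    QueryString=f'SELECT sensor_id, measure_value::double AS temperature, time '
                f'FROM "{DATABASE}"."{TABLE}" WHERE time > ago(1h) ORDER BY time DESC'
)
for row in result["Rows"]:
    print(row)
```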
There's a handful of other purpose-built databases out there. QLDB is similar: QLDB stands for Quantum Ledger Database, and if you have the need to cryptographically verify and secure each transaction along the way, QLDB can help you with that kind of thing. There's also Neptune, which is a graph database, a type of database built to map relationships between different entities. The common example people give is social media, where I'm a friend of you and you're a friend of somebody else, and that engine is able to resolve that indirect link very quickly and very efficiently, which is actually quite expensive to do in a SQL database.

So you, as the solutions architect, are there specific properties you look at when you try to choose the right database engine or database service to use?

Yeah, absolutely. One of the most critical things to look at when someone starts asking questions like that is the access pattern: how is the data being consumed? If the data is being written in and then read out in a kind of key-value pattern, NoSQL makes a lot of sense, provided you're able to architect your schema in a way that lets you retrieve data from a NoSQL database very efficiently. If you're doing a lot of relational work and light analytics workloads, then a relational database like MySQL or PostgreSQL makes perfect sense, or Oracle or MS SQL, whatever. On the other hand, if you're doing large-scale analytics and you're really just crunching numbers at a huge level, then you might move into something like Redshift, which is built for that kind of aggregate-function analytics; that engine in particular is really built for terabytes or petabytes of data. There isn't really a hard and fast rule, but if you've got, let's say, less than 50 terabytes of data, you can probably analyze it in a regular SQL database without a whole lot of pain. Once you get past the 50 terabyte mark, again depending on the schema and the use case, you start to look towards data warehouse offerings like Redshift, or possibly Athena, where you're reading the data directly from S3. That suits older data that you're still analyzing but that is very static and doesn't change much, where you're OK with the queries having slightly higher latency because the data has to be read from S3 instead of from an in-memory index or something like that.

So how you're using the data is probably the primary question, the very first one you want to ask. Then other factors start to come in, like the velocity of the data: how often does new data come in, how often does old data go out, how long are you keeping the data? I work with some financial payments customers who have tons of transactions coming in very quickly, where the last 90 days of transactions are super interesting but then move off to a data warehouse after that. That's an example of breaking the workload up into two different data engines according to the needs of the application. And finally there's a certain amount of developer preference. It would be nice to say that everybody who can take advantage of a NoSQL database should, and that's a nice idea, but there's also the reality of a learning curve and what people are comfortable with, and we certainly don't want to stymie innovation and development for a long period of time just because of a learning curve. So we might ease into some of these other technologies over time instead of just cutting over and saying "no, this needs to go in a NoSQL database today because that's the right fit for the job". It's a combination of those kinds of things; it really is.
There isn't really a decision tree per se that I could put in front of you where you could make every decision perfectly. There's a lot of context and a lot of situational awareness that needs to be taken into account when you're making the decision, but those are pretty high-level, general-overview concepts that people can apply to the question.

How aware do you see that customers are today about the different database services that we have? Or do you, as a solutions architect, introduce new services to customers when you have these conversations?

The answer is both. Some of the folks that I talk to are really knowledgeable, really know our stuff in and out; in a lot of cases they ask me questions that I don't even know the answer to and I have to go and find the answers for them. So a good chunk of our customers, I don't want to say it's 50 percent, but maybe approaching 50, somewhere in the 40 percent range perhaps, know the offerings pretty well and are coming to us with really informed decisions and opinions. And then about another 40 percent are kind of new; they understand NoSQL from an "I read it on Google" perspective but they don't really understand the inner workings of it, or why NoSQL is the right choice in some cases versus others. With those folks we take a little bit longer, we spend a little more time explaining concepts, and we talk about the different scenarios in which one makes more sense than another. Particularly when making the move from SQL to NoSQL there's a lot of education that has to happen around thinking differently about data. SQL has been around since the 70s, so you've got 50 years of people thinking a certain way, and it's easy to get pigeonholed into that thinking. One of the worst things you can do is take a relational schema and translate it straight across, because you're pretty much guaranteeing yourself poor performance when you do that. For those folks we have to take time not just to parse the schema but to understand the business use case; a lot of decisions about schema design were made for reasons that don't apply when you're talking about cloud technologies anymore. We really need to understand what business case you're actually trying to implement and then work backwards from that, instead of saying "this is how it's set up today and this is how it should be set up tomorrow". There's a lot of nuance that gets lost in just translating the schema from SQL to NoSQL.

When we talk about NoSQL we often talk about DynamoDB, the service we often choose for, for instance, serverless applications. Do you see that, because a service or something we're developing is serverless, we always go to DynamoDB because of that, or do you see customers use other services as well with serverless applications today?

It's really a mix. Obviously DynamoDB is a really great choice if your data usage patterns fit it and your application fits it really well, but there are plenty of folks who are doing serverless and are still connecting to MySQL or Postgres or some other relational database. In fact, with the Aurora Serverless offering we have something called the Data API that is kind of a bridge between the two worlds. One of the downsides of pointing a microservice directly at a traditional database engine is that traditional database engines tend to have what we would consider fairly low limits on concurrency; even the beefiest Postgres server or the beefiest MySQL server can handle maybe eight or ten thousand concurrent connections. If you've got a microservice that's really, really busy, it's not uncommon to see fifty or a hundred thousand concurrent users. The Data API addresses that by allowing you to connect to the database and perform the same queries that you would perform over the traditional MySQL or Postgres connection, but you're doing it over a REST API instead of over a TCP connection, and that allows you to scale your microservices out much further against the relational database than you would be able to if you were connecting over a standard SQL connection.
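For a sense of what that looks like from a Lambda function or other microservice, here is a minimal sketch of calling the RDS Data API via boto3. The cluster ARN, secret ARN, database name, and table are placeholders you would substitute with your own.

```python
import boto3

rds_data = boto3.client("rds-data")

# Placeholder ARNs; the Data API authenticates with a Secrets Manager secret
# instead of holding a long-lived TCP connection to the cluster.
CLUSTER_ARN = "arn:aws:rds:eu-west-1:123456789012:cluster:my-aurora-cluster"
SECRET_ARN = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:my-db-secret"

response = rds_data.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database="appdb",
    sql="SELECT id, status FROM orders WHERE customer_id = :customer_id",
    parameters=[{"name": "customer_id", "value": {"longValue": 42}}],
)

# Each call is a stateless HTTPS request, so thousands of concurrent Lambda
# invocations don't each hold open a MySQL/Postgres connection.
for record in response["records"]:
    print(record)
```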
There are other ways to fry that fish too. You could do something like RDS Proxy, and there are a number of third-party offerings out there that do roughly the same kind of thing. But I think the short answer to your question is: not really. Moving to microservices doesn't necessarily mean you have to move off of SQL. It's a pretty good path to consider, but I wouldn't say it's a slam dunk for every application.

OK, I think that's a good answer: it depends.

I'm going to get a shirt that just says "it depends" right across it, so I can hold it up whenever people ask me those questions.

We have it as a banner. So, you touched upon Aurora, Amazon Aurora. Tell us a bit about Aurora, first off.

Aurora is basically Amazon's version of a Postgres or MySQL database where we've fixed one of the key underlying constraints of traditional databases: we basically rewrote the storage layer entirely. In a traditional MySQL or Postgres you've got the Postgres or MySQL binary running in Linux, attached to some storage, and your database is stored on that storage; it could be an array, a handful of EBS volumes, or something similar. That design is essentially universal; it's the way databases have been architected, again, since the 70s. What we did is remove that storage layer, so the database engine is not talking to EBS directly anymore, and we replaced it with a distributed, horizontally scalable storage engine. When MySQL or Postgres goes to write a block of data, it actually writes to that storage engine, and the storage engine takes care of replicating it, so you get six copies of your data when you write to Aurora: two copies in each of three Availability Zones. We've also changed the way data is written to disk: rather than keeping materialized views of the database tables laying around on disk, we only store the transaction log, so we don't have to write as much data to storage as you would in a traditional database system. When the client queries data back, the database engine requests the pages of data from the storage engine, and the storage engine takes care of recreating those data pages from the transaction log. This change in architecture has a huge, meaningful impact on how well we're able to perform at the storage layer.
For example, crash recovery in Aurora takes much less time than it would in a traditional, vanilla MySQL or vanilla Postgres installation. We can replicate data between Availability Zones in essentially real time, and we can replicate data across the globe using something called Global Database. The replication lag from one side of Europe to the other with Global Database is less than a second, and the replication lag isn't dependent on the types and quantity of queries being run. With logical replication, if you have a transaction that takes half an hour, your replica can fall half an hour behind the master; with the physical replication in Aurora that doesn't happen. The replication time within a single Region tends to be less than 20 milliseconds, within a geographic region like Europe it tends to be less than a second or a couple of seconds, and if you're replicating down to Sydney, which is about as far from Europe as you can get, you're still expecting sub-one-minute replication latency between the primary Regions in Europe and the Sydney Region.

Rewriting the storage engine in the way that we have solves a lot of scaling issues related to databases. When you spin up an Aurora database you don't tell us how much data you want to store, you just say how much compute you want, so you need to know roughly how much activity you expect on your database to start with, and we provision the storage automatically on the back end for you. As you write more data, the storage expands out horizontally; it's actually scaled across multiple storage nodes. Every 10 gigabytes of data is on a separate storage node, so for a terabyte of data you have at least 100 storage nodes that your data is striped across, and actually it's 600 because you've got six copies of each block of data. You've got a tremendous amount of write I/O and read I/O throughput that you can access when you're running queries against the database, and as a result queries can be much, much faster for I/O-bound queries. If your queries are memory-bound or CPU-bound you'll still have the same problems that you would have with open source MySQL or Postgres, but for I/O-bound queries Aurora can really knock your socks off.

All right, so that was five minutes of positive things about Amazon Aurora. With all that in mind, why would someone run MySQL or PostgreSQL on RDS or on EC2 instances?

That's a really excellent question, and it's again something we talk about a lot with customers. There's a handful of reasons. The first thing is that Aurora is a really great service and it offers lots of features, but its pricing has a dimension involving IOPS, so if you have astronomically high IOPS but you don't have the budget for that, it sometimes makes sense to put your database back onto RDS or even onto an EC2 instance, because you have much finer-grained control over I/O on EC2 and on RDS than you do with Aurora. Aurora is also designed for massively parallel operations. If you have a database that principally serves, say, a batch job that runs once a night or a couple of times a day, is mostly single-threaded, and has concurrency measured in single digits or maybe low double digits, you can actually sometimes get better performance out of an RDS MySQL or RDS Postgres instance than out of the equivalent Aurora MySQL or Postgres instance; it depends a little bit on that.
As far as EC2 is concerned, people choose EC2 for a couple of different reasons, most of which revolve around using third-party plug-ins or needing to tune things that are part of the managed service. If you're creating a lot of UDFs, installing a lot of third-party plug-ins, or really tuning specific DB parameters and that sort of thing, you may still want to run on EC2. Another use case I've seen, and this is kind of rare to be honest, but it's worth keeping in mind: if you have a relatively small data set with super high I/O, on an EC2 instance you could load that into a RAM disk and have the database engine access the data as though it were a mechanical drive or an SSD, but from RAM. That essentially eliminates I/O for that data set, because everything is happening in memory so there is no disk I/O, and it also makes access times very low because there's no storage medium involved at that point. Again, it's a limited use case, I don't see it very often, but I have seen it deployed a couple of times where people have a relatively static data set that is just being read at enormous volume. And the more you move from the Aurora side of the spectrum to the EC2 side, the more work you have to do as a database administrator. When you're deploying on EC2 you're responsible for everything; when you're deploying on RDS you need to pick a couple of options but most of the stuff is handled on the back end for you; and when you're deploying on Aurora you basically say "I want a database" and we more or less take care of everything else on the back end for you.

So, from, I know I don't like this word, but from a lock-in perspective: if you start off with Aurora, are you locked into Aurora, or what are the migration paths if you want to move to RDS or to EC2?

Aurora supports a number of different ways to get data in or out. One of the things I really like to tell customers, especially new folks I'm talking to for the first time, is: I want you to be our customer because you want to be our customer, not because it's hard to get your data out. Along those lines, Aurora MySQL and Aurora Postgres, because they are mostly vanilla MySQL and mostly vanilla Postgres to the extent that we can make them, let you use all the same data import and export tools that you could use on a self-managed instance. You've got access to things like mysqldump, you've got access to pg_dump for Postgres, and you've got access to inbound and outbound logical replication. So if you have a database running on-prem and you want to replicate your data from Aurora to it in real time, you can do that over the standard logical replication channel. You can go the other way as well: you can replicate your data in from on-prem into a cloud database, and then maybe your application hits the cloud copy while the actual writing happens on-prem, or something like that. You can replicate basically to anywhere you have a network path; if you've got MySQL running anywhere else on the internet, as long as you can make a TCP connection between AWS and that other database, you can replicate data between them. And in addition to the standard MySQL and Postgres tools and standard logical replication, we've also got the Schema Conversion Tool and the Database Migration Service, which takes care of moving data in and out of a number of different database engines, and Aurora is just one of them. You can go in and out of Aurora: from Oracle on-prem to Aurora MySQL or Aurora Postgres on AWS, or from Aurora MySQL on AWS back out to Oracle if you decide you don't want to stay on AWS. So there are a number of different ways to get the data to go in either direction.
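As one concrete, hedged example of that last path, here is a rough boto3 sketch of starting a Database Migration Service task. It assumes you have already created a replication instance and source/target endpoints (the ARNs below are placeholders), and it uses a full load plus ongoing change data capture.

```python
import json
import boto3

dms = boto3.client("dms")

# Placeholder ARNs for a replication instance and endpoints created beforehand,
# e.g. an on-prem Oracle source and an Aurora MySQL target.
REPLICATION_INSTANCE_ARN = "arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE"
SOURCE_ENDPOINT_ARN = "arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE"
TARGET_ENDPOINT_ARN = "arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET"

# Migrate every table in the hypothetical "sales" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-aurora",
    SourceEndpointArn=SOURCE_ENDPOINT_ARN,
    TargetEndpointArn=TARGET_ENDPOINT_ARN,
    ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing replication
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```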
That's great. All right, if you just joined us, welcome to the AWS Nordics Office Hours. My name is Gunnar Grosch, and this week I have Tim Gustafson on. He's a Specialist Solutions Architect for databases, and that is the topic for this week: we're talking databases on AWS. This session is recorded, so you'll be able to watch it afterwards, because there is a lot of information about databases in this session; go back and watch it once again if you wish. If you have any questions for Tim, just put them in the chat and we'll do our best, or he'll do his best, to answer them; I'll just nod and look like I know what he's talking about.

So Tim, we've talked about Aurora now, and as a fan of serverless I know that we also have Aurora Serverless, and at re:Invent we launched v2, version two, of Aurora Serverless. Can you tell us a little bit about the difference, what happened with v2?

Let me start by saying Aurora Serverless has actually been my favorite AWS feature since before I was even a database specialist. I think it very nicely solves a problem for nerds like me and for other folks who are trying to use databases in a kind of serverless way, because it fixes one of the principal problems with databases, and particularly with SQL databases (ignoring NoSQL for a minute, because NoSQL kind of has the scaling problem solved to a certain extent). Scaling traditional relational databases has always been challenging, because there basically always has to be an instance somewhere that is the leader and is responsible for arbitrating transactions and deciding who wins in the case of a conflict, and scaling that has always been hard. Historically, if you need to scale up your primary it generally requires a reboot, which could mean several minutes of downtime for an application, or it could mean failing over to a replica while you reboot the primary. It works, and we've got a lot of tooling around it that makes it less painful than it used to be, but it's still not ideal. Aurora Serverless v1 fixed some of that problem by giving us a MySQL database engine, and now Postgres also, that allows you to scale the compute portion of your database engine up and down without affecting the storage. It does this based on a couple of different metrics; for Aurora Serverless v1 it's principally CPU consumption and the number of concurrent connections, and when either of those two metrics exceeded a certain threshold, the database engine would automatically scale up to meet the demand for the new requests coming in.
With Aurora Serverless v1 this worked pretty well, but there was a scaling event that took place when you were scaling up or scaling down; it took around 20 seconds or so, and it caused some trouble for some applications because the clients would experience latency and lag while it was scaling. That wasn't great, but it was better than the alternative; it just wasn't perfect. And then there are a few other limitations with v1 that made it hard to recommend for really high-end production workloads. There was no such thing as high availability: with Aurora Serverless v1, if your master crashed for some reason you had to wait for a new primary to come online, and there's no quick failover to a replica because there are no replicas with v1. Things like cloning and replication and a number of other features just aren't available in the v1 offering.

V2 aims to fix all of that. The goal for v2 is basically to be at feature parity with provisioned Aurora: if it's supported in provisioned Aurora, the plan is that it will also be supported in Aurora Serverless v2. So replicas will work, global database will work, RDS Proxy will work; all of the other things that people have been waiting for and that v1 hasn't supported will become available with the v2 offering. That's the plan. The features that were serverless-specific before are planned to be available with v2 as well, so things like the Data API should be available for v2 too. It really makes it hard to recommend provisioned Aurora anymore, because if you zoom out a little bit, almost every database deployed anywhere in the world has variable load. You're going to be busy Monday to Friday, nine to five, let's say, if you've got an application that is facing other businesses and weekday consumers; if you're running something that supports sporting events, then evenings are likely to be busy for you. There are going to be periods when your usage is really high and periods when your usage is really low, and you're wasting a lot of money provisioning a database instance for that high demand that sits idle nights and weekends, let's say. Serverless fixes that: it allows you to scale up and scale down according to that demand. And you can actually define your own metrics: where v1 used CPU consumption and the number of connections, with v2 you can define your own metrics, so if you know that you need a replica for every thousand concurrent users, you can scale out replicas according to that, or scale up the primary according to your requirements. So there are a lot of improvements along those lines. One other thing worth mentioning is that with Serverless v1 your capacity doubled at every step, so you went from 1 to 2 to 4 to 8 to 16, all the way up to 256.
With v2 it actually has a much finer scaling increment, so you're not doubling your capacity just because you're 10 percent over your threshold; you can add just enough capacity to accept the 10 percent of extra work, and you're not over-provisioning, in a serverless way. So there are a lot of improvements around those kinds of things. As I said, it's going to be really hard for me to recommend provisioned Aurora to folks once Serverless v2 comes out; I'm struggling to think of use cases where it would be really appropriate to do so.

That's really interesting to hear, moving into serverless as a way to always adjust for that variable load. As you said, all workloads have a variable load somehow. We have a question from Raging Hamster: any secret tips to avoid surprise billing based on scaling, for example based on scheduled scaling and so on?

With serverless, changes to the serverless configuration can all be made through the control plane API, which is the SDK basically. So if you know that you're going to have a busy time, you can build a Lambda function or something somewhere that calls the API to change the minimum and maximum scaling parameters for that window of time. If your nights-and-weekends traffic is such that you are perfectly willing to let things take a bit longer, you can set the maximum scaling limit for nights and weekends to whatever value you want, to one let's say, to make sure that the capacity never goes above that regardless of load, and then you can set it to more generous proportions during business hours when you're expecting the load to be higher and are willing to incur that cost.
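As a minimal illustration of that idea (not something shown verbatim in the session), here is a sketch of a Lambda handler that an EventBridge schedule could invoke to clamp an Aurora Serverless cluster's capacity range outside business hours. The cluster identifier and capacity values are placeholders, and the ScalingConfiguration shape shown is the Serverless v1 form.

```python
import boto3

rds = boto3.client("rds")

# Hypothetical cluster name and capacity windows.
CLUSTER_ID = "my-serverless-cluster"
OFF_HOURS = {"MinCapacity": 1, "MaxCapacity": 2}        # nights and weekends
BUSINESS_HOURS = {"MinCapacity": 2, "MaxCapacity": 64}  # weekdays, 09:00-17:00


def handler(event, context):
    """Invoked by two EventBridge schedules, e.g. with {"window": "off_hours"}."""
    window = event.get("window", "off_hours")
    capacity = BUSINESS_HOURS if window == "business_hours" else OFF_HOURS

    # Aurora Serverless v1 exposes its capacity range through the cluster's
    # ScalingConfiguration; lowering MaxCapacity caps what the cluster can
    # scale (and bill) up to, regardless of load.
    rds.modify_db_cluster(
        DBClusterIdentifier=CLUSTER_ID,
        ScalingConfiguration=capacity,
        ApplyImmediately=True,
    )
    return {"cluster": CLUSTER_ID, "applied": capacity}
```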
That's great. Next question then: can Aurora MySQL be added to X-Ray for tracing?

If you use the Data API, I believe the answer is yes, although I would have to double-check that for you; I don't have it off the top of my head, but I believe the Data API does support X-Ray. If you still go over the traditional MySQL or Postgres TCP connections, I don't think there's a way to instrument that with X-Ray, because those protocols are defined by the open source projects, and I don't think we can add annotations to them that the X-Ray daemon would pick up on the other side to add to the traces.

All right, a more generic question from Alex: hey Tim, what was the biggest database-related challenge you remember in your career?

As I mentioned earlier in the call, I was a developer for a very long time, almost 20 years, before I came to AWS, and I had developed a content management system for a small IT company in New York when I lived back out there. I made a very poor architectural decision at the time, because this was 1998 or '99, I think, so cloud didn't exist and all the stuff we're talking about today was just a pipe dream. We built a CMS that basically used a single database engine for all of its tenants, and that database grew to the 60, 70, 80 gigabyte size, which doesn't sound like a lot of data today, but in 1998 it was, and moving that much data around became super challenging. The takeaway that I've carried since then is: don't put all your data in one database, especially multi-tenant databases. Spread them across different databases, or, even if they're in the same database engine, put them in different schemas, so that you can treat them all individually, because that much data at the time was really problematic. The same thing is basically true today. Given the scalability and the serverless nature of a lot of the stuff we're talking about, there's really not a lot of incentive to put more than one application into a single database instance. There are some caveats: if you've got a bunch of microservices that are all really low throughput, it's probably OK to put them into a multi-tenant, hotel-type database engine, but for your primary customer-facing microservices, put them into separate databases as much as possible, so you're limiting the blast radius around maintaining that much data. Once you start getting up into terabytes, and Aurora supports up to 128 terabytes of data right now, which is great, if you have to restore 128 terabytes of data or transfer it somewhere else, it's going to take a long time. So think about breaking things up into smaller chunks that are more manageable, and I actually think that's good advice for basically anything cloud related: break up the monoliths into smaller chunks so they can be managed a little more easily. Even with multi-megabit or multi-gigabit internet connections in our houses, it still takes a while to transfer data sometimes, and breaking it up into small chunks helps. I forget who said it, but there is a funny statement from back in the day that you should never underestimate the bandwidth of a truck full of hard disks driving down the freeway. That's still true today, and we have service offerings built around it: Snowball and, oh my goodness, why am I forgetting the name of the other one, Snowmobile, thank you; I can have my SA license taken away now. Snowball and Snowmobile are basically the physical manifestations of that funny quote: it is still often faster to put data into a car and drive it somewhere than it is to transfer it over the internet.

So we've talked about several different types of databases and database services now. Can we talk a bit about in-memory databases? What are the offerings that AWS has right now in regards to in-memory databases?

ElastiCache currently supports Memcached and also Redis. Memcached is older and a bit simpler; it's a little less featureful than Redis, but it still is a valid option. Redis adds a couple of nice features that a lot of people find attractive: Redis can store its state to disk, so that when you boot it back up again you're not starting with a clean cache, and it supports replication. I'm not fully up on the in-memory databases right now, but I believe I remember seeing recently a note that we support cross-region replication with Redis now; don't quote me on that, but I'm pretty sure I saw that come out. So there's stuff you can do with Redis that is more challenging to do with Memcached, multi-node clusters and that kind of thing. Caching data in memory is absolutely a valid and very useful paradigm, especially for data that's being accessed at very high velocity where the cost of computing the answer to the question is high. If you can construct an application in a way that it can deal with that kind of cached data instead of hitting the raw database every time, absolutely go for it.
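To make that concrete, here is a small cache-aside sketch in Python using the redis-py client against an ElastiCache Redis endpoint. The endpoint, key naming, TTL, and the expensive query function are all hypothetical.

```python
import json
import redis

# Placeholder ElastiCache Redis endpoint.
cache = redis.Redis(host="my-cluster.abc123.euw1.cache.amazonaws.com", port=6379)

CACHE_TTL_SECONDS = 300  # how long a computed answer stays valid


def expensive_report(customer_id: int) -> dict:
    """Placeholder for a costly aggregation against the primary database."""
    return {"customer_id": customer_id, "orders_last_90_days": 123}


def get_report(customer_id: int) -> dict:
    key = f"report:{customer_id}"

    # Cache-aside: try the in-memory copy first...
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # ...and only hit the raw database on a miss, then populate the cache.
    report = expensive_report(customer_id)
    cache.set(key, json.dumps(report), ex=CACHE_TTL_SECONDS)
    return report


print(get_report(42))
```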
Redis and Memcached are the roll-your-own, do-it-yourself caching options. We also have Timestream, which I talked about before, which has an in-memory component, and we have the DAX offering, the DynamoDB Accelerator, which does some of the same kind of acceleration in front of a DynamoDB NoSQL database; it serves the same basic purpose, to reduce the latency it takes to retrieve data, in that case from the NoSQL database engine.

Do you see the same type of questions there as well, in regards to using it as a managed service like ElastiCache versus running it on your own instances?

To a lesser extent, because those database engines are a bit simpler. Frankly there are fewer configuration options, so there's less that people really want to do with them. With either Redis or Memcached, basically what you're asking for is a large chunk of memory that is network accessible, and both of those engines support that; they both support the key-value lookup paradigm. There is some tunability to them, but for the most part people know how to use them, they're pretty straightforward. There are many fewer questions, I feel, around in-memory caches like those than there are around the relational engines and DynamoDB and the other NoSQL ones.

So then, a question I see quite a lot today when building applications is about the global application; we're talking about global scale, and that really applies, and perhaps brings a lot of questions to mind, in regards to databases. If we want to go global with our application, then of course we need to use some sort of data store, some database. What are the options in regards to global databases?

There are principally two managed options from AWS that I think are worth knowing about here. The first one is that Aurora supports Global Database, and Global Database in Aurora is a single primary writer Region and up to five replica Regions; these can be any of the AWS Regions that support it, so you can wind up with six copies of your database distributed globally around the world. Global Database for MySQL also supports something called write forwarding. Say your primary is in Dublin and your replica is running in Sydney: an application running in Sydney, when it wants to update the database, connects to the Sydney endpoint and sends an INSERT or UPDATE statement, and the Aurora database engine takes care of forwarding that back to Dublin. Dublin does the actual work of committing it to disk, makes sure the data is durably stored, and then responds back to Sydney and says "OK, you're done, go ahead and move on". That gives you a really globally scalable database, and it's still a relational database, so you get all the joins and everything you would expect from a traditional relational database. The replication lag is essentially tied to the speed of light: across Europe it's measured in milliseconds, maybe a double-digit number of milliseconds, perhaps 100 milliseconds.
When you're replicating down to Sydney it could be longer than that, just because of the distance involved and the speed of light, but it's actually pretty acceptable replication. And because of the way physical replication works in Aurora, it's not tied to a serial replay of the redo log, so even if you have really high concurrency on your primary, the global replicas are really able to keep up. I have a graph that I share when I do a deep dive on Aurora that shows what the replication latency looks like for different transaction volumes: with logical replication you get to around thirty or forty thousand transactions per second and then the replication latency starts to go way up, as high as a couple of hundred seconds of replication lag; with the physical replication in Aurora you can get up to 200,000 queries per second and the replication lag stays around 0.5 seconds, so around 500 milliseconds. So the global replica in Aurora is super cool, super good technology, and it really fits that use case for applications that are using a relational database.

For applications that are not using a relational database, DynamoDB also supports something called global tables. When you create a table in DynamoDB, you set it up as a global table; again, let's say we're going from Dublin to Sydney, so you create the table in Dublin, you create the global table in Sydney, and the DynamoDB engine takes care of replicating data between the two of them. The difference with DynamoDB global tables is that you can actually write to either endpoint directly and the DynamoDB engine takes care of replicating the data to the other Region, so there isn't this latency of going back to the primary and then coming back to Sydney again; it can actually happen in Region. There's then a write conflict resolution protocol that happens on the back end, which handles situations where two Regions update the same item in short succession.
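Here is a rough boto3 sketch of that setup: adding a replica Region to an existing DynamoDB table (using the current global tables version, where replicas are managed with UpdateTable) and then writing through the local regional endpoint. The table name, key attribute, and Regions are placeholders, and the source table is assumed to already meet the global tables prerequisites.

```python
import boto3

TABLE_NAME = "orders"  # hypothetical table that already exists in eu-west-1

# Add a replica in ap-southeast-2, turning the table into a global table.
dynamodb_dublin = boto3.client("dynamodb", region_name="eu-west-1")
dynamodb_dublin.update_table(
    TableName=TABLE_NAME,
    ReplicaUpdates=[{"Create": {"RegionName": "ap-southeast-2"}}],
)

# An application running in Sydney writes to its local endpoint; DynamoDB
# replicates the item to the other Region and resolves conflicting writes
# on the back end.
sydney_table = boto3.resource("dynamodb", region_name="ap-southeast-2").Table(TABLE_NAME)
sydney_table.put_item(Item={"order_id": "o-1001", "status": "PLACED"})
```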
So what are the considerations for when to choose to go global with your database, and when not?

It depends on what your application is doing. If you're building an application where people are, let's say, placing orders and checking the status of their orders, global databases make a lot of sense: they bring the data close to the users, and the kind of concurrency you see in those sorts of applications is low enough that there's very little chance of a person stepping on somebody else's foot when they're updating a record. It works really well, and it's probably worth the replication hit that you take for going those distances. One thing I didn't mention is that replication lag with Aurora Global Database is mostly related to the speed of light, so the speed of light from Dublin to Sydney and back again, and remember a TCP exchange takes a couple of round trips. I think the latency from Sydney to Dublin is something like 80 milliseconds just by speed of light, but because of TCP it's more like 320 milliseconds, because there are a couple of packets back and forth. And then, because a single TCP packet can only be so large, if you're updating a megabyte of data there are a bunch of packets that get sent back and forth, and the replication latency, or rather the transaction latency, can go up a little bit because of that. But for most applications I think it makes total sense and there's no reason not to do it.

The only situation where I think it probably makes better sense to use a single master, as Aurora Global Database does, as opposed to DynamoDB, is when you're dealing with things like financial transactions, where you're taking money from one person's account and putting it into another and you really want to make sure you've got the right consistency. The conflict arbitration protocol for DynamoDB makes that kind of transaction a little more challenging, because you need to make sure that when you transfer money from my account to your account, I don't simultaneously create another transaction in Sydney that transfers money from my account to a third party's account, because now I've double-spent my money. For those kinds of situations, where you want a super high level of assurance over the correctness of the data, it probably still makes sense to have a single writer. But for most other things, DynamoDB global tables work really well; they're really great at keeping the data in sync across different Regions, with an acceptable level of latency for most applications.

So, talking about Aurora again: as a person who has set up MySQL replication multiple times in the past, I know the ease of using Aurora instead for having global databases, but what about multi-master configurations?

There isn't currently an official multi-master offering for cross-Region Aurora at this point; it's all based on the write forwarding technology, which is a very close approximation of multi-master. The difference is that with write forwarding you're still dependent on your primary Region: if your primary Region goes offline for some reason, writes will stop until either the primary Region comes back online or you promote one of your other Regions to be the new primary. You can generally do that in about a minute: once you detect there's been a failure and you decide you want to cut over to another Region as the primary, you can do it pretty quickly, about a minute or less depending on a couple of different factors. There is a multi-master option within a single Region for MySQL. I know there are definitely use cases for it, but for people who are looking for multi-master, it's worth having a conversation about what they're trying to accomplish, because it does complicate things and it raises a couple of other questions. I would want to understand what the requirement for multi-master is and what problem you're trying to solve before actually implementing it, because it is more complicated, and things can break a little bit if you've got conflicts. So it's supported, there's a feature for it, by all means go ahead and use it; I would just make sure that you have an architecture conversation around the appropriateness of it for a given solution.

I think that's sound advice for many pieces of our architecture, to have a discussion with subject matter experts, and that brings me to my usual tip of the week: you as a viewer are able to schedule one-on-one sessions with AWS experts.
I sent a link in the chat right now, and you're able to schedule a session to talk with a solutions architect about your use case, what you're trying to build, so make use of that.

Another part of most applications that we build is the ability to search within the application. You're able to search within most of the databases, of course, but we also have a service for that: Elasticsearch. Where do we position Elasticsearch in this map of databases and data stores?

Ironically, Elasticsearch is actually counted in the analytics group at this point, so strictly speaking it's not part of my team, but I can talk a little bit about it. It definitely is a very standard part of applications. Elasticsearch provides a bunch of search functionality that is absent from basically all the other databases, and it's not really so much a database technology thing as a language processing and analytics thing. Elasticsearch does things with the data that other database engines just don't do, like stemming. I don't know how much you know about this stuff, but if you take, let's say, the word "rain" and you want to search for it in a database, you also want the database, even though you don't ask for it, to search for "rainy", "rains", and "raining". Elasticsearch takes care of that for you through a process called stemming, where it actually indexes and searches on the stem of the word rather than the literal word you put into the search box. Postgres has some full-text indexing that does that to a certain extent; MySQL has full-text indexing but it doesn't do stemming right now. I think there was an effort, when I worked on it a couple of years ago anyway, to put forward the ability to load a stemming plug-in into MySQL, but last I checked it wasn't there. So if you want the kind of fluid search that most users are looking for, you probably want to use something like Elasticsearch, because that's the level of search competence your users are going to be looking for. Even without knowing what they're asking for, they're going to want stemming, and they're going to want correlation of bigrams and trigrams and all that other kind of thing, where it's not just one word I'm searching for but two words, and their adjacency within the text makes them score higher than if they're in two different parts of the text. Elasticsearch takes care of all of that for you. With the other databases you can do it; there are plenty of people that use MySQL's built-in full-text indexing, and plenty of people that use Postgres's built-in full-text indexing, and they work well enough. But if you really want the polished, super responsive, super user-friendly search experience, you're going to want to go with something like Elasticsearch because of the functionality it gives you.

That's great. So Elasticsearch isn't part of the databases group anymore?

No, it's in analytics, because it's used for a lot of analytics-type work; if you think about Grafana and all those kinds of tools that access Elasticsearch, that is actually one of the primary use cases. It is good as a text search engine though, so that use case is still there.

Yeah, makes sense.
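As a rough sketch of the stemming behavior described above, here is a small example using the elasticsearch Python client with its classic body-style API. The endpoint, index name, and mapping are hypothetical; the point is simply that the built-in English analyzer lets a query for "rain" also match documents containing "raining" or "rains".

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint for an Elasticsearch domain.
es = Elasticsearch(["https://search-mydomain.eu-west-1.es.amazonaws.com:443"])

# Index text with the built-in English analyzer, which stems words at index time.
es.indices.create(
    index="articles",
    body={"mappings": {"properties": {"body": {"type": "text", "analyzer": "english"}}}},
)

es.index(index="articles", id="1", body={"body": "It has been raining all week in Stockholm."})
es.index(index="articles", id="2", body={"body": "It rains often in Bergen in the autumn."})
es.indices.refresh(index="articles")

# The query term "rain" is analyzed the same way, so both documents above match
# even though neither contains the literal word "rain".
hits = es.search(index="articles", body={"query": {"match": {"body": "rain"}}})
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["body"])
```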
So, you touched on graph databases early on. Can we talk a bit about graph databases again? It's an interesting topic, and one that isn't talked about enough, I'd say. Amazon Neptune is our graph offering, so what is Neptune?

Neptune is a graph database engine that actually supports two different protocols: it supports Gremlin and it supports SPARQL. These are two different query languages that, in the Neptune engine, are currently stored as two different data sets, and you choose one of them. The RDF-based one, where RDF stands for Resource Description Framework, is a kind of older and more formal way to define the relationships between different things, while the newer one, the property-graph model you query with Gremlin, lets you just make ad hoc connections between things; I feel like it's more applicable to what most people are doing these days. Basically the idea is this: you have two nodes and you've got paths between them, and the paths are one-way relationships. So Tim is friends with Gunnar, but Gunnar is maybe not necessarily friends with Tim; if Gunnar wants to be friends with Tim, you create another relationship that points back the other way. And you create these relationships not just between different people: maybe Tim likes SQL and Gunnar likes SQL, so you can search across these multiple dimensions to find the relationships between different people. This is nothing new in terms of data science, but what happens differently with graph databases is that they optimize the way the data is stored, so queries that connect different relationships and different commonalities between nodes are more performant, and the engine is able to handle billions of nodes and billions of connections between them much more efficiently than you would see if you built the same kind of schema in SQL.

It's useful for lots of things. Social media is the example everybody gives because it's easy to digest, but graph databases are used in things like fraud detection for credit card transactions, and in recommendation engines: if you buy something on a website, the recommendation of something else you might like sometimes happens through a graph database. So there are a number of use cases other than social media that let you take advantage of this kind of data structure. It is a different way of thinking about data, so you need to build your schema again; in the same way that there's a jump to go from SQL to NoSQL, there's also a jump to go from SQL or NoSQL to graph, and you need to think about your data in a different way. (Thanks for the graph guide there in the chat.) It can solve a lot of problems for applications, particularly when you have tons and tons of nodes and tons and tons of relationships between them. As I said before about the difference between SQL and NoSQL, I don't think Neptune is going to supplant the other database engines anytime soon; it's not designed to replace the things the other database engines are already doing. It's a different paradigm, and just like you might wind up with an application that uses both SQL and NoSQL databases to achieve its purpose, you'll often find Neptune thrown in there as a third database option to store the parts of the data that benefit from that kind of organization.
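To illustrate the Gremlin side of that, here is a minimal sketch with the gremlinpython client, modelled loosely on the friends-who-like-SQL example. The Neptune endpoint, labels, and property names are hypothetical, and traversing out a fixed number of "layers" is expressed with repeat()/times().

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder Neptune cluster endpoint.
conn = DriverRemoteConnection(
    "wss://my-neptune-cluster.cluster-abc123.eu-west-1.neptune.amazonaws.com:8182/gremlin", "g"
)
g = traversal().withRemote(conn)

# One-way relationships: Tim -> knows -> Gunnar, and both like SQL.
tim = g.addV("person").property("name", "tim").next()
gunnar = g.addV("person").property("name", "gunnar").next()
sql = g.addV("topic").property("name", "sql").next()
g.V(tim).addE("knows").to(__.V(gunnar)).iterate()
g.V(tim).addE("likes").to(__.V(sql)).iterate()
g.V(gunnar).addE("likes").to(__.V(sql)).iterate()

# Walk the "knows" edges out up to four layers from Tim and keep the people
# who like SQL - the kind of indirect-relationship query graphs are built for.
friends_of_friends = (
    g.V().has("person", "name", "tim")
    .repeat(__.out("knows")).emit().times(4)
    .where(__.out("likes").has("topic", "name", "sql"))
    .values("name")
    .toList()
)
print(friends_of_friends)

conn.close()
```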
Yeah, and building the same type of relations with a relational database would mean that we need more instances, perhaps, or more power to be able to do it?

It's actually more insidious than that, because it's not so much the data engineering as it is the query complexity. Think about the way queries work in SQL, and imagine a scenario with Tim and Gunnar and, let's say, Johan. For me to relate from me to you directly is a fairly straightforward SQL query: select star from relationships where the source is Tim and the subject is SQL, and you get back a list of everybody, including Gunnar. It's when you start to jump multiple generations out in the graph that it gets more complicated, because then, using traditional SQL, I have to run that query a second time to get all the links that you have, and a third time to get all the links that they have, and so on and so forth out to the edge of the graph. A graph database optimizes that: when you're writing your query, first of all you don't have to write a recursive kind of thing like that; you just say "give me all of the relationships to Tim who are interested in SQL and go out four layers". You actually specify how many layers out you want to go, and the engine takes care of building and running that query for you. If you did that in SQL it would be a lot more code and a lot more round trips between your application and the database engine, so not only is it consuming more I/O on the database engine, it's keeping your application server busier and creating a lot of additional traffic between the two because of all the work you're having to do. So it's a very different way of thinking about things, but it's definitely worth engineering your data that way if you have those kinds of questions that you're trying to answer with your data.

So once again it depends on the use case, and this one is purpose-built for that type of relationship question. We have a very specific question as well: I have a single-Region Aurora cluster and I need to enable TLS; is there a way to do it without downtime?

By TLS I assume you mean TLS on the MySQL TCP connection protocol. There is a parameter group setting for that, and I think it's a dynamic parameter group setting, but you need to have your database already connected to a non-default parameter group to change it. Let me explain what I mean by that. When you spin up a new Aurora instance, or any new RDS instance, you choose a parameter group, and there is a default, but the default is read-only, so you can't make changes to it. If you didn't select a custom parameter group when you spun up the Aurora cluster, you will need to create a new parameter group and associate it with the cluster before you're able to change the parameter, and that requires a maintenance-window-style reboot: you can't attach a new parameter group to a running Aurora instance; it needs to be attached at boot in order to be recognized. Once you have that, all of the parameters in the parameter group that are labelled as dynamic in the UI, I think it's called, can be changed while the server is running, without having to reboot it. So the answer to your question is: if you booted your Aurora instance with the default parameter group, you're going to need to do at least one reboot to pick up a new custom parameter group; once you've done that, this and other changes like it become things you can do without having to reboot in the future.
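A hedged boto3 sketch of that sequence is below: create a custom cluster parameter group, turn on require_secure_transport (the Aurora MySQL parameter that forces TLS), and attach the group to the cluster. The names and the parameter group family are placeholders, and as noted above, attaching a new parameter group still requires a reboot to take effect.

```python
import boto3

rds = boto3.client("rds")

CLUSTER_ID = "my-aurora-cluster"     # placeholder cluster identifier
PARAM_GROUP = "my-aurora-mysql-tls"  # placeholder custom parameter group

# 1. Create a custom (writable) cluster parameter group.
rds.create_db_cluster_parameter_group(
    DBClusterParameterGroupName=PARAM_GROUP,
    DBParameterGroupFamily="aurora-mysql5.7",  # pick the family matching your engine
    Description="Custom group so dynamic parameters can be changed",
)

# 2. Require TLS on client connections; this parameter is dynamic, so once the
#    group is attached, the change applies without a further reboot.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName=PARAM_GROUP,
    Parameters=[{
        "ParameterName": "require_secure_transport",
        "ParameterValue": "ON",
        "ApplyMethod": "immediate",
    }],
)

# 3. Attach the custom group to the cluster; picking it up requires a
#    maintenance-window style reboot if the cluster was using the default group.
rds.modify_db_cluster(
    DBClusterIdentifier=CLUSTER_ID,
    DBClusterParameterGroupName=PARAM_GROUP,
    ApplyImmediately=True,
)
```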
All right, great answer. We are approaching the top of the hour, which means it's time to say goodbye to Tim and goodbye to all of you viewers. Thank you very much for being here this week, Tim, and talking about all things databases on AWS.

My pleasure.

I know, Tim, that you're not on Twitter, but if people want to reach you somehow, is LinkedIn a choice, or do you run your own social network in Neptune perhaps?

No, LinkedIn is probably the best way; you can find me on there pretty easily. I'm not a social media person, but feel free to reach out on LinkedIn if that's interesting, and I'm more than happy to take questions that way.

And notice that he is missing one of the s's in his last name, because it is not Gustafsson, it's Gustafson. Otherwise, feel free to reach out to me as well and I'll make sure to forward it to Tim.

Absolutely.

Thanks everyone, have a good day ahead, and hopefully see you again next week!
Info
Channel: Gunnar Grosch
Views: 28
Id: XoKbfHTF19k
Length: 59min 28sec (3568 seconds)
Published: Mon Mar 22 2021