AWS re:Invent 2017: [REPEAT] Which Database to Use When? (DAT310-R)

Video Statistics and Information
Captions
All right, we'll get started. My name is Tony Patterson and I work in the database engineering team at AWS. I have my friend Ian here; he's going to join us later, and we're going to talk about databases and which database to use when. It's a loaded question, and we've got a lot of loaded answers for you.

One thing we're not going to do is start with generalities — relational versus non-relational, SQL versus NoSQL. We're going to dive straight into the more detailed stuff. The idea is that most databases that were ever developed were built for a purpose, and when you find a database that was built for the purpose your application really needs, you have the best match. When your application's needs meet the purpose the database was invented for, you have a good match.

Since this is a database conversation — storing, retrieving, and processing data — let's look at the data first. When I look at which database is right for an application, I look at the application with respect to the data: the shape of the data, the size of the data, and the computational requirements on that data. Those are three really important things that help you decide which database is the right one for your application, so we'll spend some time on them and then dive into more detail.

Shape. Relational databases have been around a long time, since the 70s, and they were built around the row. The concept of the row is a fundamental component of an RDBMS — if you go read the papers from the 70s, they talk about the tuple, which is the row. A row has many columns, but row stores are generally built and optimized for doing things at the row level. The smallest unit of operation is a row: you get the row, you update the row, you delete the row, you insert the row, and most of the work is based on that. Your records are in the form of a row and groups of rows. Most applications built for payroll, HR, finance, and so on model their data as rows, which was great and performed well when you wanted to deal with a record: a new employee joins, you create a record, it fits in a row, all the columns are there, and you operate on the row.

Then years go by, people have lots of data in their databases, and they're doing queries. People want to do sums and aggregations to figure out how much we sold last month, last quarter, last year. You start doing a lot of query processing, and the interesting thing about query processing is that you usually don't run a query that needs the entire row. If you're looking at sales figures, you want a sum of all the sales, or a sum of all the salaries — maybe you need two or three columns. So running analytics workloads on a row store became expensive and inefficient, and that's why the column store was invented. A column store takes the row, breaks it up, and stores each column separately. That gives you great performance and efficiency when your query touches ten million or a billion rows but you're only interested in the one column that has the sales figures, and you're doing a sum or a min or a max. You only work on the subset of the data you need, which is why columnar databases work really well for aggregations, scans, and joins.
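As a rough sketch of the trade-off being described — nothing AWS-specific, just a toy illustration with made-up data — a row layout is convenient for fetching one whole record, while a column layout lets an aggregate touch only the column it needs:

```python
# Row layout: each record is stored (and fetched) as a whole.
rows = [
    {"order_id": 1, "customer": "a", "region": "us-west", "sales": 120.0},
    {"order_id": 2, "customer": "b", "region": "eu-west", "sales": 75.5},
    {"order_id": 3, "customer": "c", "region": "us-west", "sales": 210.0},
]

# Operating on one record at a time is natural and cheap in a row layout.
order = next(r for r in rows if r["order_id"] == 2)

# Column layout: each column is stored separately, so an aggregate
# only has to touch the one column it needs.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "region":   ["us-west", "eu-west", "us-west"],
    "sales":    [120.0, 75.5, 210.0],
}

# SUM(sales) reads only the "sales" column, not every field of every row.
total_sales = sum(columns["sales"])
print(order, total_sales)
```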
Then we have key-value stores, and it gets a little simpler, because key-value stores were invented to store keys and values where you do queries based on the key. The idea was: I need to do a lot of key-value lookups, I need them to be really, really fast, and I need them to stay fast regardless of the size of the data. That's one of the reasons key-value stores are used in so many workloads with vast amounts of data — every time I go get a value using a key, I need that to happen in a few milliseconds, whether I have a billion, a hundred billion, or a trillion items in my table. I need that consistent performance.

Document stores were invented to hold — guess what — documents. A lot of data fits really nicely in the form of a document. For example, if you're building an application to keep track of patient records, a patient doesn't fit in a row. Every time you go to the doctor they take notes; a lot of data is generated, and it doesn't lend itself to the row format. Trying to define a person and their medical history in rows is just not efficient, whereas you can say a patient is a collection of documents, and we add another document every time you see the doctor. It becomes an easy way of modeling things. Document stores were invented for that: they're really good at storing and indexing documents and querying them on various properties of the document.

Graph stores: when you need to persist relationships, a graph store is a really good thing. A lot of people say relational databases should be really good for relations, right? The answer is no — a graph is not really a relation, a graph is a graph. The interesting thing about a graph store is that it actually persists the relationships, so it's really fast at retrieving and processing relationship data. Social graphs, recommendations, things like that fit really well in a graph store.

Time series, of course, is data that is sequenced in time. You almost never update it; it's always an insert workload. What's good time-series data? Stock data: what was the price of a particular stock at a particular second during the day, and you have one of these records every second, forever. And you usually look at time-series data not at a particular point in time but as a set of computations applied to a range of data — what was the min, max, and average CPU utilization of my database servers over the last three days? You get the range of data, apply the computation, and return the answer. The job of a time-series store is to return answers, not to give you the data so you can go compute it later.

A blob or unstructured store is exactly that: it's for storing unstructured data, where you impose the structure on the data yourself.
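To make the key-value and document shapes above concrete, here's a minimal, hedged sketch against DynamoDB with boto3 — the table name, key schema, and patient record are made up, and it assumes AWS credentials are already configured:

```python
import boto3

# Hypothetical table keyed by patient_id.
table = boto3.resource("dynamodb").Table("patients")

# A "document": a nested item with a list of visit notes, stored as one unit.
table.put_item(
    Item={
        "patient_id": "p-1001",
        "name": "Jane Doe",
        "visits": [
            {"date": "2017-11-28", "notes": "Annual checkup", "vitals": {"bp": "120/80"}},
        ],
    }
)

# A key-value lookup: constant-cost retrieval by primary key,
# regardless of how many items the table holds.
item = table.get_item(Key={"patient_id": "p-1001"}).get("Item")
print(item["name"], len(item["visits"]))
```

The same item works for both patterns: the whole nested document is written and read back with a single key.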
Size. The interesting thing about size is that the size at the limit is far more interesting than the current size of your database. What do we mean by size at the limit? Some workloads generate data whose size is fairly well bounded. For example, if you build an app to track the employees of a company and you have 10,000 employees, with aspirations to grow to 100,000 employees, okay — you're going to have 10x more data. You're not going to hire every person on the planet, so the size of the data is fairly well bounded. As opposed to something like: I'm going to keep track of the sensors in a particular car. Cars had 10 or 12 sensors 40 years ago; now they have thousands, and self-driving cars will probably have tens or hundreds of thousands of sensors per car. That is an unbounded problem — you have no idea how big it's going to be, but it'll probably be big. That's a really important element when picking databases, along with working-set size and retrieval size.

Partitioning versus non-partitioning: some workloads lend themselves really well to being partitionable, because you store the data based on certain partition keys and you compute on the data based on the same partition keys. A good example is the location of a car: the location of my car has nothing to do with the location of your car, so it's nicely partitionable — each data element is its own thing. Payroll data, on the other hand, is not really partitionable, because you look at an employee, a group of employees, or all the employees, and you need to be able to apply processing to all of it. So take these things into consideration.

And then there's compute. There are stores that are really, really good at retrieving and storing data, but they don't do a lot of computation on it — that's not their job. They do get and put, and they do it really, really fast. "Go get the rows that match a particular predicate, filter it, sort it" is very different from "go calculate the sum of a particular column across these ten billion rows and give me the answer," where you're perhaps processing a terabyte of data and coming out with the answer "seven," versus the other case, where you're basically saying: give me the row, give me the item, give me the document.

Throughput, latency, change rate, rate of ingestion — these are all things to consider when choosing a database. If you're picking a database to keep track of the location of cars every minute, you get a lot of inserts, because there are a lot of cars on this planet — maybe there are tens of millions of the particular brand of car you're selling out there, and you want to record their location every minute. That's a lot of data to ingest, as opposed to keeping track of employees: maybe you hire ten employees a month, maybe a hundred, maybe a thousand; it's still very small compared to the location of five million cars every minute. So: shape, size, and compute.
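For the ingestion-heavy end of that spectrum, here's a hedged sketch of what insert-only telemetry writes might look like using boto3's batch writer — the table, key names, and data are hypothetical:

```python
import time
import boto3

# Hypothetical telemetry table with a composite key (car_id, ts).
table = boto3.resource("dynamodb").Table("car_locations")

def ingest_positions(positions):
    """Write a batch of per-minute car positions.

    batch_writer() buffers items and sends them in BatchWriteItem calls,
    which suits the sustained, insert-only ingestion described above.
    """
    with table.batch_writer() as batch:
        for p in positions:
            batch.put_item(
                Item={
                    "car_id": p["car_id"],   # partition key
                    "ts": p["ts"],           # sort key (epoch seconds)
                    "lat": str(p["lat"]),    # stored as strings to sidestep
                    "lon": str(p["lon"]),    # float/Decimal conversion here
                }
            )

ingest_positions([
    {"car_id": "car-42", "ts": int(time.time()), "lat": 47.61, "lon": -122.33},
])
```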
Now, you take those into consideration and go pick a database, and "my favorite database can do it all" — that's an idea that exists out there, and sometimes it's true. If you have a very small workload and not a lot of data — say 50 gigabytes — you can probably use any database, relational or otherwise, any brand you want; on a modern machine the whole thing fits in memory and the performance will be fine. But what really matters is the efficiency at scale. I'm not suggesting you go find a specialized database for every workload — there are databases you can use generically — but when stuff gets big, when cost is important, when operational efficiency is important, when you're dealing with 100 terabytes of data as opposed to 5 gigabytes, the choice becomes really, really important.

So, which database to use when? Before we can answer that, there's another thing we need to talk about: managed database services versus unmanaged database services. Why does this matter? Because it changes the cost, the TCO, the convenience, and what developers can do. If you were going to build an application 30 years ago on premises, you would buy an enterprise-class server, buy some database, pay for the licenses per CPU core or whatever, hire a DBA and a system administrator, and that machine would be taken care of really well: well managed, backups always done, and so on. The developers come along and say, "I need a database; I'll put my stuff in this one, because it's well maintained." I've got videos — they go in the database. Invoice images — in the database. Raw data, logs, history — everything goes in the database. Why? Because it was the most convenient thing to do: nobody wanted to buy ten servers, ten different software packages, and ten different DBAs, so you standardized on one thing.

But when you move to the cloud and start using managed databases, a lot of those inconveniences go away. Everything is just behind an API call. Whether you make an API call to put an image in S3, a blob store; or an API call to insert a JSON item into DynamoDB; or you do an insert into a Postgres database like Aurora — it's just an API call. You don't have to manage those servers, do backups on them, fix the memory when it breaks, or replace the disks that die. Developers now have the choice of using the right set of APIs to build an application, and what that leads to is that most modern applications we see people building use a multitude of databases. So the question becomes which databases to use for which components of your application. Take it one step further into building microservices: I have a microservice for the shopping cart because I'm building a retail website — maybe the shopping cart goes into one database; the transaction, when you actually sold something, goes into something else; and the recommendations maybe go into a graph. It becomes a lot more convenient to use them all.

So for us at AWS, the strategy is purpose-built databases that satisfy particular workloads for the best performance, price, convenience, and programmability — without having to tell the developer, "I know you want to use different databases, but you should just pick one and stick with it because it's easier." At the end of the day it's not easier, because you end up with mismatched workloads and mismatched databases.
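A hedged illustration of that "everything is just an API call" point — the bucket, endpoint, schema, and credentials below are placeholders, and the DynamoDB call would look like the earlier sketch:

```python
import boto3
import psycopg2  # standard PostgreSQL driver; also works against Aurora PostgreSQL

# Blob data: one API call to S3.
with open("inv-1234.png", "rb") as f:
    boto3.client("s3").put_object(
        Bucket="my-invoice-images",
        Key="invoices/2017/11/inv-1234.png",
        Body=f.read(),
    )

# Relational data: one call to a managed Postgres endpoint, no server to babysit.
conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    dbname="billing", user="app", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO invoices (invoice_id, customer_id, total) VALUES (%s, %s, %s)",
        ("inv-1234", "c-77", 199.00),
    )
conn.close()
```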
Back to some generalities. Most workloads you can roughly split between analytics and operational. Analytics workloads are about getting insight. It may be retrospective, as in: how much did we sell last quarter, last year, the last five years — you look at history. It could be streaming: data coming in from devices, and you need to raise an alarm when something gets too hot or too cold. And there's the predictive stuff, where you look at a body of data and try to figure out what's going to happen tomorrow, which is very different from the retrospective stuff, which tells you what did happen yesterday. Why is the predictive stuff important? Because most applications are now starting to employ some kind of machine learning to improve. Is it going to rain tomorrow? You probably have to look at the history of the weather data and the current conditions, and maybe you'll guess whether it's going to rain. Or how many people should the Starbucks downstairs have on staff tomorrow, given there are 42,000 people running around? It would be good if they could predict some of that better.

On the operational side, you have your transactional workloads — take five dollars out of this account, put ten dollars into that account. There are systems of record — where is Tony's car, where are the employee records, where is this patient's record. And there's content management — I spin up a blog using WordPress, and there's probably a MySQL database somewhere in there.

So let's talk about the operational stuff. There are some basic characteristics of operational databases. They're usually a good fit for caching. The compute size is usually small — you're dealing with one patient, five patients; a doctor can only see so many patients a day. They generally require low latency: nobody is willing to wait a full second for their credit card transaction; that needs to happen really fast. High throughput, because there are a lot of people trying to do a lot of little things. High concurrency: there are 10 million cars out there, all emitting telemetry saying where they are, and you need to insert a lot of them at the same time. And almost all operational data has mission-critical HA/DR and data-protection requirements — don't lose this stuff, backups, and so on.

Then there are the things that are useful to think about when you're making decisions. The size at the limit: is it bounded or unbounded? Are you going to model your data as rows, or key-values, or documents? Is it a graph? Do you need relational capabilities? Do you need to be able to push compute down to the database, or do you just need the database to give you the data so you can do whatever you need in your application? What is the change velocity? For example, the telemetry coming from a car never changes — nobody goes back and says, "No, no, Tony's car wasn't parked here, it was parked there" — it's an insert-only workload. As opposed to Tony's salary, which is continuously updated and increased because I'm really good, or I like to think so — that's an update workload; it really does get updated. And then there are the ingestion requirements: the ability to take the data in.

So I'm going to ask Ian to come up so we can actually have a debate about which databases to use, but this is a good summary of how to think about it: relational databases are really good if you need referential integrity, strong consistency, and transactions across five tables. Key-value: low-latency key-based get/put, high throughput, partitioning, unbounded data size. Documents: really good for documents and anything you model that way — a 17-level-deep JSON document is probably a document for a document store. And then there is graph.
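As a small, hypothetical sketch of the relational strengths in that summary — referential integrity plus an all-or-nothing transaction spanning tables — using plain psycopg2 against any PostgreSQL-compatible database (the schema and connection details are made up):

```python
import psycopg2

# Hypothetical schema: accounts and ledger tables with a foreign key between them.
conn = psycopg2.connect(host="localhost", dbname="bank", user="app", password="secret")

try:
    with conn, conn.cursor() as cur:
        # Both statements commit together or not at all; the database enforces
        # the foreign key from ledger.account_id to accounts.account_id.
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE account_id = %s",
            (5, "tony"),
        )
        cur.execute(
            "INSERT INTO ledger (account_id, amount, memo) VALUES (%s, %s, %s)",
            ("tony", -5, "transfer out"),
        )
finally:
    conn.close()
```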
So, Ian and I work together at AWS, and most of these slides were made by Ian — so if they're good, it's his work.

We've got Amazon RDS. We're not going to go through and explain what Amazon RDS is and what the various engines are, because chances are, if you're interested in databases, you've heard of them — and if you haven't, you can go to the website and see all of it. But we do have a lot of engines. So tell me, what would be a good workload for, say, Postgres, and why would I pick Postgres over something else?

Sure. RDS is a great general-purpose database service that has the ability to start very small and then grow with your business, and it offers all these different engines. We see RDS being a great place to start when you're building an application where you want to use common frameworks like Ruby on Rails or Django, and you may choose a particular engine on the basis of the skills in your teams, the frameworks you happen to be using, or third-party packages you're bringing in from outside. You can really tailor it, knowing that the underlying principle of these applications is what we touched on with row stores: the database is going to validate some data for you, it's going to help you manage uniqueness constraints, and it's going to let you push computation down to the database rather than having to write it in your code. You'll have a database-centric model.

So if I have some workload on premises — maybe I built an app and I don't want to touch the app, but the hardware is getting old and starting to not be so available anymore, and maybe it was written with Entity Framework — what would you recommend? Do I rewrite it? Do I put it in Aurora, do I put it in Dynamo? What do I do?

Sure. For bringing applications into the cloud, we most commonly see people start with RDS, and the reason is that they can tailor the database and the engine to exactly those sorts of requirements. We have a SQL Server option for when you're bringing in an IIS or .NET application. Very commonly, what you're trying to achieve with this kind of operational database is to get out of the business of doing so much management, so you can spend your time making the application better. You don't want to do backups, you don't want to do patching, you don't want to do all the hard work of running a database. You say: I know it's compatible, and that lets me just focus on my business.

Now, I've got developers, and they have their preferences with respect to languages — a lot of people love to write code in Python and Django. What do you have for them? Say I'm going to build an application to keep track of surveys in my company; it's not going to get very big, but it's important stuff.

Sure. Again, one of the reasons you choose a particular engine is the skills you have in-house. For instance, we see a lot of affinity between the Python community and Postgres, whereas, as Tony said, a content management system will often have a MySQL disposition just because that's what's been used in the past. Or you may have an application running on premises on Oracle that you just want to bring in. So it really is about focusing on what your teams can do and what's going to help you be most productive — and the nice thing about this type of service is that it cuts across all of those different engine considerations and just gives you a really simple-to-operate database.
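For the Django case specifically, the wiring is just the framework's normal database settings pointed at an RDS endpoint — a hedged fragment with placeholder names and credentials:

```python
# settings.py fragment for a Django app pointed at RDS for PostgreSQL.
# The endpoint, database name, and credentials are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "surveys",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "surveys-db.abc123xyz.us-east-1.rds.amazonaws.com",
        "PORT": "5432",
    }
}
```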
So then tell me: what's the difference between Postgres running in RDS, Postgres running on EC2, and Postgres running as part of Amazon Aurora?

Sure. When you pick an engine, you can choose between different types of storage — that's the contrast between using Aurora storage for Postgres versus standard RDS Postgres, which uses EBS-based storage. But the real difference from running it yourself is that you're not going to spend your time managing the database: you're not going to think about patching the server or doing backups yourself. Then the difference between Aurora storage or not is really about your availability characteristics — how quickly you need to fail over. Aurora is a fantastic storage environment because it's always multi-AZ; that's a real differentiator for these engines. Traditional relational databases work with a primary node that takes responsibility for all the writes, and in RDS you can spin up read replicas for certain engines. And as we heard, we're now going to support multi-master for where you have very, very high availability requirements. So, for massive-scale relational workloads — which can get pretty big.

Yes. And if I'm writing an app from scratch, so I don't have preferences and I'm not using anything on-prem, what would be the go-to place if I need relational capabilities?

Aurora, every time — and then choose which engine to use based on the features of the application you're building. A very common request is: I'm building an application that's going to have some ability to track location. There we see people gravitate towards Postgres, because it has PostGIS — the ability to run geospatial queries in a relational context. In other cases you may say: my team is really comfortable with MySQL, but I need massive scale, I need the storage volume to scale up dynamically, and I need it to give me a huge number of read and write IOPS. In both cases Aurora is a great place to start; it allows you to scale to extremely high scale over time.
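As a hedged sketch of that PostGIS point — a geospatial query run inside the relational engine, assuming a hypothetical fleet database with the PostGIS extension enabled and a cars(car_id, location geography) table:

```python
import psycopg2

conn = psycopg2.connect(
    host="fleet-db.cluster-abc123.us-east-1.rds.amazonaws.com",
    dbname="fleet", user="app", password="secret",
)
with conn.cursor() as cur:
    # Cars within 500 meters of a point, using a geospatial index
    # that the relational engine maintains for us.
    cur.execute(
        """
        SELECT car_id
        FROM cars
        WHERE ST_DWithin(
            location,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            500
        )
        """,
        (-122.33, 47.61),
    )
    nearby = [row[0] for row in cur.fetchall()]
conn.close()
print(nearby)
```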
So, relational workloads: I need relational capabilities, I like SQL as a language, I use Java or whatever — you have a variety of engines for that. But there is a limit, right? Like if I want to record the location of every car every minute and have massive input coming in.

Yeah, exactly. RDS is one of those systems we'd describe as bounded at the limit: you provision a certain amount of storage and you grow into that storage. With Aurora you get more flexibility, but ultimately there is a size of database where some of the principles you mentioned — like completely predictable performance regardless of the size of the data — mean we need to look at something a little more sophisticated, with features that relational databases traditionally haven't been able to support.

So, Dynamo?

DynamoDB. It gives us the ability to break out of the boundaries we find in a relational database, with the relational validation that happens inside the database, and lets us operate much more efficiently at unbounded scale.

Okay. Is it one or the other?

No. You'll want to look at which data sets really do act unbounded at scale and choose the right database for those requirements. If you have a lot of pushdown computation, you may put that against data that sits in an RDS database. And for certain things — the shopping cart example you used is a fantastic one — the ability to transact with really high availability and really high throughput, where you'll never run out of capacity, is what makes your business successful: you put that on DynamoDB. Maybe the customer records stay in a relational database so you can do fast mutation and validate that addresses are correct, those sorts of things.

That's a good breakdown. I was talking to a customer earlier, and they were telling me they have a bunch of data, it's relational, and they put it in Postgres, but they have massive read requirements and it needs to be global — they're a global company, they need it everywhere, and they need it kept in sync. So they basically read from RDS Postgres, stuff the computed results into Dynamo, and then scale the reads off of Dynamo. So being able to mix and match —

Absolutely. As soon as you want to scale relational data sets in the way you're describing — for instance using globally replicated Dynamo tables — you do have to think about it in a similar way to caching: your application takes responsibility for pulling the data out of an operational store that's good at storing those elements long-term, moving it into DynamoDB, and managing the consistency between the two systems. It's something you need to work at, but it then becomes very trivial to scale that working set to virtually unlimited size and presence around the globe.

So if I end up using a couple of different engines but I need the data to move easily between them, what are the options?

Sure. Moving data between data sources is something the Database Migration Service is particularly good at, and you may also look at DynamoDB update streams. Update streams let you listen to changes happening in your Dynamo environment and then replicate them down into a relational database, for example. So there are absolutely integration patterns that let you move data in and out of these different environments without having to do a lot of ops.

Can you trigger a Lambda function with these update streams to do stuff?

Yeah, absolutely — that's one of the native integrations for DynamoDB update streams. You can run computation on the basis of an update event coming out of DynamoDB and then do something useful with it. That could be moving data into cold storage in a different database, or implementing replication, for example.

Cool.
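A minimal sketch of what such a stream-triggered Lambda handler looks like — the routing target (a relational copy, an archive, a search index) is left as a comment because it's up to you:

```python
import json

def handler(event, context):
    """Lambda handler wired to a DynamoDB stream."""
    records = event.get("Records", [])
    for record in records:
        event_name = record["eventName"]                 # INSERT, MODIFY, or REMOVE
        keys = record["dynamodb"]["Keys"]
        new_image = record["dynamodb"].get("NewImage")   # absent for REMOVE
        print(json.dumps({"event": event_name, "keys": keys, "new": new_image}))
        # e.g. replicate INSERTs into another store, archive REMOVEs, etc.
    return {"processed": len(records)}
```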
So you have this thing called DAX, for DynamoDB caching. Just give me the 15-second story.

Sure. We talked about DynamoDB being able to run you at a scale that's unbounded; DAX then gives us virtually unbounded low latency. We implement our application on DynamoDB so we get really high availability and really high performance, but we can also add DAX, which drives the latency down to microseconds — without having to worry about managing the complexity of cache management, without having to think about whether the data in my DynamoDB table is up to date versus my cache. The reason is that DAX is what we call a write-through cache: I do the writes to DAX, and DAX takes responsibility for writing to DynamoDB. I get all the benefits, I get API compatibility with DynamoDB, but I get this massive acceleration in performance. A really nice thing to have where you want microsecond-level performance.

What about ElastiCache? When do I use that?

Sure. ElastiCache is for when you want to add caching but you're using a different type of operational store. The real difference between something like DAX, which is a write-through cache, and ElastiCache, where you're adding it onto one of these other operational systems, is that your application takes responsibility for keeping the cache and the data store consistent. There are lots of frameworks that will do this for you — we talked about choosing RDS as a very general-purpose database because you might be using some kind of entity-management framework, and very consistently those frameworks let you plug in a cache on the side and do the hard work of keeping it up to date, for really fast response times. ElastiCache gives you Memcached and Redis interfaces, so you have an open approach: bring your entity-management frameworks, use a standard engine, and we do the hard work of operating it.

So: for low-latency reads from Dynamo, I can use DAX — DAX is specialized caching for Dynamo. ElastiCache is a generic caching mechanism with Redis or Memcached; you can put it in front of any of the RDS databases, and you could probably put it in front of Dynamo as well if you really wanted to.

Absolutely.

All right. So if you need low-latency reads and the data set you're operating on fits in the cache, caching is a good thing. If your data is bigger than the cache and every access is a cache miss, you're just wasting your time missing on the cache. There are some intricacies — not everything is cacheable, obviously, and you have to be careful about that.
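Here's a hedged sketch of the cache-aside pattern being described for ElastiCache, using the standard redis-py client against a placeholder Redis endpoint; in this pattern the application, not the cache, owns consistency with the backing database:

```python
import json
import redis

# Hypothetical ElastiCache for Redis endpoint.
cache = redis.Redis(host="my-cache.abc123.0001.use1.cache.amazonaws.com", port=6379)

def get_employee(employee_id, load_from_db):
    """Return an employee record, preferring the cache."""
    key = f"employee:{employee_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit
    record = load_from_db(employee_id)          # cache miss: go to the database
    cache.setex(key, 300, json.dumps(record))   # keep it warm for 5 minutes
    return record
```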
All right. We announced Neptune because we've recognized that graph is a really important thing — I mean, Facebook uses graph databases, and that's a pretty good app. So why would you use a purpose-built, specialized database for something like relationships?

Because it's actually really hard to manage graphs and relationships in any other type of operational store. We have some other analytical systems that support graphs, like Elasticsearch, but not with the kind of performance you're going to get from Neptune — you really do need a specially built database for that. As you mentioned in the intro, think about making recommendations: to answer "what pieces of content are similar to the piece of content I just said I liked," I have to have a relationship between those two things, probably a similarity. Or to answer "people who bought what you just bought also bought these other things," I need the ability to create links between people and behavior — and those relationships themselves actually have traits, properties that are interesting and meaningful. Modeling data that way in a relational database is really complicated. You can absolutely model it in DynamoDB, but graph databases are specialized for querying that data, and there are specific algorithms for graphs that are really important.

I recently got a lecture on this, because I was saying I can model relations in a relational database and write queries — and then somebody showed me the SQL text for computing the relationships between family members versus the graph version in Gremlin, which is like a one-line statement. The idea is that it's really difficult when you have to compute the relationship between, say, myself and every family member in a relational database, because you compute that relationship at query time, as opposed to a graph store, where the relationship is persisted: you just get the relationship and walk the nodes.

So, quick show of hands in the audience: has anybody here ever written a recursive query in a database? Common table expressions? CONNECT BY PRIOR? Yeah — those hands, you need a graph database. [Laughter] It could help in some cases.
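For a feel of that one-line style, here's a hedged sketch with the open-source gremlinpython client against a hypothetical Neptune endpoint; the vertex and edge labels ("person", "knows") are made up, and the traversal is illustrative rather than anything shown in the talk:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder Neptune cluster endpoint.
conn = DriverRemoteConnection(
    "wss://my-neptune-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)
g = traversal().withRemote(conn)

# "Friends of friends" in one statement: the relationships are persisted,
# so the query walks edges instead of recomputing joins at query time.
names = g.V().has("person", "name", "tony").out("knows").out("knows").values("name").toList()
print(names)
conn.close()
```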
So, operational databases and the dimensions you look at. We look at the size at the limit: is it bounded or unbounded? If you really have data that is unbounded, you should think about a system that gives you constant performance regardless of size; if you have a database whose performance gets slower and slower the bigger it gets, you're not going to have constant performance, and that's a problem. If your data models best as key-value, use a key-value store. If you're always operating on a record in its entirety, then a record — a row — is a good thing. These are some of the things to look at as you go through the decision process. And I have to admit it's not a simple thing — it's not "here are the seven questions; answer them and you'll know exactly which database to use for every workload." If it were that simple, a lot of us wouldn't have a job.

So let's talk about analytics — this retrospective, predictive, and streaming thing. Historically, retrospective analytics came first, because people were able to query transactional databases and see what happened in the past. Streaming analytics came along the way, with machine learning and people looking at telemetry coming from devices. And for the predictive stuff — if you don't think you need ML for your applications, you should really consider why you think you couldn't benefit from it, because a lot of workloads do.

For analytic workloads, you're almost always better off with some kind of columnar format, because it's the most efficient. Why is a columnar format most efficient for analytics? Because data in a column tends to repeat, and therefore it's very compressible, whereas adjacent columns in a row are not related — my age is not related to my first name. But take the age of everyone in this room: that repeats a lot — there are probably a lot of people here who are 37 years old — so it's really easy to compress. Analytic workloads are usually large, and they require a lot of compute and memory: you're looking at a billion rows, computing things, doing joins and so on. And analytic workloads are rarely updated: it's an insert, maybe a delete — bring in a new partition, get rid of an old partition — but you don't go and change history very often, so in-memory capabilities help most analytic systems.

The things you then really have to consider: do I need to do streaming analytics — do I need insight from a stream of data coming from somewhere? It could be devices; maybe you run a train system and you're emitting telemetry about the temperature of the brakes from all the trains running in the city, and you need to know when the brakes on a particular train get too hot. That's a real-time thing, so you need to analyze the stream of data all the time — that's streaming analytics. Do you do ETL — do you move data back and forth, and how much processing do you do? Do you want serverless, or do you need servers that are always there, because you have a data warehouse that's used 24 hours a day, every day? And what is your data format? Those are some of the things that guide you.

So we have Amazon Athena, one of our interactive analysis products. When would I use Athena? It's not a database, though, is it?

No — and it's fantastic, because it allows you to treat data at rest, structured or unstructured, like it was a database, and that's a really powerful concept. We started this conversation around analytics with Athena because it's probably the place you'll start when you're working with your data. A very common pattern is that you ingest your data into a format that's good for the type of data you're storing: if you're working with lots of log files, you store them as log files, and you should compress those every time; if you're working with much more structured data, a format like Parquet or ORC is very useful because it implements that compression. But once you've written it down and it's sitting on a storage system, Athena gives you the ability to just start working with it. It's not a database; it's an engine for working with data. It lets you write SQL as though it were a database.

All right, I'm a little confused. Say I have an OLTP system where I do transactions — I sell stuff — and I have the last five years' worth of sales data. I export it out of my OLTP system as a comma-separated file, I put that file on S3, and I've done nothing else. You're saying I can now query this file with SQL?

That's right. You're paying for the storage on S3 for data you should be keeping anyway, and then you can issue a query using ANSI SQL. There's no service to provision, no babysitting of that environment over time — you simply pay for the data that streams through Athena, and you get your answer back. You can use it interactively with the Athena console, or you can use an API and save the results down to another location.
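That API path might look roughly like this with boto3 — a hedged sketch that assumes a Glue database and table already describe the CSV on S3, plus a results bucket you own; all names are placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM sales_export GROUP BY region",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes; you pay for the data scanned, not for servers.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```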
So why wouldn't I just take this export file, immediately import it into a columnar database, and start querying it there?

Because maybe you don't know what the shape of it is. Maybe you're integrating with a third party that has been known to add or remove columns from time to time — that's a really common case we see. In some cases you will import data where you know the schema, and with the Glue Data Catalog and Glue crawlers that data on S3 gets analyzed and lifted into a format Athena can work with. Or you can work with data that has no format at all — we call it schema-on-read semantics: you apply a schema at the time you issue the query. So you don't have to work only with data you understand — and in fact you might not yet know whether it's useful.

Got it — so you might not want to import it until you know what's useful. So if I get a giant data dump from some group in my company about every failure in the brand of cars we make over the last ten years — say a 15-terabyte pile of data — and I dump it into S3, you're saying I can start querying it to figure out what's in there, whether there's interesting stuff in there, before I start buying databases, provisioning things, and loading it?

Yes, and that's why we say it's really the first step on the journey of doing analytics. You should absolutely be investigating your data and doing data exploration to establish its value — to start to mark data sets by their value, and potentially enhance them with other data sets, so you can really do that exploration.

And if I have a data warehouse with ten years' worth of data in it, but I usually just use the last two or three years, can I take the old stuff and just dump it in S3? It's there, I can query it any time I need to, and it saves me a lot of money, because I'm not paying for databases running all the time to hold data I never use but am not willing to throw away — and it stays queryable if somebody wants it.

Yep.

Cool, that's pretty handy. All right — Redshift.

Yeah. You just said: I have a data warehouse, I want to issue queries, and I want to do some cost optimization. Redshift is really the counter side to data exploration. You select Redshift because you understand your data, you understand its shape, and you understand its value, and you import it into a system designed for extremely fast query times at scale. There's also the ability to integrate in an Athena-like model with Redshift Spectrum, so you don't have to choose either/or: you can use Athena to establish that some data is actually useful, generate new data sets, and immediately start querying those from inside your enterprise data warehouse, in a hybrid compute-and-storage model. You get the best of both worlds — extremely fast query response times from data stored within the Redshift data warehouse, plus the ability to use the data at rest.

So this is because the data in Redshift has been consumed — converted from a bunch of text in a CSV into internal database formats, into numbers and whatever — and it's been indexed, and now it's just queryable. You don't have to process the text or anything at query time.

That's right. You've probably modeled your data in a way that's sympathetic to the types of queries you're going to get, as opposed to Athena, where you're issuing a query you couldn't necessarily anticipate and you might be modeling the data with a regular expression or a grok expression. With Redshift, you know what the data is, you know the value it adds, and you've done all the things you described so you can get that really, really fast query time.
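A hedged sketch of that load-once-then-query pattern — connecting to a placeholder Redshift cluster with psycopg2, pulling the S3 export in with COPY, then running the kind of aggregate a dashboard would ask for (table, bucket, and IAM role names are made up):

```python
import psycopg2

# COPY pulls the S3 export into Redshift's internal, compressed, columnar
# format once, so dashboard queries afterwards don't re-parse any CSV text.
conn = psycopg2.connect(
    host="analytics.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="app", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        COPY sales
        FROM 's3://my-exports/sales/2017/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV IGNOREHEADER 1
        """
    )
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY 2 DESC")
    for region, total in cur.fetchall():
        print(region, total)
conn.close()
```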
You know, the CFO always wants to know: did we make the numbers? What were the numbers, the sales, the profits? Those are well-understood questions. The VP of Sales always wants to know who the top salespeople were and whether they made quota. If you're going to build a dashboard for people to look at those reports, and you want the dashboard to be really fast — people won't wait more than a second — then you want the data to be in a database where it's already processed and indexed for fast queries. As opposed to something like Athena, where you say: okay, go open this terabyte of stuff, figure out what the schema is, do some conversion, and when I ask for a sum, go read all of it. It might take longer — but then, of course, you didn't pay for unused servers sitting idle for days and days.

That's right. So both Athena and Redshift are about doing retrospective analysis of data that you've imported into an environment — you've written it down, you've said "I want to keep this," you want to be able to establish trends and train models. But then you take those retrospective measures and want to apply them in the moment. Kinesis Analytics is about the now: the ability to do analytics second by second as the data comes in. And the common thread that runs across these analytics tools is SQL — Athena uses ANSI SQL, Redshift uses ANSI SQL, and Kinesis Analytics uses streaming SQL. It's really about moving historical analysis into that very short window of decision making, where you need an answer very quickly and your business needs to be able to react — there needs to be business value in making a decision quickly. An example would be a customer-support case where you have an unhappy customer: the sooner you know about it, the sooner you can react, the sooner you might understand that the consumer's device is failing, proactively address their issue, and make sure they have a good outcome — and the sooner you do that, the better off you are.

Or the telemetry coming from the trains: the train is going up the hill, the brakes are hot, and there's no reason for the brakes to be hot while you're going up the hill — because when you start going down the hill, something bad is going to happen. You want to know that and raise an alarm beforehand. You don't have time to index that data into a data warehouse and query it later.

That's right, that's right.
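On the producer side, the telemetry in that example might be pushed into a Kinesis stream like this — a hedged sketch with a hypothetical stream name and payload, which a Kinesis Analytics streaming-SQL application could then evaluate as it arrives:

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def emit_brake_temperature(train_id, brake_temp_c):
    kinesis.put_record(
        StreamName="train-telemetry",
        PartitionKey=train_id,  # keeps each train's readings ordered within a shard
        Data=json.dumps({
            "train_id": train_id,
            "brake_temp_c": brake_temp_c,
            "ts": int(time.time()),
        }).encode("utf-8"),
    )

emit_brake_temperature("train-7", 142.5)
```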
So, for making a decision in the now, or making a decision retrospectively, we've been doing a lot with SQL. There are other cases where you want to be much more flexible about how you can ask a question of your data. So tell me — why does Elasticsearch even come up as a database here?

Well, it does store your data, in the form of specialized indices for doing specific types of query. You might have an index that's particularly good at natural-language search, another type of index that's really good at doing a time-series expansion or interpolation, and another that's particularly good at geospatial manipulations. In these cases you want to do analysis that isn't about "select star from whatever" — it's about having a lot more flexibility to explore your data in different ways that are probably more natural for that exploratory case. And certainly, given that Elasticsearch has Kibana bundled in, you then have the ability to create graphs and dashboards based on these much more interesting types of queries — much more powerful for certain types of problems.

So if I were storing documents in, say, Dynamo — big giant JSON with free text — you told me the update stream from Dynamo can send out the changes as they happen, and I can have a Lambda trigger that calls Elasticsearch and indexes the data that's in Dynamo, so that my application can query Elasticsearch for a particular keyword and find all the matching records in Dynamo. Right?

Exactly — or apply a time-series expansion to a stream of IoT data that you're storing in DynamoDB, because it arrives at extremely high throughput rates and you need to be able to find anomalies, for example.

All right. So the things that matter in picking the right database for analytics: is it streaming or not? Is it serverless? Do you have ad-hoc queries, and do you find value in that? Do you process, prepare, and index in place or not? Do you pay per query, or do you have dedicated servers you pay for all the time? Those are the interesting dimensions. On that last line, just to spend a few seconds: if you have data sitting in S3 that happens to be CSV and you query it with Athena, you have to convert the text into numbers and such, because you want to do sums — so you pay that brute-force computation while you're querying. But Athena can also open other formats, like JSON and Parquet, and Parquet is a columnar format, which is very handy. If you actually store your data as Parquet in S3, then when you run Athena on it you don't have to do the conversion — you don't pay for it at query time, because the data is already converted and you can just query it.
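Converting an export to Parquet once, up front, can be as simple as this hedged sketch using pandas (with pyarrow installed) and boto3 — file, bucket, and key names are placeholders:

```python
import boto3
import pandas as pd  # requires pyarrow (or fastparquet) for Parquet support

# Convert a CSV dump to Parquet once, so Athena reads a compressed
# columnar format instead of parsing text on every query.
df = pd.read_csv("sales_export.csv")
df.to_parquet("sales_export.parquet", index=False)

boto3.client("s3").upload_file(
    "sales_export.parquet",
    "my-data-lake",
    "sales/parquet/sales_export.parquet",
)
```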
There are a lot of interesting combinations you can get to with these workloads. Among very large customers, I have yet to find someone who says, "No, we have one data warehouse, it's one kind of database, that's all we need, we get all our insights from that one thing" — not these days, when people are trying to find insight in all kinds of data. What we do see is a lot of hybrid systems: you might have Redshift, you might have Athena, you might be doing streaming, you might have your operational databases wired to your analytics systems, you might be going through S3. Maybe your operational databases occasionally export and dump things into S3, which you then query with Athena to see if there's anything useful in there — and if there is, you might write an ETL job to go from your OLTP system directly to your data warehouse. The nice thing about these combinations is that you start using the right set of tools: maybe you don't need servers that are up all the time because you only occasionally query some data, and Athena comes in really well there; but for the dashboard that needs to be active all the time, you use Redshift — or you use a combination of all of them. Which brings it back to the portfolio: choosing the right database is not about choosing one right database for everything you do; it's about choosing the right set of databases to apply to the right sets of data for the particular applications you have. You'll end up using a combination, and that's why we push the managed stuff — if you're going to use a combination of databases, you certainly don't want to have to figure out how to do backups on DynamoDB and Aurora and Redshift, because that's really painful. We do that for you, so you don't have to deal with the manageability part.

This is kind of the stack. How do you move data — something like Glue, where you can do your ETL. You have your operational databases, you have your analytics databases, and then things like QuickSight sit on top for visualization, dashboarding, and reports. And then the announced SageMaker, which is basically our machine-learning capability that you can apply to your analytics workloads to build models — and maybe you even start using some of those models in your operational databases, say for scheduling staff at the restaurant on busy days.

So this is the picture we're driving toward. Purpose-built databases are more efficient at scale, so if you're building big stuff, don't try to force everything into one database because you think it's more convenient to have just one — diversified backends for databases ultimately end up being better. We try to make sure everything is managed, so you don't have to take on the inconvenience. And we always try to stay open: ElastiCache with Redis, or Postgres, or MySQL — none of these are AWS-only things. If you want to take your MySQL database somewhere else, you can. That's the idea of remaining open: you can bring stuff in and you can take stuff out — bring in your MySQL or SQL Server or Oracle, or take it out. And of course there's customer obsession: you tell us what doesn't work for you, you give us feedback at re:Invent, we go spend the next 12 months fixing the stuff you want us to fix, we come back here, and we start the cycle all over again.

So I'll pause here. We have four minutes, so we'll take a couple of questions. If you have a question, please use the mic so people can hear it.

Question: You mentioned DAX. In the case where you're using DAX, is there any need to use ElastiCache — especially if you want to run geospatial queries?

Sure. DAX is really going to give you the ability to query through to DynamoDB. If you have other types of queries against other data sources — and you mentioned geospatial, for example — you might run that query against a Postgres database and then push the geospatial coordinates into a cache so you can consume them over time. Pick up my most recent route, or route options, for example: if you were building a mapping application, you might compute a few routes and cache them, so if somebody comes along they can just pick those up. You could also use Elasticsearch for that, and put a cache in front of it for the geospatial types.

Thank you.

Question: Where do you see a data lake sitting in all of this?

Sure. The data lake is really going to be the foundation of data at rest for all of the different types of data you're ingesting, and it gives you the ability to make these choices about which engine to use for a particular problem. We're starting to see the data lake as the foundation of data storage, and then you project that data-lake data into an engine to solve a problem — so I think they work very well in concert. The data lake also gives you the ability to change over time and take advantage of new technologies as they come along. A great example is Neptune: if you have a query that you're servicing today with a relational database, you have the ability to push that data through your data lake, load it into Neptune, and start using an A* query, for example.
Okay. Question: What is the difference between Athena and Redshift Spectrum, and what is the use case for each?

Great question — it comes up very frequently. Redshift Spectrum is for data that is optimized for working within, say, a star-schema model, because it is a data-warehousing workload; Spectrum is the ability to separate compute from storage within an enterprise data warehouse architecture. Athena is very different in that you apply a schema either through the data catalog or when you actually issue the query, so you have a lot more flexibility in the types of data you can query with an engine like Athena. Spectrum is for well-structured, well-modeled data where you're getting the price point of S3, because that data is probably not as high-value. In fact, you can share data between clusters using Spectrum, and you can query the same data from Spectrum and from Athena and vice versa, so you don't have to make one choice versus the other — but certainly the source data for Spectrum tends to be well-modeled, data-warehousing-type data. You also have to have a provisioned Redshift cluster to use Spectrum; that's where that compute runs.

Follow-up: Can we take the structured data and, instead of loading it in, keep it in S3 and then use Spectrum for querying it?

Yeah, absolutely. It's a very common pattern. We talked about the hot working set for an operational database — analytics has the same thing. You may say, "We really only tend to query a year's worth of data," so you keep that in your data warehouse, you keep everything forever in S3 behind Spectrum, and then you only need to size your Redshift cluster for the hot data; you can still reach through into the data store using Spectrum.

Thank you.

Question: What do you say about Cassandra? Is there any recommendation for shifting to one of the Amazon-supported options?

Of course, DynamoDB is a good place for that, especially now that we have global tables, which we announced today, where you can have replication across regions and do master-master — a lot of people use Cassandra for that, and lots of people run Cassandra on EC2. One of the reasons many customers we see move from Cassandra to Dynamo is that they just don't want to manage the Cassandra cluster. Otherwise, you have the option of running Cassandra on EC2, or using DynamoDB — that would be the place to go. It's also important to call out that DynamoDB is quite unique in that it supports both a document-oriented interface for working with data and something that's more like a column-family database, like Cassandra — you kind of get both; it's just which API you use.

Question: Following up on the document-database part of Dynamo — the use case we have is JSON or XML stores that we'd like to store and be able to query by different attributes. Is that possible with Dynamo?

You have the option of using global secondary indexes and local secondary indexes, so you would be able to index on a property for that.

Could all of the properties be auto-indexed?

We don't have an auto-index for all properties — and you have to take care about just indexing everything all the time, because it has consequences: you have to pay for it.
So we would recommend that, instead of auto-indexing everything, you index the things you really need. And if you really want to, you can also use Elasticsearch to index the things you think need indexing in a different way than the indexes Dynamo provides, using Dynamo streams to send the updates to Elasticsearch — that's a really good way to go.

Thank you.

Last question: Our team has started exploring predictive analytics for strategic planning. The way they do it currently is they get a data dump in CSV format, but it's multiple spreadsheets, and they manually do lookups across the spreadsheets and combine them into one. How can Athena help me combine these spreadsheets? Do I have to do it manually and put the result on S3 so I can query the data I want, or do I have to use some kind of Lambda to combine the spreadsheets and then use Athena over it?

Yeah — there are a lot of different options for how you would mix that data together and enhance it. Certainly you have the ability to import the CSVs via your data catalog and then issue queries that mix the data together. You don't necessarily need to go at that problem with an ETL-like approach; you may be able to query it at rest, combine it, enhance it, and then write out a more structured, modeled data set that meets the business need. It's a very common use case. It's also worth mentioning that if the CSVs are useful in their own right, they can be imported straight into QuickSight and you can just start using them.

Thank you.

All right, we're out of time. I'll be hanging around here if you want to come and chat for a little while. Thank you.
Info
Channel: Amazon Web Services
Views: 32,833
Rating: 4.9154334 out of 5
Keywords: AWS re:Invent 2017, Amazon, Databases, DAT310-R, Aurora, DynamoDB, ElastiCache, Redshift, Database, Migration
Id: KWOSGVtHWqA
Length: 62min 54sec (3774 seconds)
Published: Thu Nov 30 2017