AWS re:Invent 2019: [REPEAT 1] Amazon DynamoDB deep dive: Advanced design patterns (DAT403-R1)

Captions
Welcome, everybody, and thanks for taking time out of your evening here in Las Vegas, where there are many other things to do, to listen to me talk about data modeling and NoSQL. Who knew that topic would be so popular? My name is Rick Houlihan, and I'm a principal technologist for NoSQL at Amazon Web Services. I've done a lot of NoSQL data modeling over the years: I worked on the specialist team, helped build the North American practice, and worked on the internal blackbelt team that helped migrate something like 12,000 Amazon services, some number in that neighborhood, over several years from Oracle to NoSQL. How many people have heard of that migration? We've shut down 3,000 Oracle instances. All right, a few of you out there. In the course of doing all that we learned a lot, and that's what I'm here to talk about tonight. Some of it might be surprising to some folks and less so to others; we covered a little of this last year, and this time we're going to get deeper into the complex modeling you can do with NoSQL that people might not be aware of.

The first thing I like to cover in these sessions is a brief history of data processing: why are we even looking at NoSQL? A lot of you have already heard this, but some of you haven't, and the reason I repeat it is that if you want a message to be understood, it has to be repeated and broadcast. So for those who've heard it, give me a minute or two; for those who haven't, hopefully you'll get something out of it. Then we'll do a brief overview of DynamoDB, more from a structural perspective: what is NoSQL? I'm not going to talk much about APIs, features, and functions; that's not what we're here for. We're here to talk about NoSQL data modeling: how do I actually model data, with real apps, in a way that makes sense and works for my application and works for me from an economic-efficiency perspective? Most of the examples I use are synthetic, but they use real access patterns, patterns from real application services, and I'll call those out as we go. Then we'll get into some common patterns and talk about real-world applications we've built that exercise those access patterns.

So, the history. Why are we looking at this new technology when relational technology has worked so well for so long? If the history of data processing were laid out for us, it would be a series of peaks and valleys in data pressure. Data pressure is what we're asking the system to do versus how fast and how cheaply it's happening; if it's not happening fast enough or cheap enough, we have a high-data-pressure situation, and that's a technology trigger: we invent new things to fix it, and we have, over and over. The first database was the one between our ears. It's a great database, highly available (when my eyes are open, it's online), but not exactly durable, its fault tolerance is questionable, and it's a single-user system; not something I'm going to build an enterprise database on. So we had to learn how to do different things, and we learned to write things down. We created ledger accounting, the first method of storing structured data, and it ran public- and private-sector workloads for millennia, until the 1880 US Census came along and a man named Herman Hollerith was tasked with processing all the data that was collected. It took him eight years of the ten-year cycle using ledger accounting, and he realized he had to do something new. He did: he invented the machine-readable punch card and the punch-card sorting machine, and in 1890 it took a lot less time.

That was an interesting turning point in the evolution of data processing. We started inventing many technologies: paper tape, magnetic tape, distributed block storage, random-access file systems. And when we got to the 1980s, it's very important to understand why we invented the relational database: to decrease the pressure on the storage subsystems in the data center. It uses a normalized data model, and that normalized model deduplicates the data on disk, which is nice and efficient when your storage is expensive. We didn't much care about CPU in 1980. Fast-forward thirty or forty years and the opposite is true: when I deploy applications in a data center today, storage is pennies per gigabyte and CPU is dollars per CPU-minute. I don't want technologies optimized for yesteryear's constraints; I want to optimize for today's, and that's why we're going to the NoSQL database.

Now, when you adopt new technologies, you need to understand how to use them before you deploy them, or you will not have a good experience. We found this out very quickly at Amazon as we started to migrate our workloads. We saw what we thought was great success, then we moved bigger workloads and found out that it's really easy to do small things badly and think you're doing okay. As you move bigger workloads you discover it's not working well, it's not cheaper, it's not faster, and when you analyze the system you figure out you didn't know how to use the new technology; you tried to use it the same way you used the old one. In NoSQL, that's about data models: if you use a normalized data model, you are going to have pain at scale when you expand and scale out your operations.

I really like the adoption-curve chart because it tells this story. In the beginning there's a technology trigger; people are running around trying to solve a problem, in our case data pressure. We invent the new technology, some people have success, others run to it and have a miserable experience because they haven't figured out how to use it. As the skills evolve and distribute across the market, we have better experiences, because people understand how to use the technology. Today we all understand how to use relational technology; tomorrow we will all understand how to use NoSQL.
Modeling data in NoSQL is not going to be any harder tomorrow than modeling your relational data is today. I hear this all the time: "it's so hard to use these models." No, it's not. It's just a different way of thinking about your data, and hopefully this conversation helps you get there.

When we talk about NoSQL, it's important to understand it's not good for everything. It's good for a certain class of applications: the ones with repeatable access patterns. The reason is that I need to denormalize my data model to simplify my access patterns (we'll get into this when we talk about modeling), and that denormalized model is very tightly coupled to the access pattern. With a relational database, the ad hoc query gives me flexibility: if I don't understand how I'm going to access the data, an ad hoc query engine can be very beneficial, and that's really an OLAP workload, online analytics. That is not a good workload for NoSQL, so understand the difference. What is good for NoSQL is the 90% of applications we write to support common business processes, which are OLTP applications. That makes NoSQL one of the most relevant technologies you can learn as a developer going forward.

All right, DynamoDB. We're not going to go too deep into the top-level pitch. It's a fully managed NoSQL database, and that's the biggest value-add: it's cloud native. You don't spend resources managing nodes, rebuilding storage devices, replicating to new shards, or figuring out how long it's going to take to get a shard online, and believe me, that can take a long time. We have some of the largest customers in the world on legacy NoSQL technologies working with us now because it literally takes them months to add new shards. On a MongoDB cluster, once you get to 15 or 20 shards and need more, it starts taking weeks or months to get new capacity online; that's a common problem at scale. Not only that, the resources you dedicate to managing your NoSQL cluster can be much better spent building your business. A cloud-native NoSQL database (we'll get into the performance aspects) has advantages you're just never going to realize by rolling your own. It's fast and consistent at any scale and scales to support any workload: we have examples of single tables in excess of seven million transactions per second, and I challenge you to bring me another NoSQL database that's scaled out to that level. The Amazon CDO tables did 45.4 million requests per second on Prime Day 2019, and I heard we shattered that record over Black Friday and Cyber Monday; I don't know what those numbers are yet, but they're massive. You just don't see this from the legacy NoSQL providers.

When you get into modeling data in NoSQL, there are a couple of constructs to understand. The first is the table: it's an object repository, where I'm going to push items. Items on the table must have a single attribute called the partition key; no other attribute is required. In that configuration, what I have is a collection of objects uniquely identified by their partition keys, and it supports a key-value access pattern: if I query this table with partition key equals X, I get a single object back. This is quite similar to a document database: in a document database I define a collection, I push documents into it, those documents have _id as a unique attribute, and I query the system on a key-value access pattern, give me the document with an _id value of X.

I'll talk more about the difference between document models and wide-column models later, but one of the biggest differences is the ability to add a sort key to the table. Now the partition key defines a collection, a group of objects, and within that partition the sort key uniquely identifies each item, so I can use range queries to selectively pull objects out of the partition. For example: the partition key is a customer ID, the sort key is an order date, and I query for a customer's orders by date, say everything in the last 30 days. A date-range condition of "greater than 30 days ago" plus a partition-key condition of "customer equals X" brings back that customer's list of orders; I could have a thousand orders in that partition, but only those orders come back. I can also put other kinds of items in there; these don't have to be homogeneous collections, they can be heterogeneous. What we've really built is an object repository where the primary table holds aggregations of objects that satisfy a certain set of access patterns. In essence, we've joined objects into a partition, and we're going to decorate these objects with additional attributes that we index to create additional aggregations, or groupings, of objects. So what I'm really describing is the ability to join objects. It's not that we can't join in NoSQL: we model the join. We model our data so these objects group into partitions on the table and on the GSIs, and we decorate the objects with common attributes that tie them together when I query on various access patterns. This is not so different from how a relational database joins data that isn't indexed: it builds a hash table and does lookups against that hash table as it iterates the outer collection. We're just storing the data in that distributed hash table structure in the first place.

So, inserting multiple item types into the table: here's an example where the partition key is a customer ID and the sort key is indexed by time. There are several items in the partition; some have a time prefix on the sort key and some don't. If I query this partition for everything greater than 30 days ago, I get all the orders and all the order items as well as the customer's metadata, because the metadata item's sort key satisfies that condition too.
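A minimal sketch of that single-partition query, using boto3. The table name, the PK/SK attribute names, the key prefixes, and the assumption that the customer metadata item's sort key sorts above any ISO-8601 timestamp are all illustrative, not from the talk's slides.

```python
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("OnlineShop")  # hypothetical table name


def get_customer_activity(customer_id: str, days: int = 30):
    """Return the customer's metadata item plus all orders and order items
    from the last `days` days, with a single query against one partition."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"CUSTOMER#{customer_id}")
        & Key("SK").gt(cutoff)
    )
    # The metadata item is assumed to use a sort key like "CUSTOMER#..." that
    # sorts after any timestamp, so the same range condition picks it up too.
    return resp["Items"]
```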
So I don't have to query multiple tables. This is the crux of NoSQL design: create collections of multiple types of objects that I can retrieve with single queries. I don't have to issue a query against the customers table for the customer metadata, a query against the orders table for the orders, iterate through the orders result set, and then go get all the items for those orders from yet another table. That's an amazingly complex workflow, and I avoid it by grouping these items into a partition. These are the kinds of games we're going to play to satisfy those queries with a much simpler access pattern.

Indexes are how we support secondary access patterns. You can think of an index in DynamoDB as a table that is replicated as you insert and change data on the parent table; we take care of that replication, and there's a 100 percent SLA guarantee on GSI replication. GSIs are eventually consistent, but they're about 10-millisecond latency at p99 when properly configured, so it's a very good system for secondary access patterns. With some off-the-shelf NoSQL databases you can't really do this: Cassandra, for example, will tell you not to use its indexes because there's no way to maintain consistency on them. DynamoDB indexes are guaranteed to converge, and that's one of DynamoDB's biggest value-adds over other NoSQL providers.

The other thing DynamoDB indexes give you is projections. If an access pattern needs to identify the matching items on the table and also needs some of the other attributes those items contain, I can choose to project all, some, or none of those attributes onto the GSI. Remember that when you do this you're increasing storage cost: if I project all the attributes, I've essentially duplicated the data, doubling the storage cost and doubling the WCU cost of every insert to the table. So once you understand your access patterns, control cost by controlling what you project. If I only need to know which items matched, project keys only; if I need additional data, project what I need but no more.

Partition keys are what NoSQL uses to spread the data out. DynamoDB builds an unordered hash index, spreads the items across an arbitrary key space, and chops that key space up as it needs to add capacity, either storage or throughput. This matters because an individual storage node only has so much throughput, so when we design for DynamoDB, or any NoSQL database, we need to get the workload spread out if we want the system to scale in throughput. One way to do that in DynamoDB is a technique we call write sharding, which we use for highly dense aggregations.

Take our customer/orders data and say I want a secondary index that gives me all the orders that were made from a given source. Think about Amazon retail asking for all orders sourced online: that can be a very dense aggregation, and in DynamoDB you only get about 1,000 WCUs per second into a single partition. If all I'm doing on the GSI is using the source as the partition key, the only way to increase throughput is to create additional partition keys by salting the keys I'm indexing. In this case I'm aggregating the orders by source and the items by product ASIN, and I'm after additional data: the state of these products in shipment, and the source of the orders, to support additional access patterns. On the read side, behind some sort of data-layer API, I build a parallel process that queries across those salted partitions, assembles the data, and hands my clients a nice collection, rather than requiring every developer to understand how things are write-sharded. I do this at the data layer because, often, only some of my aggregations or customers need write sharding, not all of them. If you look at AWS configuration tables, sometimes you have to come ask us to increase the number of configuration items, or IAM policies, or whatever it is you're using, and some of those limits exist because beyond a certain point we'd exceed 1,000 WCUs on one of our indexes, or even on the primary table, based on estimated write traffic, and we'd need to write-shard. So conceal all of this from your users behind a data-layer API that handles the write sharding: users provide some conditions, and the API writes to the right shard or retrieves the data from multiple shards.

Just like on the table, we want to use generic keys, because generic keys let us insert multiple item types into the GSIs. On the table we have multiple types of partitions and multiple types of items, and the way we make that happen is with generic key attributes called PK and SK. I showed you the two patterns I want to index: orders by source, items by state. I'd like to reuse a single GSI for both access patterns, because if I create two GSIs I have to allocate two capacity buckets and set up two sets of alerts and alarms, and I can't reuse the capacity allocated to one for the other, even though the access patterns are probably not happening in parallel. It's more efficient to reuse the GSI, so I overload a lot of GSIs to reduce the number of constructs I have to maintain in production. When I query the GSI, it's the partition-key and sort-key conditions that define what comes back, so I don't have to worry about those object types colliding. If they did have the potential to collide on the partition key, I could prefix the partition key with the item type to avoid it, but in this case I don't, because there's no chance of collision between an ASIN and an order source.
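A rough sketch of the write-sharding idea, hidden behind a small data-layer function as described above. The table name, GSI name, key names, and shard count are assumptions for illustration; a real data layer would also parallelize the per-shard queries and handle pagination.

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

TABLE = boto3.resource("dynamodb").Table("Orders")  # hypothetical table name
SHARDS = 10  # enough shards to stay under ~1,000 WCU per GSI partition


def sharded_source_key(source: str) -> str:
    """Salt the GSI partition key at write time so a dense aggregation
    (e.g. every order sourced 'online') spreads across many partitions."""
    return f"{source}#{random.randint(0, SHARDS - 1)}"


def orders_by_source(source: str):
    """Scatter-gather read: query every shard of the aggregation and merge."""
    items = []
    for shard in range(SHARDS):
        resp = TABLE.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(f"{source}#{shard}"),
        )
        items.extend(resp["Items"])
    return items
```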
When I query with a source as the partition key, I get the orders from that source; when I query with an ASIN as the partition key, I get the order items containing that product. Depending on the use case and the access pattern, the keys define what comes back.

When it comes to scaling NoSQL, again, it's about getting the data spread out. When you don't, this is what happens: you get what we call hot keys. This is a heat map; we now have a CloudWatch feature that lets you look at the keys that are under pressure (you don't get heat-map views quite like this, but you can see the hot keys). Here we have storage nodes on the y-axis and time on the x-axis, and you can see node number 11 is doing a whole bunch of work while everybody else does nothing. We want to avoid patterns like this; it's an anti-pattern, a high-velocity access pattern firing on a small number of keys or a single key. The way to get the most out of DynamoDB, or out of any NoSQL database for that matter, is to spread the workload out, so that as we scale the system we get more and more shards, more and more storage nodes participating in the work, and our requests line up nicely over time. Sometimes that's not how things work, we get thundering herds, and we just had Black Friday and Cyber Monday, so we know a little bit about that. If you do it right, you get pictures like this: a great heat map from about 450 servers pumping 40,000 to 50,000 WCUs consistently over time, with the access pattern lighting up the whole key space like a Christmas tree. It looks like the white noise on an old TV between channels; I'm an old man, I can't help it.

These are the things that happen when you go to cloud-native NoSQL services: you get just-in-time provisioning of your tables in both dimensions, storage and throughput. On the left-hand side of this chart is your relational database, and it's also your legacy NoSQL technologies, MongoDB, Cassandra: you provision for peak and hope you consume a high percentage of that capacity. The reality is that average data-center utilization in the enterprise is around 10 percent; if you do better than that over time, you're doing really well. So you're wasting a lot of money. On the right-hand side is DynamoDB, cloud-native database technology with auto scaling. This is a real service that runs in one of our fulfillment centers, and the difference before and after they turned on auto scaling is night and day; they're saving around 40 percent, and I think most of our fulfillment-center applications are saving around 40 percent. It works very well when you have that nicely shaped demand curve, and we have other pricing models if you don't, but I'm not going to spend much time on pricing.

I do like to talk about performance, and performance at scale, because nothing else performs like DynamoDB or other cloud-native NoSQL databases. What other database gets faster when you throw more load at it? That's counterintuitive. This is a synthetic workload ramping up to over a million requests per second, and what you can see is consistent low latency at peak utilization. The reason is that we have a massively distributed request-router fleet, and a request router needs to know about your table to service your request. If it doesn't, it has to hit a configuration service to get your policies, your storage node locations, and all the other information it needs to process the request. Over time, as that information caches in the request routers, we don't have to go back for it, so the system actually gets faster, and that happens over and over again. We have customers come to us asking whether it's real, because their APM instrumentation is telling them things they don't believe, and we can say yes, it's a common phenomenon.

The other thing we see with DynamoDB that's really amazing is the ability to deliver more than you have provisioned. We have this thing called the burst bucket. Here's an example from a messaging service provider around the Super Bowl: they had close to 6 million WCUs provisioned, and when the game ended they peaked at over 10 million WCUs. They didn't notice until the next day, when they looked at their CloudWatch metrics and asked how they hadn't throttled. The reason is that DynamoDB keeps roughly a five-minute burst bucket: you can go to this database and say, "I know you weren't busy five minutes ago, can you give me that capacity now?" What happens with MongoDB when you do that? What happens with Cassandra, or with your relational database? It goes offline; it goes bye-bye, and your response times go through the roof as you put memory pressure on those systems. Not so with DynamoDB: it's built to be elastic, to buffer that throughput and give it to you when you need it.

Okay, let's talk about data modeling; that's really what we're here for. We've covered some basics, but data modeling is really about relationships. Relationships drive every application I've ever worked on; I've never been able to build an application without understanding the entity-relationship model, and that doesn't change when you get into NoSQL. What changed was us thinking we could create these big giant blobs of data that hold those relational structures, the idea that giant JSON documents are a good idea. Big documents are bad ideas, and we'll talk a little about that as we get into the modeling. It doesn't matter what type of application you're building: social networking, document management, IT systems management, anything you work on requires you to model relationships. Even a key-value access pattern is a relationship: a one-to-one relationship between the key and the values associated with it.
And if there really is non-relational data out there, please tell me, because I haven't found it. As a matter of fact, I would love a non-relational banking application, because I would log in over and over again until I saw numbers I liked, and then I'd write my checks. Hopefully it's Jeff Bezos's account; I'd be really happy with that. All right, maybe not.

This is what we've done when we model relational data: we use a normalized data model. Here's an example I've been using for years, a simple product catalog. There's a one-to-one relationship between products and books, albums, and videos, one-to-many between albums and tracks, many-to-many between videos and actors, and the various queries we have to run are quite complex; we'll dive into that in a second. With a relational database we'd use ad hoc queries to join these tables and a lot of CPU power to assemble the views we're looking for. With NoSQL, a simple approach is to pre-assemble the hierarchies of data we commonly query, taking the rows of those relational tables and sticking them into documents, or into partitions, as we'll demonstrate, which gets very interesting in how we access them.

But first look at the complexity of those relational queries. Populate the tables and run them: the first query exercises a single one-to-one relationship, which isn't terrible; the time complexity is something like N log N, or N log M, I can't quite read it without my glasses. As you add more tables to the join, that time complexity just gets worse. There's no adjusting it, no fixing it, and that's assuming we have all the proper indexes and we're running nested-loop joins, not falling back to terrible things like merge joins and hash joins. Even doing things efficiently, the complexity goes up and up with every table I join in.

NoSQL, on the other hand, is basically built on a distributed hash table, so NoSQL queries are, in theory, close to constant time. Take all those same joins and create partitions that more or less pre-join the data. Think about the worst case for a relational database: it does a hash join. It has a large input set and a large table, so it table-scans, builds a hash index on the join dimension, then iterates the input collection and queries that hash table repeatedly to pull out the matching items. Well, why not do that first? The database obviously believes a hash table is the most efficient data structure for repeated lookups, so let's just store the data that way. That's really what we're doing here: creating partitions that hold the rows, the items, from our relational structures. When I query for a book, it's constant time. When I query for an album by the album title, again constant time, no sort condition; I'm just telling the database, give me everything with this partition key.

Give me everything with the movie partition key, and it gets more interesting, because I have both movie partitions and actor partitions, and the movies are related to the actors through items sorted on the actor's name. This is a graph, a directed graph. The movie partitions are movie nodes, and inside each movie partition are items that describe the relationships between the movie and the various actors. The movie partition understands what it's related to; in this structure the actor doesn't yet, but we'll get to that. When I query the movie partition, I get a set of items: the first describes the movie itself and is sorted on the director's name, and then there's the list of actors. From that I can put together a summary view that shows the movie and the actors in it: their names, their birth dates, their genders, what role they played. I project the birth dates and genders onto those edge items because that's relatively immutable data; it doesn't change (I wish I could change my birth date, but that's not happening). It's okay to project immutable data onto these edges to make the read more efficient; I'm optimizing for the read. But I won't project the bio, because the bio changes, and I don't want to have to update it across all of the actor's roles. So I'm using those edges to tell me things about these actors.

Now create an index and flip the partition key and the sort key around: the sort key becomes the partition key of the GSI, and the partition key becomes its sort key. Now I can query on additional dimensions: all the books an author wrote, all the albums a song shows up on, all the movies an actor has been in. What have I done? I've completed the many-to-many relationship. It's now an undirected graph; both sides understand what they're related to. I query the table for one side and the GSI for the other. This is how we model many-to-many relationships in NoSQL and avoid massive duplication of data: the only data I duplicate is what I denormalize onto the edges, and I decide what to project there. Oftentimes those edges carry the extended metadata that describes the relationship: what is the role, what are the permissions, what does this person or this object have in my system with respect to the thing it's attached to. I can query for the movies a director has directed, the albums a musician has produced, and so on. That's the primary way we model: one set of access patterns on the table, another set on the indexes.
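A minimal sketch of that adjacency-list pattern: movie partitions hold edge items sorted on the actor's name, and a GSI with the keys inverted answers the reverse question. The table name, index name, key names, and attribute shapes are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ProductCatalog")  # hypothetical


def add_role(movie: str, actor: str, role: str, dob: str):
    """Write one movie->actor edge; immutable facts (DOB) are denormalized onto it."""
    table.put_item(Item={
        "PK": f"MOVIE#{movie}",
        "SK": f"ACTOR#{actor}",
        "role": role,
        "dob": dob,
    })


def actors_in_movie(movie: str):
    # Table side of the many-to-many: one partition, all the edge items.
    return table.query(
        KeyConditionExpression=Key("PK").eq(f"MOVIE#{movie}")
    )["Items"]


def movies_for_actor(actor: str):
    # GSI side: the same items, indexed with the sort key as the partition key.
    return table.query(
        IndexName="GSI1",  # assumed to be keyed on (SK, PK)
        KeyConditionExpression=Key("SK").eq(f"ACTOR#{actor}"),
    )["Items"]
```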
Okay, let's get into documents versus wide column, and this is really important, because there's a lot of hype out there, a lot of people saying these are different things with different use cases for document and different use cases for wide column. I'll tell you, when you actually get to modeling the data, there isn't a lot of difference; in fact there's none, and we're going to get into that.

Take the document model: an employee item, indexed on a couple of dimensions. In a document database, _id is the default index, and we can query it with a key-value access pattern. Use a meaningful value for _id: in MongoDB, if you insert an object without an _id, it gives you a UUID, and for most practical purposes that gives you an index you'll never use. Maybe some applications care about an object referenced by UUID, but nobody calls customer support and says "I'm customer <UUID>"; they say "I'm so-and-so," and "what's your email address" is the first question. So use something meaningful as the primary key. Then we create a secondary compound index on building.floor to support subtree aggregations of our employees, which is what this example is trying to produce: query where the building ID equals X and the floor begins with, say, 7, give me everybody on the seventh floor of this building, and so on.

Now take that data model and move it into a wide-column structure, and you can see how "different" it is; really it's just a different way to display the data. As a matter of fact, when you query DynamoDB (how many folks have queried DynamoDB?), what comes back is JSON. It's just a data structure. Data structures can be flat like this, or they can be hierarchical: DynamoDB supports JSON document attributes, so you can have hierarchies of data. In fact, the fastest way to query your DynamoDB table is to put your un-indexed attributes inside a JSON attribute. If you have lots of attributes in your documents, create one big JSON document, insert it as a single attribute, and project up to the root of the item only the attributes you need to index, because we can't index attributes inside a JSON object; DynamoDB treats that JSON object like a blob, a single attribute. The storage engine has to store each top-level attribute individually, so if you split everything out into many attributes it actually takes a little more time; there's a blog post out there you can look at.
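A small sketch of that tip: keep the attributes you index at the root of the item and tuck everything else into a single document attribute (a map). The table name, key names, and employee shape are assumptions for illustration.

```python
import boto3

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical

employee = {
    "PK": "EMPLOYEE#jsmith",             # indexed: partition key
    "GSI1PK": "BUILDING#SEA33#FLOOR#7",  # indexed: building/floor aggregation
    "doc": {                             # everything else, stored as one map
        "name": "Jane Smith",
        "title": "SDE II",
        "phones": ["+1-206-555-0100"],
        "skills": ["python", "dynamodb"],
    },
}
table.put_item(Item=employee)
```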
Getting into the differences, then, what we're really talking about is indexing and how we query and write the system; the difference is in that dimension, and it has nothing to do with the data model. In a document database there's a default index on _id; in DynamoDB the partition key defines that default index. A document database has a query planner that chooses the index for you. When I was at MongoDB I used a lot of $hint on my queries, because I don't want the planner using indexes I didn't intend, and I certainly don't want it collection-scanning for me. When you query a document database and run an explain, you'll see two very relevant metrics: documents returned and documents scanned. If documents scanned is zero and documents returned is greater than zero, your index covered the query: every attribute you needed was on the index. If documents returned and documents scanned are equivalent, it means the index found the documents, but the engine had to go back to the collection to get the extended attributes. Remember what I said about projections: to project attributes in a document database, you have to include those attributes in the index definition, and that makes your inserts more expensive. That's one of the nice things about a wide-column database like DynamoDB: with projections on the index, the inserts don't become more complex; they may take more WCUs, but they don't cost more CPU.

When you define indexes in a document database, you'd better make sure the shard key is the first attribute of the index, because that's what's used to route the query, and you'd better include a shard-key condition when you query, otherwise the query is sent to every node. You may occasionally want that, but most of the time you don't; most of those queries come back empty and it's just wasted bandwidth in the system. So you define compound indexes with the shard key first and include the shard-key values in the query to isolate it to the right nodes. With DynamoDB, I design the index for a specific access pattern and I use a partition-key equality condition every time I query that GSI or that table, which routes me exactly to the right storage node. I don't have to remember to do it right; the system forces me to. So: optimize with compound indexes in a document database, use projections to preload your indexes in DynamoDB. Those are the differences; there are no others. I'm going to model the data the same way either way: create the objects, index them, store them as documents or as wide-column data structures, one or the other, but that's what I'm really doing with NoSQL. Anyone who tells you there's a fundamental difference between these systems is wrong; there just isn't.

Now, complex queries. This is about the KPIs and metrics that are useful to the application, because most applications need things like top N, last N, averages, counts, sums: the kinds of computed results that aren't really supported in the DynamoDB API. How do I support those queries? With change-data-capture processing. I create a pipeline off the DynamoDB stream, which is a running change log of the write and update operations that occur on the table; every time you write to the table, you see that object on the stream, and we can process it with Lambda. People do lots of different things with this, but the primary one is roll-ups: summary metrics, KPIs, the things the application actually needs. One of the things we found at Amazon, back when we were still working out how to scale our relational databases, before we'd even committed to the migrations, was that we were offloading the cost of these complex queries by maintaining summary aggregation tables. Once we started doing that, it was a clue: if it's too expensive to run the query to get the count, the summary, the average, the max, the min, and we're maintaining them as running aggregations anyway, maybe what we should be doing is NoSQL. And that's what we do here; this is how you get aggregation at scale. Anyone using the aggregation framework in MongoDB or a document database is going to find really quickly, as the database scales, that those queries don't scale with you; you end up remodeling your data and implementing change-data-capture pipelines like this to maintain those summary metrics and summary aggregations.

The other thing people do with the stream is sync to Elasticsearch. If you have indexing requirements DynamoDB doesn't support, index intersections, geospatial, full-text, use change data capture with streams and Lambda. One of the nice things about being cloud native is that this is a guaranteed contract between the table and the Lambda process: when you use Lambda to process the stream, the trigger gives you at-least-once processing. Make sure you understand that it's at least once, and code your Lambda functions to be idempotent, because there are rare conditions, a container failure in the middle of processing, say, where I've just processed an item but haven't updated my bookkeeping yet, and the function fires again and picks up where it thinks it left off. With that understood, you can feed the data into Kinesis Firehose, roll it up into Parquet files in S3, and run Athena queries on it for near-real-time analytics; a lot of customers do exactly that, and it's a wonderful solution. It gets your data off the table asynchronously, so none of this impacts your DynamoDB table. It's like the world's best stored procedure engine: you're not going to knock the server sideways by running a bad stored procedure. (Someone says I just get paid more that way. No; a lot of my job is spent trying to get people to pay less, and I think I'm pretty good at it.)
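A bare-bones sketch of that stream-to-roll-up pipeline: a Lambda handler reads the DynamoDB stream and maintains a running order count per customer. The table and key names are assumptions; real code would also handle MODIFY/REMOVE events and would likely wrap the marker and counter updates in a transaction so they succeed or fail together.

```python
import boto3
from botocore.exceptions import ClientError

summary = boto3.resource("dynamodb").Table("OrderSummary")  # hypothetical


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        customer = record["dynamodb"]["Keys"]["PK"]["S"]

        # Idempotency guard: remember which stream records we've processed,
        # because Lambda gives at-least-once delivery, not exactly-once.
        try:
            summary.put_item(
                Item={"PK": f"PROCESSED#{record['eventID']}", "SK": "MARKER"},
                ConditionExpression="attribute_not_exists(PK)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # already counted this record on a previous attempt
            raise

        # Roll-up: bump the running count for this customer.
        summary.update_item(
            Key={"PK": customer, "SK": "ORDER_COUNT"},
            UpdateExpression="ADD order_count :one",
            ExpressionAttributeValues={":one": 1},
        )
```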
All right, let's talk a little about composite keys, an important part of data modeling in NoSQL: it's about creating sort keys that matter. In this example we have a game-sessions table; users have game sessions, the sessions have different states, and we're looking for the pending sessions for a given user, sorted by date. We might have an index partitioned on the user and sorted on the session date, with essentially no sort-key condition, and then a filter condition. Filter conditions in DynamoDB apply to the non-key attributes of the items; here we're saying "where the status is pending." What happens is that we read three items and return two: I pay to read three. In this case reading three costs the same as reading two, so the filter condition is fine, but if I had 10,000 items in the partition and only wanted those two, it would be a pretty inefficient read: I'd pay to read a whole bunch of items and return only two.

The way to fix that is a composite key. Every time we insert the item, we take the status and the date and create a new sort key, status_date. Now when I query the table I can say begins_with "pending" and I get only the pending items. We use a lot of tricks like this to create selective reads out of very dense partitions: there may be a lot of items in a partition, but the reads are thin, so it's okay. I have a lot of people say, "aren't we not supposed to put lots of items in a partition?" No, you can put all the items you want in a partition. What matters is velocity: how fast am I writing, how fast am I reading. In DynamoDB a single logical partition can take about 1,000 WCUs (a WCU is a 1 KB write per second) and about 3,000 RCUs (an RCU is a 4 KB read per second), so I can effectively write and read at those rates; just be aware of that as we start to lay the data out.
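A quick sketch of that composite-key trick: at write time we concatenate status and date into one sort key, so a begins_with key condition returns only the pending sessions with no filter and no wasted reads. The table name, key names, and date format are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

sessions = boto3.resource("dynamodb").Table("GameSessions")  # hypothetical


def open_session(user_id: str, session_date: str):
    sessions.put_item(Item={
        "PK": f"USER#{user_id}",
        "SK": f"PENDING#{session_date}",  # composite status_date sort key
        "status": "PENDING",
        "session_date": session_date,
    })


def pending_sessions(user_id: str):
    # Selective read: only the pending items come back from the partition.
    return sessions.query(
        KeyConditionExpression=Key("PK").eq(f"USER#{user_id}")
        & Key("SK").begins_with("PENDING#"),
        ScanIndexForward=False,  # newest first
    )["Items"]
```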
What we're going to talk about next is how to model genuinely complex relational data, because I hear this a lot: how do I do this without a normalized data model, without lookup tables, without all the constructs we're so used to using? The reality, as I showed you, is that the time complexity of those queries is too high; we don't want those data models. And if it's that hard for a relational database to do, how hard is it going to be for your app server? Think about traversing tables and iterating result sets at the application layer; it's going to be ugly and grossly inefficient. That's exactly what we found out at Amazon, so learn from our mistakes and look at how to model that relational data in a way that makes sense.

The application here is an employee portal. Last year my example was a delivery service with about a dozen access patterns; this one has 23, just to raise the bar. It also answers another question I get a lot: how do you evolve the application, isn't it hard to add new access patterns? This model started with that employee table, the document model we saw earlier, and I kept adding things: tickets, projects, a whole bunch of different entities, until we're accessing this data on multiple dimensions in multiple ways, 23 access patterns in all.

On the table we created many types of partitions and loaded them with different types of items. Building partitions hold meeting reservations and room definitions. Employee partitions hold meeting objects (the meetings employees get invited to) along with the employee metadata. Project partitions hold the time cards employees submit against the project, including the roles they had and the work they did, plus the project metadata. And ticket partitions hold the item created when the ticket is opened, the messages associated with the ticket, the owners, and the assignees. We're going to access all of this in multiple ways; the table itself supports nine access patterns.

The first is: get the employee's metadata, or get the employee's meetings, by employee ID. I query the table by employee ID with a date-range condition, plus a filter condition that knocks out anything that isn't a meeting, because there might be other date-ranged items in the partition. Most employee partitions won't have many items, so that slightly less selective read is fine. The next is: get all the meetings for a given building during a given time range. We select by building ID in the building partition, again with a date-range condition, and we use a contains condition on an attribute that's a copy of the sort key, because the DynamoDB API doesn't let you use contains on the sort key itself; if you need that condition, keep a copy of the sort key and use contains on it as a filter condition. That copy contains the building, the floor, and the room we're looking for. Next: get the employee's metadata by employee ID; every employee has a metadata item whose sort key starts with "E", so query by employee ID, begins_with "E", and that's the access pattern. Next: get the ticket history; select from the ticket partition and you get the entire history for that ticket. Then project information: select by project name with a sort-key-equals-project-name condition, which returns only the project metadata item, or query the project partition with a date-range condition to bring back all the time cards filed on that project by different employees. We can also add a filter condition on role and run the same query to say, give me everybody who was a technical product manager on this project in this time range, and so on.
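A sketch of those project access patterns: a key condition handles the date range, and an optional filter expression narrows to a role when needed. The table name, key names, and attribute names are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("EmployeePortal")  # hypothetical


def time_cards(project, start, end, role=None):
    """Time cards for a project in a date range, optionally filtered by role."""
    kwargs = {
        "KeyConditionExpression": Key("PK").eq(f"PROJECT#{project}")
        & Key("SK").between(start, end),
    }
    if role:
        # The filter runs after the read, so we still pay for the items it
        # drops -- fine here because these partitions are read thinly anyway.
        kwargs["FilterExpression"] = Attr("role").eq(role)
    return table.query(**kwargs)["Items"]
```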
The next access pattern is interesting, and a lot of customers use it; we use it internally for reservation systems. In these use cases we know which reservations have been made and which rooms exist, but we don't know what's actually available until we make that calculation. So in this reservation workflow, the client comes in and the first query goes to the building to get the rooms in the building; as soon as I open the portal to reserve a room, I've already sent down the room metadata. Then the user selects a time, and instead of computing availability on the server, I fetch the reservations for that building and floor during the selected time window and let the client figure out which rooms are free, because it already has the room metadata. Both queries are very small. Several customers who've done this at scale have come back and said they were able to shrink their application-server fleet, because the app servers aren't triaging that data anymore; they shove it all down to the client and let the client figure it out. Clients have an abundance of CPU. I'll show you another use case later, a versioning workflow, where we send just the deltas to the client instead of round-tripping back to the server for older versions. The idea is to push that kind of logic down to the client rather than into the application server, because the application server is the expensive place to do it. The remaining table patterns support exactly that: get the room metadata, and get the reservations for a given building and room.

The next thing we do is index these items, and you'll notice the indexes get smaller and smaller, because some items on the primary table only ever get accessed on one dimension, the primary access pattern they're stored under, while other items need to be indexed. The first index, GSI1, supports an additional nine access patterns. The first is: get all the employee's meetings by email; one pattern was by employee ID, this one is by email, so I index all those meeting objects on GSI1 using the email. The next is the primary use case: I log into my portal, and the portal shows me everything for the last 30 days: all my meetings, my time cards, my tickets, my messages on those tickets, everything. I query my employee partition on GSI1 by email, say give me everything newer than 30 days ago, and it brings back everything, including the employee metadata. One query, five different access patterns solved with one single query. That's what we're trying to do with a NoSQL database: create queries that retrieve multiple types of objects at once, and when that's required, group those objects into partitions that support the join, because that's what these are; this is flat-out a join. The next patterns: get the employee metadata by employee ID or by email, and get the ticket history by employee email; all the ticket objects are stored against the employee's email as well, so we can query on that dimension.
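A sketch of that "one query, many objects" pattern: everything the portal needs for the last 30 days comes back from a single GSI query and gets dispatched by item type on the way out. The index name, key names, and the item_type attribute are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("EmployeePortal")  # hypothetical


def portal_home(email: str, days: int = 30):
    """One query returns meetings, time cards, tickets, messages, and metadata."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    resp = table.query(
        IndexName="GSI1",
        KeyConditionExpression=Key("GSI1PK").eq(f"EMAIL#{email}")
        & Key("GSI1SK").gt(cutoff),
    )
    grouped = defaultdict(list)
    for item in resp["Items"]:
        grouped[item.get("item_type", "unknown")].append(item)
    return grouped
```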
All right, the next access patterns are against GSI2. We're delivering a smaller number of access patterns here because fewer objects actually need to be indexed on these partitions. The first one is getting the tickets by assignee email. We already get the ticket history by the owner's email — the person who opened the ticket — and now I also want the ticket history by assignee email. So every ticket object I insert, whether it's the opening of the ticket or a message associated with the ticket, gets decorated with additional attributes so I can pick it up both by owner and by assignee. I also want to get all the employees for a given manager, so those employee metadata items pick up another indexed dimension — the manager's ID — and I query GSI2 with that.

Then we get to the last index we're going to talk about; there are just a couple of use cases on this one. The first query is: get all the employees in a given city, building, or floor of a building. This is the subtree aggregation use case we talked about earlier, and I see it a lot — get all the products in a given category, get all the employees in a given place. We just create a path representing the hierarchy and use begins_with on that path to get those subtree aggregations.

We also want to get all the tickets that are languishing. There are a lot of tickets in the organization, messages are landing on those tickets, and I want everything that hasn't been touched in the last 24 hours so it can be escalated. For that I'm going to create a write-sharded set of partitions: you can see this ticket lives in partition 7, and there might be 20 partitions here to increase the throughput, depending on the number of tickets my system is processing. A single partition gives me about 1,000 WCUs, so if I spread the writes across ten partitions I can write about ten thousand tickets per second — a significant amount of workload — assuming the tickets aren't bigger than one kilobyte. A sketch of both of these last-index queries follows below.

In the end, what we end up with is a table that describes the access patterns we were tasked with, the query conditions I need, the table or GSI I'm going to query, and the filter conditions to use. This is exactly the exercise I go through with every team I talk to: create that entity relationship diagram — what kind of data are we working with — then define your access patterns: how am I actually accessing this data, and how do I need to store it?
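Here is a minimal sketch of those two patterns from the last index: a begins_with subtree aggregation over a hierarchical path, and a scatter-gather read across the write-sharded ticket partitions. The names (HRTable, GSI3, GSI3PK/GSI3SK, the TICKETS#n shard keys, a last-touched timestamp in the sort key) are assumptions for illustration, not the schema from the slides.

```python
# Minimal sketch, assuming boto3 and hypothetical names (HRTable, GSI3,
# GSI3PK/GSI3SK, "TICKETS#<shard>"); the real key layout in the talk may differ.
from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("HRTable")


def employees_in_subtree(city: str, path_prefix: str) -> list:
    """Subtree aggregation: the sort key holds a hierarchical path such as
    'Building1#Floor4#Aisle9', so begins_with('Building1') returns the whole
    building and begins_with('Building1#Floor4') returns just that floor."""
    resp = table.query(
        IndexName="GSI3",
        KeyConditionExpression=Key("GSI3PK").eq(city)
        & Key("GSI3SK").begins_with(path_prefix),
    )
    return resp["Items"]


def languishing_tickets(num_shards: int = 20) -> list:
    """Scatter-gather across the write-sharded ticket partitions: query every
    shard for tickets whose last-touched timestamp is older than 24 hours.
    A production version would paginate on LastEvaluatedKey and could issue
    the per-shard queries in parallel."""
    cutoff = (datetime.utcnow() - timedelta(hours=24)).isoformat()
    stale = []
    for shard in range(num_shards):
        resp = table.query(
            IndexName="GSI3",
            KeyConditionExpression=Key("GSI3PK").eq(f"TICKETS#{shard}")
            & Key("GSI3SK").lt(cutoff),
        )
        stale.extend(resp["Items"])
    return stale
```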
Okay, let's define the data model and come back with a set of clearly defined conditions that I'm going to use to produce the results I'm looking for. And again, I built this entire model in an iterative fashion: I started with one or two access patterns and kept adding more and more. With real data, sometimes you might have to run an ETL over the existing table and decorate the existing items with new attributes, or change the values of existing attributes, but the reality is you're not going to have to throw the baby out with the bathwater. This is an evolution, and it's not dissimilar to what you go through with a relational database — we're just much more familiar with that.

Okay, let's get into designing for common patterns. This is about understanding when to use big items, and why big items might not be the right thing for a particular use case. Let's talk about a customer I was speaking with recently: an insurance quoting service, a fairly successful business doing about 800 quotes per minute at peak, with 1,000 WCUs provisioned on their DynamoDB table. The data model looked like this: consumers come in, open an insurance quote, and start creating versions of that quote by changing things — "I drive more than that," "actually I live here," and so on. Every customer might change the quote two or three times, and every time they changed it, the application created a new copy of the quote with a new version. They had the customer ID as the partition key, and the sort key was along the lines of quoteID_version. When a customer asked for the current version, they would query with ScanIndexForward=False and Limit=1 to get the most recent version, and as customers paginated back through older versions, the application would go back to the database to fetch each one. It worked great for them — they loved it.

They came to me asking about optimizations: is there anything we can do here? I said, sure — the first thing to look at is how much data actually changes when you update a quote. Is it a lot or a little? It turns out only a couple of attributes change; in fact, 90% of the quote is effectively immutable data. It's like an address — it's not technically immutable, but it never changes: address, name, phone numbers, things like that. Yet every time they created a version of the quote, they were writing all of that data back to the database, creating a 50-kilobyte update. So what we said was: store the deltas on the table, inside the partition, and change your schema slightly. Now what you've got is a quote ID plus version items that carry just the sections of the quote that changed; when you query the quote, you send down all those version objects and push the logic that assembles the most current view out to the endpoint — right in the browser. With that change they dropped the provisioned capacity on the table from 1,000 WCUs to 50, because most of the time they're not writing big blobs; every now and then they create a brand-new quote, but most of their traffic is just small updates to existing quotes. A quick sketch of that delta read path follows below.
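As a rough illustration of that delta-versioning read path, here's a minimal boto3 sketch. The table and key names (QuoteTable, PK, SK) and the idea of a base item followed by delta items are assumptions for illustration; the customer's actual schema was only described at a high level in the talk.

```python
# Minimal sketch, assuming boto3 and hypothetical names (QuoteTable, PK, SK);
# the quote partition holds one full base item plus small delta items.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("QuoteTable")


def current_quote_view(customer_id: str, quote_id: str) -> dict:
    """Fetch the base item plus all delta items for one quote and fold the
    deltas, oldest to newest, into the current view. Per the talk, this
    assembly would run on the client (browser), not the application server."""
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(customer_id)
        & Key("SK").begins_with(f"{quote_id}_"),
        ScanIndexForward=True,  # oldest first: base item, then deltas in order
    )
    view: dict = {}
    for item in resp["Items"]:
        # Each delta item carries only the attributes that changed, so a simple
        # merge over the base item reproduces the latest version of the quote.
        view.update({k: v for k, v in item.items() if k not in ("PK", "SK")})
    return view
```

The write side shrinks accordingly: instead of rewriting a 50 KB copy of the quote per version, each update writes only the handful of attributes that changed, which is what lets the provisioned capacity drop from 1,000 to 50 WCUs.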
And if they ever go above 50 WCUs, that's okay — they'll just use the burst capacity to absorb it. If four or five quotes get created simultaneously, they have no problem; they see spikes over their provisioned WCUs regularly, and they're fine with that. So this is a really good example of optimizing the data model by thinking about how the system is implemented and how it affects the entire stack. We have a lot of customers who do things like this and come back to say, "I was able to decrease my application server fleet because I'm not handling that data in the application server anymore — I just pass it off to the front end and let it do the triage." As a matter of fact, in this system they don't even have to come back to the server to get the previous versions: as the customer pages back and back through versions, there isn't a single request going back to the API of their service. It's all sitting there on the front end, and the client knows what it has.

One thing I have to mention before we stop: NoSQL Workbench for DynamoDB. I don't know how many folks are aware of this — we released this tool a couple of months ago, and it's the best tool out there today for modeling NoSQL databases. It will build the types of views I show you in my presentations, these aggregate views of the table. What's nice is that you can define some sample JSON data, throw it into the modeler, start defining indexes on your various attributes, and it will show you how your sample data lays out. It can also load from a relational database — you can connect one directly — and it has a code generator you can use to help build your own applications. It's rock-solid stuff. These charts will be online, and if you google "NoSQL Workbench for DynamoDB" you'll find it easily enough. It's really a great tool; I have more and more customers coming to me with their models already done, which is nice because I can just load the model in, we make our tweaks, and I can show them exactly what needs to happen.

All right, conclusions. NoSQL is not about "non-relational" data — please don't use that term. You'll notice I didn't use it very often, other than to say it doesn't exist, because it doesn't: it's a marketing term invented by people trying to describe a technology they didn't understand to other people who didn't understand it either — "it's not relational, so it must be non-relational." Don't fall into that trap. Data is still relational, and the ERD still matters. The relational database is not deprecated by NoSQL: use NoSQL for OLTP or decision support systems at scale, and use the relational database for OLAP. But I always think cloud native first — the elasticity of the AWS compute cloud is amazing, and the scale of cloud-native services like DynamoDB is unparalleled, so I would go there first.

We've got a lot of resources for you when you want to learn more about AWS database services. Go to AWS Training and Certification: lots of online resources and digital training courses, and we've got some really good stuff coming out from our partners as well — Linux Academy, I think, just released a new DynamoDB data modeling course that's definitely worth a look. There's
never been so much information out there to help you master this stuff, and hopefully you got some good information from me today. Thank you very much for your time. And by the way, if you didn't like the session, it was DAT450; if you liked it, it was DAT403 — please rate me.
Info
Channel: AWS Events
Views: 79,285
Rating: 4.9241705 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, DAT403-R1, Databases, Amazon DynamoDB
Id: 6yqfmXiZTlM
Length: 59min 12sec (3552 seconds)
Published: Fri Dec 06 2019