AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)

Video Statistics and Information

Reddit Comments

I watched this one before. The guy is REALLY good and talks very fast. One of those gurus who has been doing Dynamo DB since before it existed...and mongo db before that.

One of the best talks on how to design your indexes and STOP thinking relational.

👍︎︎ 4 👤︎︎ u/[deleted] 📅︎︎ May 23 2019 🗫︎ replies

Agree, first resource/video I recommend for anyone starting to model in DynamoDB

👍︎︎ 3 👤︎︎ u/Naher93 📅︎︎ May 23 2019 🗫︎ replies

My jaw dropped when I watched that video a few months back. Denser Information than in a black hole :-)

👍︎︎ 2 👤︎︎ u/jkuehl 📅︎︎ May 23 2019 🗫︎ replies

I haven't watched this video yet (unless it's the one where they spend a bit of time talking about how to avoid hot keys in Dynamo, in which case I saw it a while back), but I still have trouble finding a suitable case for NoSQL databases in my day-to-day work.

It's not that I don't see the benefit of NoSQL, it's just that any dataset I've worked with ends up having so many ETL jobs run on it that NoSQL loses its benefit.

The one case I had for NoSQL recently was going to go to DocumentDB. After some analysis, we saw that we were only ever accessing a record by ID, and then flat files in S3 became much cheaper than DocumentDB to run (by nearly 2-3 orders of magnitude)

👍︎︎ 1 👤︎︎ u/krewenki 📅︎︎ May 23 2019 🗫︎ replies

I've done this a few times at scale now, and my co-worker wrote a little more about our experiences with some tips and more thoughts here. https://www.trek10.com/blog/dynamodb-single-table-relational-modeling/

👍︎︎ 1 👤︎︎ u/shortj 📅︎︎ May 24 2019 🗫︎ replies

I love this video so much. I thought I knew DynamoDb until I saw this and it blew my mind.

👍︎︎ 1 👤︎︎ u/stefano_vozza 📅︎︎ May 24 2019 🗫︎ replies
Captions
My name is Rick Houlihan. I am a principal technologist for NoSQL at AWS; that means I do a lot of DynamoDB work, although I work across technology stacks - I do a lot of MongoDB and Cassandra as well. Most of the design patterns I'm going to talk about today actually apply to all NoSQL databases. They're going to be presented in the form of a wide-column key-value store, which is what DynamoDB is, but if anyone's interested in figuring out how to apply these design patterns to MongoDB, track me down after the session and I'll show you how to do it in a document store as well. I do that a lot, and one of the big messages I'd like people to take away is that there's really not a lot of difference between the various technology platforms; the design patterns pretty much apply across the board.

So what are we going to talk about today? I always like to start with a brief history of data processing to set the tone and the mindset for why we're even looking at NoSQL. It's pretty important that we understand this, because we've had this great technology, the relational database, for many decades; it seems to do everything, so why would I spend my time learning this new technology that seems so alien compared to what I already know? We'll talk a little bit about that before getting into an overview of DynamoDB. I'm not going to spend a lot of time on that - this is a 400-level session, so it will be a very brief overview of the NoSQL service offering we call DynamoDB. Then we'll get into the real meat of the discussion, which is NoSQL data modeling: we'll talk about normalized versus denormalized schemas, what that really means, and how we build data structures into a NoSQL database like DynamoDB. Then I'll get into some of the common design patterns. Historically, over the last couple of years, I've focused on basic use-case design patterns, but this time we're going to go deep into relational modeling. I'm going to focus on composite key structures for the most part; we'll talk about how to translate hierarchical data models and relational data models into NoSQL. This represents a lot of the work my team does, which is working with global strategic accounts as well as our internal Amazon retail teams to migrate from relational database application services to NoSQL databases. We'll close out with a quick discussion about serverless and talk about modeling real applications, and I'll give you an example of a real Amazon service with a very complex and long list of access patterns.

Again, the first thing I like to talk about is the history of data processing, and I love this quote - I don't know who said it - but listen to and look at what happened in the past so that we don't repeat the mistakes of the past; that's a lot of what this message is about. If you look at the timeline of database technology, it really comes down to a series of what I call peaks and valleys in data pressure. Data pressure is the ability of the system to process the amount of data I'm asking it to process at a reasonable cost or in a reasonable time; when one of those dimensions is broken, that's a technology trigger, and we invent things - and we've done a lot of that invention over the years. The first database we had was a really good one: we're all born with it, it's stuck between your ears, and it's highly available.
When my eyes are open, it's online, but it has maybe questionable durability, about zero fault tolerance, and it's a single-user system. Pretty soon we figured out that we needed to do something better than that, so we started writing things down and developed the system of ledger accounting, our first structured data store, which ran public- and private-sector applications for several millennia - until the 1880 U.S. Census came along and a man named Herman Hollerith was tasked with collating and processing all the data that was collected. If you're familiar with the U.S. Census, it runs every ten years, and it took Mr. Hollerith and his team about eight of those ten years to process the 1880 data. He figured out he needed to do something different, so he invented the machine-readable punch card and the punch card sorting machine, and the era of modern data processing was born - as was a small company called IBM, which has a long and storied history in database technology.

From there we rapidly developed many technologies as public- and private-sector applications started to consume them and produce applications that required more and more data: paper tape, magnetic tape, distributed block storage, random-access file systems, and then, around 1970, along came the relational database. It's important to understand why we built the relational database: we did it because storage was expensive - extremely expensive. In the mid-80s I was at Macworld in Moscone Center in San Francisco, walking through the convention center, and I saw a truck transmission in the middle of the conference room floor. I thought, why is there a truck transmission here - maybe there was an RV show or something and they couldn't get it out. I walked over to look at it, and it wasn't a truck transmission; it was a hard drive from 1974, cross-sectioned. It was really cool-looking, but it had a sticker on it that said four megabytes, MSRP $250,000. That's pretty expensive. Obviously we didn't use a lot of magnetic disk in 1974, but the point is that storage was extremely expensive, so normalizing the data - reducing the footprint of that data on disk - was extremely important, and that's what we did. The relational data store is a wonderful way to reduce the storage cost of your application. And what else does it do? It increases the CPU cost, because of the complex queries it must execute to produce the denormalized views of data that your application consumes - joining tables is extremely expensive. Now fast-forward 30 or 40 years, and the most expensive resource in the data center is actually the CPU, not the storage. So why would I want to use a technology that optimizes for the least expensive resource in the data center? That is really why we're looking at NoSQL today: because we want to do things more cost-effectively, easier on the wallet, so to speak.

When we use new technologies, it's important to understand how to use them before we use them. Most of the teams I work with actually fail this test: they take their relational design patterns, they deploy multi-table implementations and normalized data models in NoSQL, and then they wonder why it's not working, why it's so terrible. It's really due to this effect. If you look at the bottom, this is the technology adoption curve; we're all very familiar with this.
In the beginning we have innovators running around solving a problem: a technology trigger has occurred, the data pressure in the system is too high, and we need something that can process this data more efficiently. They land on a solution - in this case we're talking about NoSQL technology - a few people have some really good results, and the rest of the market starts to move there. As people start to deploy the new technology, they realize it doesn't work - not for them; their use case must be different, the other people must be doing something else or have a different application. The reality is, no, you're probably just using the new technology the same way you used the old technology, and typically new technology doesn't work the same way. If you actually learn how to use it first, you'll have a better result. If you look at the bottom chart, relational technology is way out on the right-hand side with the laggards: if you don't understand what a join is today, you've been living in a cave for thirty years and I can't help you. But if you don't know how to build a denormalized data model, well, that's fully understandable, because NoSQL technology is over on the left-hand side, where the innovators are still operating. We're still in that technology gap; people are still trying to understand this new technology. So if you take the time to actually learn how to model your data correctly in a NoSQL database, you're going to have a much better result, and that's what we're going to talk about today: how do I actually model the data?

Before we get there, it's important to understand that the relational database we've been using for 30 or 40 years still has a very good place in the modern application development environment, and it really comes down to the types of access patterns we're trying to deal with in this application service. Are they well known and well understood? Then maybe that's a perfect application for a NoSQL database, where I have to structure the data specifically to support the given access patterns - that's how NoSQL databases can be more efficient. But if I need to support ad-hoc queries - maybe a BI analytics use case, an OLAP-style application - that might not be the best application for NoSQL, because NoSQL databases aren't really good at reshaping data. They like simple queries - select star from - and they don't like complex queries, inner joins, calculated values, those types of things; NoSQL databases are not very good at that. So when it breaks down, the OLTP application is an excellent application for NoSQL databases, and good for us, that's 90 percent of the applications we build, because they represent very common business processes: I go to amazon.com, I hit the order button, and the same thing happens every time. That's really the crux of when it's a good decision to use a NoSQL database: if I understand the access patterns very well, and they're repeatable and consistent, then we go to NoSQL; if not, let's look at the SQL database. It's not obsolete - we just have a different choice now.

Amazon DynamoDB is a fully managed NoSQL database. How many people in the room have actually run NoSQL databases at scale - I'm talking 50 or more nodes? Yeah, not very many. When you get there, you will realize that managed services are really cool.
Most of the customers I work with have scaled out their Cassandra clusters or their MongoDB clusters, and oftentimes I'll work with them to correct whatever mistakes they made in their data models to get a little more life out of that cluster, but sooner or later they're going to say: it's too expensive, it's hard to run this, I don't want to manage it anymore, I have a 24/7 NOC that I'm staffing 365 days a year to manage all of this infrastructure. Whether it's running on EC2 or on-premises is irrelevant - it's the same cost: server updates, patching operating systems and software, rebuilding storage devices, failed nodes. Nobody wants to do this; it's not core to your business. So the fully managed aspect of a NoSQL database service is really the most powerful feature you can have.

It's a document or key-value store - what we really mean is that it's a wide-column key-value store that supports a document attribute type; I'll talk a little bit about that later on. It really scales to any workload and is fast and consistent at any scale: DynamoDB has single tables running upwards of 4 million transactions per second with low single-digit millisecond latency. As a matter of fact, one of the most interesting characteristics of DynamoDB is that the busier it gets, the more consistent the low-latency responses become. That's because we have a fully distributed request router, and as you start to hammer the request router more and more, the partition information for your NoSQL table gets cached across the front end and there are no more lookups on that configuration table. If you look at services like Snapchat: when the Red Sox won the World Series, they were peaking out at around 3.5 million transactions per second, and the graph is really interesting because the latency drops to a very flat, low one-to-three-millisecond range as they approach their peak.

There's also fine-grained access control. In DynamoDB we can restrict access to the table itself, to the items in the table, and to the attributes within an item. So if I have an application process running against a data store whose items contain information I don't want visible to my order-entry clerks - maybe annotated data like a salesperson's commission data that I only want sales managers to see - I can have those access patterns hit the table with different IAM security permissions, which gives me fine-grained control over the data and who can read it. And it's a backplane service, which is really great when you talk about serverless programming; we'll talk about that at the end and why it's a big value to customers today.

DynamoDB has tables, like all databases, but the table in DynamoDB is more like a catalog in a relational database: you're going to put many, many items into the table, and the items in the table don't always have to have the same attributes. They do have to have one attribute that uniquely identifies the item, and that's the partition key, so every item I insert into a DynamoDB table must at least have a partition key attribute. It can also include an optional sort key attribute, and when I do that, it gives me the ability to execute complex range queries against the items in those partitions. You can think of the partition as a folder or a bucket that contains items,
and the sort key orders the items within that folder, so when I query those partitions I can use complex range operators. In this example, the partition key might be a customer ID and the sort key might be the order date, and the primary access pattern for the app is: give me all of the customer's orders in the last 24 hours. I can query the table where the partition key equals customer ID X and the sort key is greater than 24 hours ago, and that gives me a nice filtered list of orders by customer over the last 24 hours. That's a really good way to maintain a one-to-many relationship, and you typically want to model the table to support one of your primary access patterns - when we get into the actual data modeling you'll see what I mean by that. You always want the table itself to be able to answer something that is interesting to the application.

Partition keys are used to uniquely identify the item, and they're also used to distribute the items across the key space. Every table in DynamoDB represents a unique logical key space, and we distribute the items by taking that partition key attribute, creating an unordered hash index, and laying the items out across this virtual key space. This is the way all NoSQL databases actually work. When we scale the database, we just chop that key space up and spread those items out across multiple physical storage devices. Now when I query the system, I always provide that partition key as an equality condition, so the system knows exactly which storage node to go to to read that data. This is what makes all NoSQL databases fast and consistent at any scale: there's automatic routing of the query to the exact storage node that needs to serve the request. When I include the range key - the sort key - in the table schema and provide a sort key condition on the query, I go into the partition and selectively read the items, which are co-located on the same storage node and sorted using that sort key attribute. Again, this is how NoSQL databases maintain that fast, consistent behavior; it's not unique to DynamoDB - all NoSQL databases have this construct.

Partitions in DynamoDB are automatically three-way replicated. When I write to DynamoDB, the client gets an acknowledgement once two of the replicas have received that write. When you read from DynamoDB, you have a choice between eventually consistent and strongly consistent reads, and it's up to you which to use. We recommend the eventually consistent read, because with that three-way replication the primary-to-secondary lag is sub-millisecond - you're not all that eventually consistent; pretty much by the time you round-trip back from the client, the data will have replicated to the secondaries. Eventually consistent reads are half the cost of strongly consistent reads because we have more nodes to choose from: on an eventually consistent read, we randomly read from one of those three replicas; on a strongly consistent read, we read from the primary node. The primary node always accepts the write, so you're guaranteed a strongly consistent read off the primary, but it is twice the cost. So a cheap way to double the capacity of your application is to use eventually consistent reads - and they are on by default, so unless you change that parameter, that's what's going to happen.
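As a concrete illustration of that primary access pattern (partition key equality plus a sort key range), here is a minimal boto3 sketch. The table name, attribute names ("customerId", "orderDate"), and the ISO-8601 timestamp format are assumptions for illustration, not the exact schema from the talk.

```python
# Minimal sketch: "all of this customer's orders in the last 24 hours".
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")  # hypothetical table name

cutoff = (datetime.now(timezone.utc) - timedelta(hours=24)).isoformat()

# Partition key equality plus a sort key range gives one selective read.
resp = table.query(
    KeyConditionExpression=Key("customerId").eq("customer-123")
    & Key("orderDate").gt(cutoff)
)
for item in resp["Items"]:
    print(item["orderDate"], item.get("status"))
```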
We have two types of indexes in DynamoDB: local secondary indexes and global secondary indexes. Local secondary indexes allow you to re-sort the data within the partitions. Let's say I need to get all the backordered items for a customer. On the primary table, say the customer ID is the partition key and the order date is the sort key; on my LSI, I create a partition key which is the same as the table's - which it must be - the customer ID, and I re-sort the data using, maybe, the order state. Now I can query the LSI and say give me all the backordered items for a customer, and I can query the table and say give me all the orders for a customer in the last 24 hours. It's a different access pattern, and I re-sort the data to support that access pattern, but local secondary indexes must always use the same partition key as the table - it's a way to re-sort the data, not regroup it.

The alternative is a GSI, a global secondary index, which allows me to create a completely new aggregation of the data. Where the primary table groups the orders by customer, maybe my global secondary index groups the orders by warehouse: the partition key on the global secondary index would be the warehouse ID and the sort key would be the order date. Now if I need the orders for a given warehouse in the last hour, I can query the GSI by warehouse ID with a sort key operator saying greater than one hour ago, and that gives me everything for that warehouse in the last hour. So you're getting the idea: as we model the data, we're going to use these indexes to regroup, re-sort, and re-aggregate the data to support secondary access patterns, using the same types of key structures. It'll be interesting when we get into the modeling to see some real-world examples of how this works.

GSI updates are eventually consistent; LSIs are strongly consistent - something to remember when you're working with these. When you make an update to the table, you get the acknowledgement back to the client and an asynchronous process kicks off to update the global secondary index. It's very important to know that the GSIs have to have enough capacity allocated, or the table could end up being throttled, because we need to maintain consistency between the indexes and the table. If you're writing data into the table faster than the GSIs can replicate, eventually they're going to back up; there's a small buffer between the GSI and the table, but if that buffer overruns, we're going to throttle writes to the table until the GSI can catch up. So your provisioning on the GSIs must match the throughput of the table.
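A hedged sketch of what that looks like as a table definition: one LSI that re-sorts within the same partition and one GSI that regroups by a different partition key, roughly matching the orders example above. All table, index, and attribute names are illustrative assumptions.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderDate", "AttributeType": "S"},
        {"AttributeName": "orderStatus", "AttributeType": "S"},
        {"AttributeName": "warehouseId", "AttributeType": "S"},
    ],
    KeySchema=[  # table: group by customer, sort by order date
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "orderDate", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "byStatus",  # same partition key, re-sorted by status
        "KeySchema": [
            {"AttributeName": "customerId", "KeyType": "HASH"},
            {"AttributeName": "orderStatus", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    GlobalSecondaryIndexes=[{
        "IndexName": "byWarehouse",  # a completely new aggregation of the data
        "KeySchema": [
            {"AttributeName": "warehouseId", "KeyType": "HASH"},
            {"AttributeName": "orderDate", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        # GSI write capacity should keep pace with the table, or the table
        # can be throttled while the index catches up.
        "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```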
All right, scaling. NoSQL technology does work; you just have to know how it works - I like Douglas Adams, so I put that quote up there. But this is what bad NoSQL looks like, and this is what usually happens when people don't understand how to use the new technology. This is a heat map, and we'll run this for customers - I don't necessarily even need to see it; I can look at a table's CloudWatch metrics and say, okay, you're throttling well below your provisioned capacity, which probably means you have a hot key. In this particular case we have the partition count on the y-axis and time on the x-axis, and you can see by that big red line that all of our access is hitting a single storage node. Remember I said key space: we chopped up that key space - in this example into 16 different partitions - but something in the application layer is causing a high-velocity access pattern against a small number of keys, or a single key, and it's causing one storage node to light up. That is an anti-pattern in NoSQL. We can force this condition in any NoSQL database - Cassandra, MongoDB, DynamoDB, it doesn't matter. We want to distribute the access pattern; that's what NoSQL is about - it's a fully distributed database.

Getting the most out of Amazon DynamoDB throughput is oftentimes about using partition key elements with large numbers of distinct values - high-cardinality sets. We don't want binary partition keys - true or false, in or out - those are bad partition keys because they aggregate so much of the data into a small number of partitions. What we really want are things like UUIDs: large numbers of distinct values, accessed evenly over time. Let's say we get all of our customers to line up and take a number so they come in in a nice orderly fashion - we all know that doesn't happen; there are thundering herds and all kinds of high-demand access patterns. But if we can get that data spread out, and get those requests arriving more evenly spaced over time - and oftentimes this is about distributing the data - we'll have a much better picture when we look at that heat map. In this particular example I would probably tell the customer to de-provision the table, because it's not really doing much; it's very underutilized. I usually like to see a little bit more color - things evenly distributed across the key space, no big red lines. We want pepperoni pizzas on these heat charts.

One more thing to mention about DynamoDB: one of its biggest values is the elasticity. When I manage a NoSQL cluster like MongoDB or Cassandra, I have to provision it for peak load, and that capacity doesn't go away - I can't take those shards away, I can't take those nodes off the ring; I have to keep replicating data and managing that infrastructure whether or not it's doing anything. With DynamoDB we give you a really neat technology to deal with the elastic demand of your application. This is a real application service that runs in one of our fulfillment centers. You can see that before, without auto scaling, there was this high bar of provisioned throughput, and everything under that line and above the blue curve is wasted dollars. With auto scaling, as the demand for your application ebbs and flows, you'll see the capacity adjust on demand to meet the application's access requirements. This is a really good way to manage the cost of your database. NoSQL databases in general can't do this, so a managed service like DynamoDB is really valuable because it provides that elasticity, which is a huge cost savings over time: in the middle of the night, when your application isn't doing anything, you don't really want to be paying for all of those resources to be running. That's one of the nice things about DynamoDB.
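For reference, a minimal sketch of turning on that kind of target-tracking auto scaling for a table's write capacity through the Application Auto Scaling API. The table name, capacity bounds, and the 70% utilization target are illustrative assumptions, not values from the talk.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Declare the table's write capacity as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",                     # hypothetical table
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=1000,
)

# Track a target utilization so capacity follows the demand curve.
autoscaling.put_scaling_policy(
    PolicyName="orders-write-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/Orders",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed capacity near 70% of provisioned
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```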
All right, let's get into NoSQL data modeling. It's not for the faint of heart, but it wasn't really designed to be; it was designed to maximize the efficiency of your access patterns, and this is one of the things that, as developers, we really have to understand and embrace, because we're so used to developing with relational technology. Data modeling in NoSQL is different - it's hard. There are a lot of differences between how I model data in NoSQL and how I model data in relational databases, but the bottom line is that the data is still relational; it doesn't stop being relational just because I'm using a different database. It's the same entity relationship model that we're going to build, manage, and support with the application service, and it doesn't matter what type of application we're building - social networking, document management, IT monitoring, process control - every application you can think of has a data model that is relational in nature. So how do I deal with relational data in a NoSQL database? A lot of people call NoSQL non-relational; you'll notice I don't even use the word non-relational, because the data is relational - it has to be, or we wouldn't care about it.

How have we done this in the past? We've used the normalized model. This is an example of a product catalog where I have products of three types: books, albums, and videos. You can see all the common relationships we track and manage with a relational database in this structure: one-to-one relationships between products and books, albums, and videos; a one-to-many between albums and tracks; and a many-to-many that goes through a lookup table between videos and actors, because actors can be in many videos. This is a complex set of queries that need to be executed to get a list of all my products - three different queries joining up to four tables - and this is why relational databases cannot scale: the CPU is going nuts, hopping all over the disk, pulling data off all of these tables, sticking it together into a denormalized view, and serving it up to the application layer. And on the flip side, when I need to update that data - because what I'm really doing when I execute those queries is populating application-layer entities - the data lives in multiple places, so now I need ACID transactions. A lot of the need for ACID transactions really comes from the data model we use with relational databases. Sorry, I'm losing my voice - it's been a long week already.

So maybe a better approach is to not do that. We don't want to burn CPU like that; we want to take those hierarchical data structures and collapse them into documents, or collections of items within single partitions, that represent these data hierarchies. Now, instead of executing three queries with various degrees of complexity, I'm executing one simple query: select star from products, and if I want all my books, select star from products where type equals book. These are much simpler queries and much simpler access patterns, and you can see immediately why the system scales better with this type of hierarchical data model than with the relational data model: I'm not executing a complex operation to assemble the view; I'm just going in and getting documents or collections of items out of single partitions, and I don't have to join data to create those views.
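A hedged sketch of what "collapsing the hierarchy" can look like in DynamoDB terms: the product's metadata and its child items live under one partition key, so a single query returns the whole entity with no joins. The generic key names (pk/sk) and item shapes are assumptions for illustration, not the catalog schema shown in the talk.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Products")  # hypothetical table

# The product hierarchy stored as an item collection in one partition.
with table.batch_writer() as batch:
    batch.put_item(Item={"pk": "PRODUCT#101", "sk": "METADATA",
                         "type": "album", "title": "Example Album"})
    batch.put_item(Item={"pk": "PRODUCT#101", "sk": "TRACK#01", "name": "Track One"})
    batch.put_item(Item={"pk": "PRODUCT#101", "sk": "TRACK#02", "name": "Track Two"})

# One simple query instead of joins across products/albums/tracks tables.
album_and_tracks = table.query(
    KeyConditionExpression=Key("pk").eq("PRODUCT#101")
)["Items"]
```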
When it comes down to it, there are really just a few concepts we need to understand when we get into data modeling in DynamoDB, and the key ones are selecting the partition key and the sort key. As we discussed, the partition key is about large numbers of distinct values that are uniformly requested over time - bad examples would be status and gender; good examples might be customer ID or device ID, things that allow me to distribute the data. Selecting the sort key is about modeling the one-to-many and many-to-many relationships we need to support in the data model, and building a sort key that lets me execute very efficient, selective patterns - this is what we'll get into when we start talking about composite key modeling in a few minutes. It's about querying across entities with a single trip to the database: I want to get all of the items I need to support an access pattern without going back multiple times - go get my customer item, now go get all the order items for that customer. That's a relational pattern, and I see it all the time when I work with developers, because they're used to modeling data relationally, so they do it that way, and then their access patterns become very inefficient because they're really managing that join at the application layer. Good examples of what we want are orders and order items - hierarchical relationships - which we'll walk through in a few minutes.

It's also important to understand how the modeling process differs between NoSQL and a relational database. With a relational database, all I need to do is normalize the data - we have this neat thing called third normal form. I could probably walk up to almost anybody in this room, say here's my business problem, here's my entity relationship model, can you give me a data model for my relational database, and everybody would be able to sit down and build that third normal form; then we could argue about which queries are more efficient and add an index or two after the fact. With NoSQL, the reality is the opposite: I need to understand every access pattern. I need to know exactly what the application is doing, because if I don't, I can't model the data in a way that's going to be efficient for that particular application service.

So the first thing we want to do is understand the use case. What is the nature of your application - is it an OLTP app, an OLAP app, a decision support system? There are very different requirements for those applications, and one of them doesn't fit: the OLAP application does not fit with a NoSQL back end. Define the entity relationship model: what is the data I'm working with, what is its nature, and how is it related? Then identify the data lifecycle: what's my archive and backup strategy, do I need to TTL this data - what is the lifecycle of the data on the table? The next thing is to identify all of the access patterns of the application. This is what I do when I sit down for design reviews with customers: how do you read the data, how do you write the data, what's the write pattern, what's the read pattern, what are the aggregations we're trying to support with this particular application service?
We want to document all of those workflows up front. Why? Because I'm designing a data model that is very specifically tuned to those access patterns, and if I don't identify all of those patterns, I could be in a lot of trouble when I go to deploy - I might do a lot of work and then have to unwind a lot of what I've done. One of the things I hear a lot is to use NoSQL because it's very flexible. I've done a thousand NoSQL applications, and I can tell you nothing could be further from the truth: NoSQL is not a flexible database, it's an efficient database. The data model is very much not flexible, because - as you'll see when we get into the actual modeling and how you build real services on NoSQL with complex access patterns - the more I tune the data to the access pattern, the more tightly coupled to that service I am. So it's not really a flexible database, but it is a very efficient database to use at scale. The next thing you do is actually model the data, and this is where we get to the common mistake everybody makes when they build a NoSQL application: they start building multiple tables and relational design models. It's about one application service requiring one table - we'll get into that when we talk about the data modeling in a minute, and I'll show you some pretty complex services that have been modeled down to a single table. Identify the keys, how you're going to access the data, and the actual queries you're going to execute; define your indexes for your secondary access patterns; and then it's an iterative process, just like any development process - we review, we repeat, and sooner or later we get this thing down to a science.

All right, complex queries. It's all about the questions: computers give you answers, but we have to ask the right questions - so they're not always useless, though some people have thought so. One of the things NoSQL databases aren't so good at is answering complex questions: what is the count, the average, the sum, the maximum, the minimum in a given set - all kinds of complex computed aggregations, things that stored procedures might handle. One of the really neat things about DynamoDB is DynamoDB Streams and Lambda - it's like the best stored procedure engine in the business, because it's completely disconnected from the table space. One of the things we did with Amazon's retail organization - and one of the reasons we migrated off of Oracle - was that we had a problem with service teams deploying stored procedures into an Oracle server that was shared across multiple teams; somebody would deploy some bad code, and all of a sudden we'd have three or four services or more going belly-up because the processing space of the head node of the database server got knocked sideways. One of the nice things about Streams and Lambda is that all of the processing of the data occurs in a different processing space than the table, so you don't have to worry about impacting the availability of your DynamoDB table. We can deploy really bad code to Lambda and it's not going to kill us - not that we do that, obviously.
The way Lambda works is in conjunction with the DynamoDB stream. The stream is the change log for the DynamoDB table: all write operations appear on the stream, and once the data is on the stream you can invoke a Lambda function. That Lambda function has two IAM roles - remember the fine-grained access control. Lambda has an invocation role, which defines what it can see or read off of the stream: which items can it see, and which attributes in those items can the Lambda process actually read. And it has an execution role, which defines what it can do: what other services within your AWS account does it have access to, and what permissions does it have on those services to work with this data. So what do people do with this? In this example, not very much - it's just dumping those attributes out to the console - but it's code, and code can do anything, and people do lots of things with it.

One of the most common things we do with Streams and Lambda is computed aggregations: people need to know the averages, the counts, the sums. One of the nice things about MongoDB, when you're working with small data, is the aggregation framework - how many people have used the aggregation framework in MongoDB? A few, okay. I loved the aggregation framework when I was in MongoDB, until I had to scale it; it doesn't scale too well. One of the nice things about this particular design pattern is that as we read data off of the stream, we can compute running aggregations - counts, sums, averages, or complex computed metrics that we need to maintain at the application layer - and then write that data back into the table as a metadata item. For things like time series data, maybe I have time-based partitions; as I load the time series into those time-based partitions, I can execute my aggregation functions, produce all those time-series metrics, and write that metadata item right back into the partition. A really neat thing about time series data is that once it's loaded, it doesn't change, so we don't have to worry about that metric being recalculated a thousand times a second: the data gets loaded, the metric is calculated once, and now it can be read a million times without being recalculated. This is what we want to do with NoSQL - we want to offload the CPU; we don't want to compute things, we want things to be pre-computed. It's a really neat design pattern, and we have lots and lots of customers using it.

There are plenty of other things we can do with Lambda: update CloudSearch or Elasticsearch or other indexing systems, push the data into Kinesis Firehose for stream processing, interact with external systems - again, Lambda is just code and it can do anything. And it doesn't always have to be Lambda that reads the stream: if you have a high-velocity workflow, then maybe Lambda is not the most cost-efficient thing to use; maybe I'll stand up an EC2 instance and create a static stream reader service, and that's perfectly acceptable as well - we have lots of customers doing that. So realize that Streams and Lambda are there for you to execute those stored-procedure-type operations and complex computed aggregations; it's certainly a valuable service for doing that.
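A minimal sketch of the streams-plus-Lambda aggregation pattern just described: read INSERT records off the DynamoDB stream and maintain a running count and sum as a metadata item written back into the same partition. Table and attribute names are assumptions, and error handling is omitted.

```python
import boto3

table = boto3.resource("dynamodb").Table("Readings")  # hypothetical table

def handler(event, context):
    """Lambda handler wired to the table's DynamoDB stream."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        pk = new_image["pk"]["S"]          # e.g. a time-based partition key
        value = int(new_image["value"]["N"])

        # Atomically bump the pre-computed aggregate for this partition so
        # readers never have to recalculate it.
        table.update_item(
            Key={"pk": pk, "sk": "AGGREGATE"},
            UpdateExpression="ADD itemCount :one, valueSum :val",
            ExpressionAttributeValues={":one": 1, ":val": value},
        )
```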
All right, let's get into composite keys, and I love this quote because it's perfect: most people use NoSQL as a key-value store, and that's not the most efficient way to use it - we actually want to store our hierarchical data in the table, so to speak. So how do we do that? In this use case we have players for a particular game; players have sessions, and sessions have state. What I'm interested in is all the sessions for a given user that have a state of pending - game invites. In this particular case I might have a table that is partitioned on the opponent and sorted by the date, and I want all of the sessions for user Bob, sorted by date and filtered on pending. DynamoDB can support this because we give you two conditions you can apply to range queries. The first one applies to the sort key - that's the sort condition, the date condition in this case. Here I'm saying there's really no sort filter: give me everything ordered by date; since the sort key is the date, all the items are returned sorted by date. That's great, but I'm really only interested in the pending items, so I say filter on pending - that's the filter condition. The sort condition applies before the read, so it gives me a nice selective read; the filter condition applies after the read, so it knocks out the items I'm sending back across the wire, but I'm still paying to read those items - the cost of the read is whatever the sort condition dictates. That's okay in this particular example, because I only have three items and only one of them is being filtered out; all three items are less than one RCU, so the cost of the query is equivalent whether it was more selective or not. But let's say there were 10,000 items in this user's partition and only two of them were pending - I don't want to read 9,998 items just to return two.

The only way to avoid that is to create a composite key. Composite keys are how we create hierarchies using the sort key structure. What we're going to do here is take the status and the date and concatenate them into one key called status_date. When we push that back to the table, you can see what the view looks like - it's like a faceted search. I can say give me everything for this particular user that starts with PENDING, and it gives me only the pending items - a nice selective read. I could say starts with PENDING_timestamp1, or between PENDING_timestamp1 and PENDING_timestamp2, and get a range of items within a given state. So think of this like a faceted search: what I'm really doing is creating a hierarchy, and we'll see how to use that when we get into the advanced data modeling.
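A hedged sketch of that composite sort key ("STATUS_DATE") pattern with boto3. Table and attribute names are illustrative assumptions.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("GameSessions")  # hypothetical table

# Only the pending invites for this opponent -- a selective read, no filter.
pending = table.query(
    KeyConditionExpression=Key("opponent").eq("Bob")
    & Key("statusDate").begins_with("PENDING_")
)["Items"]

# A range of pending invites between two timestamps, still within one state.
pending_range = table.query(
    KeyConditionExpression=Key("opponent").eq("Bob")
    & Key("statusDate").between("PENDING_2018-10-01", "PENDING_2018-10-31")
)["Items"]
```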
Advanced data modeling is about thinking about how the data is used. OLTP apps use data in hierarchical structures with entity-driven workflows; in a relational model, the data gets spread out across tables and requires complex queries to populate the application-layer entities, and multiple queries to update them, which is a primary driver for ACID. When we denormalize the data in a NoSQL database and create hierarchical data items, maybe I don't need much more than atomic updates. There are still times when we might need ACID transactions, though, and one of the good ones is maintaining version history, or creating new items that get built in multiple passes and committed all at once.

In this particular example, we have items on a table, and the first item we put into the table is a v0 item - a copy of the v1 item - that contains the current data for the partition. When a customer wants the most current version of the item, he says select star from the table where the item ID equals 1 and the sort key begins with v0, and he always gets a copy of the current version. If we look at the state of this partition: somebody came along and created item 1 - initially it was version 1, with a copy of version 1 in v0. Then somebody created version 2, committed version 2, and updated the version-0 item: they just clobbered it with a copy of v2 and updated the current-version attribute to indicate that it's now version 2. So when the reader comes along and asks for the item whose sort key begins with v0, he gets that item and can see that it's really version 2. Now somebody has created version 3, but version 3 is not committed - it's just sitting in the partition while work is being done on it; we'll execute multiple updates and build this item in multiple passes. Eventually we're ready to commit version 3, so what do we do? We clobber v0 and update the current-version attribute to version 3, and now any reader that comes into this partition and asks for the item that begins with v0 actually gets v3. The neat things about this pattern: I have an audit trail - everything that changed, and in which version - and you can decorate those items with who changed them, all kinds of nice things. I get the same kind of visibility I'm used to with ACID transactions, read committed and read uncommitted: read committed is begins with v0; read uncommitted is ScanIndexForward false, limit 1, which gives me the item that might still be being worked on, and so on. It's a neat way to maintain version history and have some sort of transactional workflow against a single item.
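A minimal sketch of reading that version-history partition. The "v0" item holds a copy of the current committed version, while the highest "vN" item may be an uncommitted draft; table and key names are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Documents")  # hypothetical table

# "Read committed": the v0 item is always the latest committed version.
committed = table.query(
    KeyConditionExpression=Key("itemId").eq("item-1")
    & Key("version").begins_with("v0")
)["Items"]

# "Read uncommitted": scan the partition backwards and take the top item,
# i.e. the newest version whether or not it has been committed yet.
latest = table.query(
    KeyConditionExpression=Key("itemId").eq("item-1"),
    ScanIndexForward=False,
    Limit=1,
)["Items"]
```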
Now we get into multiple items, and we start getting into some real data modeling. In this example I'm going to use a pretty simple internal service we have at Amazon: a resolver service for configuration items. Configuration items come in, we create resolver groups, we associate the configuration items to those resolver groups, and we have contacts for those resolver groups; when new configuration items come in, we email all the contacts associated to a given resolver group. The data model looks something like this: we have a resolver group entity, and there's a many-to-many relationship between contacts and resolver groups and between configuration items and resolver groups. There are a couple of transactional workflows we have to execute: we want to add configuration items to the resolver groups all at once or not at all - a configuration item can belong to multiple resolver groups, multiple configuration items might come in together, and we want these committed to the resolver groups all at once or not at all. Contacts might need to be added to resolver groups transactionally as well, and we might want to update the configuration item data transactionally. There are a lot of workflows here that require transactional interactions with the data.

If you were up to speed this morning, we announced a really cool new feature in DynamoDB: the transactions API. We now have TransactWriteItems and TransactGetItems APIs, where we can support synchronous updates, puts, and deletes across multiple items, with full ACID compliance and automated rollbacks, up to ten items per transaction. It supports multiple tables - although you generally shouldn't have multiple tables; I didn't want them to do that, but they did, and it's okay - there actually are use cases for multiple tables. I try to drive things to a single table, but even I will sometimes get to the point where it makes sense to split the data; generally speaking, though, it's a single table. Good use cases for transactions: committing changes across items - absolutely love it; conditional batch inserts and updates - we can define multiple conditions within a transaction, and if any of those conditions fail, none of the items get written. A really bad use case: maintaining normalized data models - please don't do that. Transactions are here for you, but they're not a crutch to make your relational models work; that's going to be a really bad pattern for you.

So how does this work in DynamoDB with a single table? In this particular case we have resolver group partitions and contact partitions, and what we've done is create a pretty simple adjacency list - an adjacency list is a simple graph. As we denormalize the contacts across the resolver groups, what I'm really doing is creating a copy of each contact and reinserting it into the table with a different sort key, and that sort key is the resolver group ID. So now in my contact partitions I have a copy of the contact for each resolver group it belongs to; I add resolver metadata into the resolver group partitions, and I add the configuration items into those resolver group partitions as well. When we get into the transactional updates, maybe I have a configuration item I need to add to multiple resolver groups: the TransactWriteItems API gives me the ability to execute that insert into both partitions and guarantee that both inserts commit or neither does - it's up to the transactions API to manage that process, not the application layer anymore. Maybe I need to update the transaction status, or maybe I want to cancel an item, and I can add multiple conditional checks: I can say cancel this configuration item across all resolver groups as long as none of them are in progress. That's a really valid use case for us, because sometimes configuration items get pushed into the system, somebody says oops, we didn't want that, pull it back - I need to recall that configuration item. Maybe I need to update a contact's email across multiple resolver groups, and I don't want to do that outside of a transactional envelope. So there are multiple ways we can execute transactional writes in this use case, and it's a really good example of a single-table design pattern that maintains a complex entity relationship model: we have configuration items, contacts, and resolver groups all living on the same table, and the metadata for all of it lives on the same table.
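A hedged sketch of what the TransactWriteItems call could look like for that use case: inserting one configuration item into two resolver-group partitions, all or nothing. The table name, key shapes, and condition are assumptions for illustration.

```python
import boto3

client = boto3.client("dynamodb")

client.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "ResolverService",  # hypothetical single table
                "Item": {
                    "pk": {"S": "RG#group-1"},
                    "sk": {"S": "CI#config-item-9"},
                    "status": {"S": "PENDING"},
                },
                # Example condition: don't re-add an item that already exists.
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
        {
            "Put": {
                "TableName": "ResolverService",
                "Item": {
                    "pk": {"S": "RG#group-2"},
                    "sk": {"S": "CI#config-item-9"},
                    "status": {"S": "PENDING"},
                },
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
    ]
)
# If either condition fails, neither put is applied.
```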
Just to show you how that works with indexing: we add one GSI to do our reverse lookup, so I can look up contacts by resolver group on the GSI. Remember, we denormalized our contacts across the resolver groups, so if I go to the reverse-lookup GSI, I can query with the resolver group ID as the partition key and get all of the contacts for a given resolver group, or get all of the resolver groups for a given configuration item. All I've done is take that primary table and create a reverse lookup on the partition and sort key: the sort key becomes the partition key, and the partition key becomes the sort key of the GSI, which gives you that reverse lookup. This is a way to maintain many-to-many relationships - if you look back at the ERD, we had a many-to-many between resolver groups and contacts and a many-to-many between configuration items and resolver groups, and in this data model I've maintained those many-to-many relationships across all of those entities using the primary table and the reverse-lookup GSI.
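A minimal sketch of querying that reverse-lookup GSI, where the index swaps the table's partition and sort keys. Index and attribute names are illustrative assumptions.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ResolverService")  # hypothetical

# On the table:  pk = contact,        sk = resolver group  (contact partitions)
# On the GSI:    pk = resolver group, sk = contact          (reverse lookup)
contacts_for_group = table.query(
    IndexName="reverse-lookup",
    KeyConditionExpression=Key("sk").eq("RG#group-1"),
)["Items"]
```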
All right, getting into hierarchical data - we touched on some data hierarchies there, but this is another service we use internally at Amazon, for getting office information. If we go to our wiki page and click on a particular office or building, this is the service it goes back to to search for offices. In this example we have a pretty straightforward linear hierarchy: country, state, city, office. On the table we're partitioned on the country ID and sorted on a composite key, which is the state, city, and office ID. So the access patterns might be: give me everything in the United States - query the table where the partition key equals USA, and I get every office in the United States. I want every office in New York: country ID equals USA and the sort key starts with NY gives me everything in New York state; everything in New York City starts with NY#NYC, which gives me everything in New York City. It's a really nice way to take a linear hierarchy like this, slice it up into a composite sort key, and support multiple access patterns, multiple groupings, multiple aggregations - and I don't need multiple tables. If I tried to do this with multiple tables, think of the access I'd have to execute at the application layer: let me get the country ID; now let me go back and get the states in that country; okay, this state is New York, let me go back and get all the cities in New York; now let me go back and get all the offices in those cities. That's a lot of round trips, that's high latency, and it's a very expensive operation if you implement it as a relational pattern in NoSQL. Why? Because there are no joins in NoSQL - joins are expensive, and that's why relational databases don't scale. So when you hear people say that NoSQL is missing joins, you can say: you're missing the point.

All right, complex relational data. We got a little picture of that when we were looking at that configuration management service, but this is more of a theoretical delivery service. I'm going to create a fictional delivery service called Get Me That - it gets people things. People are busy, they need stuff: download Get Me That, browse stuff, get stuff, tell us where to put your stuff. That's what it does. It's a very simple service, but it's not a very simple entity relationship model. When we look at it, we have customers, vendors, orders, items, and drivers: customers place orders, vendors accept items from those orders, drivers deliver those items; drivers have a current status, a five-minute status, and a ten-minute status, because we want to schedule things efficiently. When we gathered the access patterns, we were talking about 10 or 12 different access patterns for this application: get customers by date, vendors by date, orders by customer by date and by vendor by date, order details, order item status, deliveries by driver, driver status for scheduling - a complex set of roughly ten or eleven access patterns we need to support. It's a pretty straightforward thing to do with a relational database - I'd just create the normalized view of the data and execute a bunch of queries across it - but again, the joins are going to be very expensive, especially as the data set scales. And I've really noticed a trend: I used to say that NoSQL is for OLTP at scale, and that if you're not dealing with big data then maybe you should be looking at other technologies. What I'm really finding these days is that the common app is becoming a big data app, so I really do believe that NoSQL is the future for the vast majority of workloads, simply because of the scale of the data; these relational models break when I try to execute those complex queries.
So the NoSQL approach here is a little bit of an eye chart, but we're going to walk through it. This is all of those entities stored on the same table. The first query I might get is, say, get me the customer's information. I would query the table by customer email, which is the partition key, and my sort key value would be "customer": give me the customer item for this email. That's a nice selective query into that customer's partition. Give me the customer's orders in the last hour, the last day, the last week: I'm just going to timestamp all of those orders and insert them into the customer's partition by email. Now when I query the customer's partition by email and give a date range as the sort key condition, I get a different set of items — not the customer's metadata, but the orders that the customer placed over the time period I specified. So again, I'm slicing the data out of these partitions to support the specific access patterns of the application. I want to get the vendor data — anyone from Austin? There you go, you should recognize these vendors, they're my favorite restaurants: if you're ever in Austin, go to Torchy's Tacos, and the Salt Lick is the best barbecue around Austin. Anyway, to get the vendor's data I just select by the vendor ID, and to get the driver data I select by the driver's email and take the driver item. In this particular example the drivers are drivers and the customers are customers, but nothing prevents me from using the same email for a driver and a customer — I would just have different metadata items in those partitions, and they would still support the same selective access patterns. Going even further, I want the driver's status at five-minute and ten-minute intervals; I can get that per driver just by saying select by driver where the sort key starts with GPS.

Then we get into the indexing, and the indexing is about overloading the keys. If you look at the key attributes on these items, the order items are going to be sorted on the indexes by email and order ID, and by vendor ID and date, across GSI 1 and GSI 2. Then as we get into the drivers, GSI 1 is totally different — it's not using the same key values, but it's overloaded using the same attribute names. We'll see how that shows up when we get to the indexes, but I'm indexing the driver's GPS coordinates by sector, because when a vendor's order comes in, he's in a particular sector, and I'm going to want to know which driver is going to be available currently, five minutes from now, or ten minutes from now — whenever that order should be done — so I can do my scheduling. And then I'm going to index the order items across those GSIs as well, by timestamp and customer and by timestamp and vendor. So when you look at the GSIs now it's kind of neat: I can query GSI 1 by order ID and it's going to give me all the items for the given order and the customer that ordered it. That's a nice query — I don't have to go back to the database several times; if I have an order, I can get all of the interesting metadata, all the details for that order, just by querying the GSI by order ID in a single round trip. This is what we want to do with NoSQL data modeling: we want single queries to deliver multiple entities.
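As a sketch of that single-round-trip query, assuming the same hypothetical table and overloaded GSI attributes from the earlier sketch (a table named GetMeThat, with gsi1_pk holding the order ID on both the order items and the customer's order record — all names are assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("GetMeThat")

# One query against the overloaded GSI returns heterogeneous items — the order,
# each item on the order, and the customer reference — in a single round trip;
# the application tells them apart by their sort key or a type attribute.
order_view = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("gsi1_pk").eq("order#1001"),
)["Items"]
```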
In this case the order, the items for the order, and the customer's information are delivered in one query. I can query by sector and get the drivers that are in a given sector and what their status is currently and in five-minute or ten-minute intervals, using the appropriate sort key conditions — current status, five minutes out, ten minutes out, by sector. It's a nice way to be able to solve the traveling salesman problem, if anyone's familiar with that one; that's a tough one, and this is actually a good way to do it. This came out of a customer design review — that was the problem they were trying to solve. If I go to GSI 2, I can query GSI 2 by vendor ID and date or by driver ID and date and get the orders by vendor by date or the orders by driver by date: what is the driver delivering, what did he deliver in the last hour, what is my vendor delivering, what has he delivered in the last hour. So these are the access patterns of the application, and this design gives me the ability to support all of those complex access patterns with the entire entity relationship model stored in a single table and only two GSIs. The other thing I hear a lot with DynamoDB is "you only get five GSIs, we can't use DynamoDB because we only get five GSIs" — I just showed you how to support 12 access patterns with just two GSIs and the table.

Now let's get into a real-world example which is a lot more complex: the Audible ebook sync service. How many people have a Kindle? OK, quite a few. For every ebook you buy, if you have a Prime membership, you have access to the Audible audiobook version, and you can play it on your Kindle device or on other devices. There are a lot of different relationships in this service that need to be supported. A given book can have multiple audio products associated with it, because it depends on the device it's being played on and the format of the audio files that need to be played, and an audio product can have many audio files — so there's a many-to-many mapping between ebooks, audio products, and audio files. And then the ACR info is the sync file information: if I pull into the driveway listening to a book on the Alexa in my car, and I walk in and start listening on my laptop or my Kindle, I want it to start at the same place. That ACR info file has a many-to-many relationship between itself and the audio products. So these guys, with a lot of downstream and upstream consumers, had 20 access patterns, and they were trying to figure out how to deal with all of this with five GSIs — select by ebook key, select by ASIN, all kinds of different access patterns against ACR info and ACRs. They were having a terrible time trying to map the data into a single-table implementation and were thinking about going down the relational approach. When we came in, we gave them a pretty simple table structure. As they insert the audiobooks into the table, one of the things they were interested in was the audit trail — when the abook ACR file changes, they want to know who changed it and why — so we gave them the v0 design pattern to implement, where the current item is always version zero on the abook ACR.
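Here is a minimal sketch of that version-zero pattern, assuming a hypothetical table and key layout (PK/SK plus a latest_version attribute); the real service's attribute names weren't shown, and in practice you might wrap the two writes in a TransactWriteItems call so the history and the v0 copy can't drift apart.

```python
import boto3

table = boto3.resource("dynamodb").Table("EbookSync")

def put_acr_revision(acr_id, payload, changed_by, version):
    # Immutable history item: one item per revision, recording who changed it and what changed.
    table.put_item(Item={
        "PK": f"ABOOK_ACR#{acr_id}", "SK": f"v{version}",
        "changed_by": changed_by, **payload,
    })
    # Current item: v0 is overwritten on every change, so readers always fetch
    # the latest state with a single, predictable key.
    table.put_item(Item={
        "PK": f"ABOOK_ACR#{acr_id}", "SK": "v0",
        "latest_version": version, "changed_by": changed_by, **payload,
    })
```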
We have two partitions on the table, an abook partition and an ebook partition, and then there's an item we insert into the abook ACR partition that associates the abook ACR with the ebook. Once we have all this data laid out, we create the GSIs — there are three GSIs on this table — and if you notice, GSI 1, GSI 2, and GSI 3 don't always contain the same data types. That's because, again, there are so many different access patterns, and what ends up happening on the GSIs is that they look like nonsense: just a bunch of items sorted by all kinds of arbitrary dimensions, and for one particular GSI, who knows why it's laid out that way. I might query by abook ASIN and abook SKU across multiple GSIs, pulling different items out of each; what ends up happening is that the sort key condition the consumer queries with is what defines what comes back off the GSI. What we ended up doing was taking their 20 access patterns and extending the table, so to speak. Again, it's a big eye chart, but it's not terribly difficult: the first three columns are what they gave us, and then we said, OK, you query this table or this index with this sort key condition and these filter conditions, and that will satisfy your access pattern. So for these guys we did three indexes, one table, and 20 different access patterns satisfied. I've done single-table designs for applications that need to satisfy up to 30 or 40 different access patterns with extremely complex ERDs. As a matter of fact, we have a really good chunk of information up on the website for you — the best practices guide for Amazon DynamoDB was updated about six months ago with brand new content, and it has a really complex schema in there, 27 tables and 30 different access patterns, that shows you how to map the whole thing into a single table. I would definitely recommend people take a look at that; it covers extended design patterns, a lot more than what I've talked about today.

I'm running out of time, but I have one more thing to talk about and it will just take a minute: the serverless paradigm. I like this quote because, as Linus Torvalds told us, it was cheap home computing that changed his life — well, think about cheap data center infrastructure, because that's what serverless is. This is a really good example of an application we built for Amazon's CTO — I'm sorry, for Amazon's SA organization. They wanted to be able to get customer feedback at any time, so if you get an email from an Amazon SA, it's going to have a link in the signature that says "rate my interaction." When you click that link, you're actually interacting with this application: it pulls down an HTML form from a secure S3 bucket, and when you hit the post button on that form it goes to API Gateway, and API Gateway calls a Lambda function to process the data. When we wrote the application we didn't have encryption at rest, so we actually had to push the personally identifiable information up into an encrypted S3 bucket — I'm sure they've since changed the application to store it all on the DynamoDB table — and the unencrypted, searchable metadata was stored in DynamoDB originally. Then we would email the manager and let them know that feedback had come in. It's a really neat application that we built very quickly — it took us about a day or two to design and deploy — and it runs for pennies a month in support cost.
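Here is a minimal sketch of what that Lambda handler might look like behind an API Gateway proxy integration; the bucket name, table name, and form field names are all assumptions for illustration, not the actual service's.

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("SAFeedback")

def handler(event, context):
    form = json.loads(event["body"])  # form POST forwarded by API Gateway
    feedback_id = str(uuid.uuid4())

    # Personally identifiable information goes to the encrypted S3 bucket.
    s3.put_object(
        Bucket="sa-feedback-pii",
        Key=f"{feedback_id}.json",
        Body=json.dumps({"email": form.get("email"), "comments": form.get("comments")}),
        ServerSideEncryption="aws:kms",
    )
    # Only searchable, non-sensitive metadata lands in DynamoDB.
    table.put_item(Item={
        "PK": f"SA#{form.get('sa_alias')}",
        "SK": f"FEEDBACK#{feedback_id}",
        "rating": form.get("rating"),
    })
    return {"statusCode": 200, "body": json.dumps({"id": feedback_id})}
```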
And the nice thing about this application is that it could scale to a million users if it had to — all I need to do is turn auto scaling on for that DynamoDB table and it goes. That's a really neat aspect of serverless: you can really get things out there, you can deploy cheaply, and that code just sits there until people actually need it. You can prove the application before you pay for it — it's not just fail fast, it's fail cheap now. This is the cheapest data center infrastructure you're ever going to get your hands on for launching new application services, so I definitely recommend you explore the serverless framework. So, conclusions: NoSQL is not non-relational — don't use that word, it's a bad description. The ERD still matters. The relational database is not deprecated, but we want to use NoSQL for OLTP or DSS at scale — that's the sweet spot — and use the RDBMS for OLAP, or for OLTP when scale is not so important. Generally speaking, the common case is the big data case today. So thank you very much, that's all I have for you today. [Applause]
Info
Channel: Amazon Web Services
Views: 293,471
Rating: 4.9579744 out of 5
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Databases, DAT401
Id: HaEPXoXVf2k
Length: 59min 56sec (3596 seconds)
Published: Wed Nov 28 2018