Fundamentals of Amazon DynamoDB Single Table Design with Rick Houlihan

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] hey everybody uh thanks for taking the time to join again for another session of office hours uh here for dynamodb i'm rick lehan as most of you probably know already i do a lot of dynamodb stuff today we're going to get into you know really deep into the data modeling aspect of things i've had a lot of people you know say hey your sessions are really good we're going through all kinds of stuff but all of this of course as usual it's always too much right well not always but sometimes uh you know i go through things pretty quick and we touch a lot of subjects and so you know today i really want to focus on that single table design uh you know one of the benefits from it i've had a couple of twitter exchanges recently with people that you have asked me you know when are the benefits of a single table and how do i i get so much pushback i think somebody said uh you know on single table design you know how do i how do i win the arguments that people you know put in my way and again a lot of this has to do with people's you know just understanding of data modeling as a best practice in relational databases and you know how we normalize data and whatnot and and but it comes down to value to the business and you know we spent a lot of time at amazon you know building out these design patterns because we saw systems at scale not really operating the way we expected uh them to operate as far as cost efficiency uh you know for the workloads that we were deploying you know and and so you know our customers that we're now embracing these concepts are seeing the same things you know one of the things i just recently uh talked with the product manager from a large uh you know online gaming company they have a pretty popular title i i can't mention the company but you know they had mentioned to me that they were you know getting rid of dynamodb because of the cost and complexity and i said well geez that's that's really you know not good to hear and why is that we talked a little bit about their environments and one of their biggest problems was they were setting up their sandbox environments for the developers he said you know we have 17 tables and each one of these tables has two or three gsis and you know just setting up that environment for our developers is painful you know it takes 45 minutes to an hour to deploy those tables and sometimes you know even longer and it's like my gosh you know that's got to be you know what are you guys running we talked about their data models it really came down to the fact they were using this multi-table you know data model you know for a lot of reasons they had kind of broke this thing up into entity tables and it wasn't really because you know uh this is just the way they did things but it really came down to the way the applications used the data uh and and that large documents to maintain these hierarchical data structures just wasn't working for them right there was a lot of uh you know a conversation about um you know how much the cost of of uh the queries that they were making were we're running them to go and and pull out big large hierarchical documents on a per customer basis and the types of access patterns right now their users you know a lot of things revolve in a video game and online multi-user massively multi-user uh you know online role-playing game you end up with things like inventory of objects and a lot of different interesting access patterns occur you know sometimes you need to touch all the inventories of all your customers and all your all your players right i mean i need to give them credit objects and things like this and if you think about inventory objects and how they get added to the players they don't get added all at the same time they kind of accumulate inventory over time and you know so if you start to create these big large documents to kind of accommodate these relational structures it breaks down at scale because you start to read and write big huge chunks of data every time you're doing anything to alter any aspect of a customers or player's inventory so this was their this customer's experience was that they were you know their users were you know driving a workload that they couldn't support effectively unless they split the data up across multiple tables and start working with things as multiple items little items and small items of objects and entities and things and and this caused their access patterns to fan out across multiple tables so in the end we went through the whole single table design how to bring all these entities into a single container use indexes to replace those joins and and i actually got off the phone with the customer the other day it's been a while since i talked to the product manager but they said they ran the poc they still haven't done the migration but they're estimating a 75 cost reduction right to use the single table design over the multi-table design so today what we're going to do is break down you know what it takes to actually do a single table design you know using a set of content that was built around a schema that we introduced you know several weeks ago when we did the introduction to nosql workbench and you know one of our specialist essays out there in emea i decided to take that content and kind of you know elevate it up to the next level and and deliver a real deep dive discussion around how do you you know layer this model together it's going to demonstrate a lot of of of different design patterns but it's also going to demonstrate a design methodology that you know kind of contradicts the common belief that single table designs require a huge upfront design burden right that is you add access patterns that you have to redesign everything because you know really it it's not necessarily true that that is the case and and what we'll see is a kind of layering of patterns as we build out the richness of this model and so let's kind of get into discussing you know what this thing is going to look like uh here i'm going to go ahead and pop this use case that we're going to talk about again it's going to be you know kind of really model the round engagement i might have with the customer when i walk in we're going to talk about their use case uh what is the nature of their data you know what are the entities the relationships between those models you know we're going to list out the access patterns that we need to do and then we're going to we're going to walk through the whole process of modeling this in nosql workbench as we do every week but this week we're going to do it a little differently you're not going to watch me stumble around on the keyboard i'm going to use uh simon a's content here and that's going to be by the way that was the person who developed the content her name is in the in the data model we'll mention her multiple times but uh thank you simone utter for putting this together because this is going to be a much more seamless presentation of a single table modeling than watch me stumble around the keyboard and so anyways this in that use case discovery phase what we're really going to be doing is looking at uh you know talking to the customer about what is their application what are they doing and so essentially if there's an online shop scenario this is the use case right and and we're kind of highlighting our entities in uh in yellow or orange here whatever you want to have depending on how colorblind you are uh the or amber as we kind of talk to the customer and this is what we're going to do we're really kind of writing these not really not writing these things down necessarily but noting them right customers products right customers come to an online shop they browse through the products that's a you know first use case you know i'm going to get a customer object and i'm going to be you know putting some session state into and you know maybe there's some product catalog that i need to browse through they're going to place orders for some of those products there's another entity we're going to you're going to invoice for those orders right when i check out there's some sort of invoice they're going to pay using some discount code or gift card potentially or maybe credit card or some other you know mechanism for payment so there we've got invoices and payments uh we've you know purchased products then get picked from a warehouse or one or more so we have you know warehouses and and shipments that get shipped from those warehouses to those addresses that are provided by customers those addresses are associated to those customers right so the top level entities just from this short discussion with the customer come down to customers products orders invoices payments warehouses and shipments okay and so if you look at that entity relationship model as we kind of discussed with the customer we're going to drive into okay let's let's go to the whiteboard let's go ahead and box these things out here's all of our entities now how are these entities associated to you well customers obviously can have many orders right uh orders can contain you know many products products can you know exist on many orders so there's yeah and you could just imagine how you would have this discussion to try it it's a typical discussion this is not unfamiliar to any engineer who's kind of gone through any kind of a relational design you need to understand your data entities you need to understand the relationships between these objects but the big difference that we're going to kind of go through here when we go through the single table model is you know how we're going to arrange that data right we've got in a relational model i've got you know all these objects these entities i'm going to go ahead and associate and define tables for these entities we're going to you know then write queries they're going to represent our access patterns i'm going to execute these queries against these tables we could join the data we need essentially i'm going to define you know items or or rows any all these tables that define these various entities and then queries that that you know traverse the relationships so to speak of those entities in our you know no sql single table design what we're really doing is we're taking a single table we're loading it with all of those rows right these are just items on the database they're all the rows of your your table we're going to decorate these items with attributes we're going to have this unique item this unique attribute to identify those items it's going to be some sort of an id and we call that the partition key in dynamodb and you know then we're going to go ahead and decorate and then we're going to go ahead and uh you know be able to query that those items by partition key and this would be if i define a simple table in dynamodb but really the power of nosql data database is to be able to group these items into you know relevant collections and in nosql databases like dynamodb that first collection is really the table and you can define kind of sub groupings of items in those tables which are associated by partition key and uniquely identified by their sort key and now this one that many relationships uh other types of relational structures we're going to build into this uh can be represented right on the table and so as we start to define this application that we kind of went through we're going to start to build some set of access patterns right into the table but as you can kind of see and we talked about earlier customers have more complex access patterns than you know get me all the inventory for a given player right it's uh oh you know get me all the players that have this inventory item because i need to update some you know attribute of that or you know hey guess what i need to add this item to every player's inventory because we had a server outage last night uh you know i want to give them a credit uh you know so you know there's different patterns uh that that i might need to access against the data and i might need then to go ahead and index these objects in different dimensions and that's really what we're going to do we're going to go ahead and decorate these objects with a bunch of attributes we're going to index them with attributes that are common across those objects and then we're going to group them in meaningful ways either on the table or in my indexes and we're going to query those groupings of objects so to speak with range conditions to produce the relevant results we're looking for right so i could have a table that's full of players you know these players might have a bunch of objects that represent inventory objects those inventory objects might be decorated you know with a type attribute and now i can start to query that type attribute for all the players with a given and to get and return all the players that have an object of a given type or you know whatever um or item id so to speak um all right so let's go ahead and see what that looks like from the customer's perspective really we have a set of access patterns that we have defined uh and in that model that we just looked at our entity model you know we have again customers products you know we have warehouses we have orders orders have payments and invoices orders have products that are on the order you know shipments then have products that are in the shipments and and so we've broken this down in these 16 access patterns and we can look at all of these is what we're going to end up doing is walking through each one of these individually so i'm not going to go ahead and go through each one right now but the fundamental exercise is going to give us a result that's going to tell us what table gsi lsi whatever it is that we need to query to get a given result set that we're looking for for a given access pattern what are those key conditions what do those filter expressions need to look like to be able to produce the result that the application is actually looking for okay and so the first access pattern that we have is going to be you know actually the first thing we're going to do is build this model in nosql workbench and see what that actually looks like so let's go ahead and do that i'm going to go ahead and switch over to my nosql workbench tool and i've introduced this tool many times we use it every week uh please you know download their this tool into their design process because it really gives you a way to visualize you know the the data on the table and and not only on the table but how it lays out on the indexes as well right because what we're doing again is regrouping the data and so being able to see how the data gets grouped you know what those what the data looks like as we group it up is really important so the first thing we're going to do is go ahead and i'm just actually going to open up you know the second step here which gives me some entity data already on the table and you know what we always do you'll see me stumble from week to week to create these generic tables and what i just demonstrate is one of the nice features of the tool is to be able to actually import a data model that is uh and also export that data model to a json structure which you can then share across your team you can check into source control and you know then just go ahead and import that into the tool this is a partial data model that's going to support that first access pattern you know that we looked at and that was you know basically get customer by customer id okay so if we look at the data that's on the table right now and that and we can look at that particular customer item in this case now this customer has a customer id c hash whatever it is one two three four five in this case uh we're duplicating that customer id into the sort key don't necessarily have to duplicate the id we could use you know anything there but what we really want is you know we want a sort key on the table because i might have more than one item in the customer's partition in this particular case right now it's the first item we're sitting on the table uh you know over time i might decide customers have you know different types of relationships and they i can actually represent the data that that you know uh configures or or informs me of those relationships as items that are either going to live inside of this customer's partition and that's why i want this sort key so that i can uniquely identify additional items that we're going to spread into the customer's partition in this model not sure if we actually do put anything into the customer's partition but there are other entities that we do and we do have the access pattern that says get me customer by customer id now some systems might not necessarily this unique id might be something that's more meaningful maybe i might use an email as a customer id but you know in this particular example let's assume that you know we're interfacing with upstream and downstream consumers that don't necessarily recognize email there's a predefined you know id format that we have to support this happens all the time in the enterprise environment right it might be more efficient to use something that's more meaningful the workflow as an id but i can't i always have to look up an email when i'm by getting the id whatever in this particular example that's what we're doing we've got look customer by id and this particular item describes that customer and there's salmone and that's not her real email so don't bother sending there all right so that's great and let's go ahead and see what that looks like from a query perspective if i go ahead and configure that particular access pattern and i go query the table i could query with these key conditions to retrieve only that customer object now if you if the oops the astute user might notice that i could uh query by pr i'm sorry too many i could query by customer id i don't need to add the sort key because there's only one item on the table there's only one item in the customer partition and so yeah that is true in this particular case but again maybe i add new items onto the table at some point i don't want this access pattern to necessarily pick those items up maybe i do in which case i wouldn't add the sort key at all it gives me every item that's in that customer's partition in this case i'm looking for a single item and so i would add the sort key that is you know going to uniquely identify the item within the customer's partition and that's the access pattern that we've implemented okay so the next pattern we're going to look at is going to be you know again get product we're going to look at get warehouse i'm going to go ahead and add those into the model here when we do it do that but those queries are relatively straightforward they're very simple to the to the tool let's go ahead and add those into the workbench and we can go ahead and override our existing model and you can see what's happening now if we go back to the vid to the data model i've got new i've got new attributes right when i add the customer i have the entity type i always like to specify the type on every object i sit on the table it's a nice way to be able to identify what items you're looking at when you make a query down to the database and a whole bunch of objects come back and you know one way to be able to you know populate your application layer objects is to iterate through the result set and switch on the object type and then that way you know what what to wrap it in when you hand it off to you know the consumer of the api that you're standing up right so in customer we have an entity type we have the email we have the name in the product we have the detail we have the price and we also have the entity type and when we visualize the model now we can see we have two types of objects on the table we have the customer object we have the product object and my access pattern holds true for both select by customer id select by product id and again as we start to decorate the table we may or may not insert more items into these partitions as a matter of fact we probably are in the product partition because products live across many warehouses right there's a relationship there so we're going to start to traverse these relationships and build this out but what you can see here is you know if i have the simplest of simple applications which is you know customer database let me go get customer information and then i add products into that you know and then i add you know other items into that i can i can kind of layer these access patterns i don't have to know every single pattern when i start designing the table i can actually design you know a data structure for the items that are associated to the access patterns as i build the table all right so if we go ahead and we import the next item here this would be the warehouse item right now we have some additional attributes that we brought in right we've got an address you know and we have a detail right for the particular warehouse and if we visualize the data model we can see now that we have the warehouse items that have this address this entity type we're going to start to push more data in there and again when we go back and we look for the access pattern that we're trying to support here that's the same type of query right so three types entities customer product warehouse we're going to access them exactly the same way with the same query conditions and and each one of these query conditions uh key conditions delivers a single item right because we're specifying both the warehouse the warehouse id in both the partition key and the sort key now if i apply some range conditions in here and i add some new items in there i could get a whole different set of results i could get many items and we're going to actually we're going to start showing you that in a second here the next access pattern that we're going to be looking at is getting all the inventory uh you know get the get a product inventory for all warehouses by product id okay so this is where we start to introduce a relational you know pattern into the data if we go back and look at the workbench let's go ahead and import i believe this is model five and we can go ahead and overwrite that uh is that yep so you can see we actually added a warehouse item in here for the product partition and this is where now things start to get interesting if we look at the visualizer we added some new things we added a quantity and we added a price and so now we can go back to our data model we can look at we have this uh oh we didn't add the price price was actually on the product already but we added the quantity attribute for the warehouse item and this is telling me that in this warehouse we have you know 50 of this item all right so if i want to go ahead and say hey what warehouses contain an item i can go ahead and query this table by product id and where starts with w again because my sort key up here starts with p which is a product id all warehouse items are going to start with a warehouse id because this is the warehouse in which they exist and so uh this access pattern that we just described which is you know get me the let's go back to our session here uh we can say you know get me a product inventory for all warehouses by a product id again query the table where the pk equals product id sort key begins with w hash right so this is a again this is a great way to be able to get a subset of items out of a particular partition subset in this case a subset of items by type i know that all items inside of the product partition that you know our warehouse items which you know tell me what the inventory of my product is across all warehouses they'll start with w hash so let's go ahead and do that let's give a query that begins with w hash and by product id so there we go we've that could be a totally new access pattern maybe i never had an inventory i never had warehouses i never had anything i had a customer database i have product database and you know this could be the beginning of my online store and so i put all my customers and products into the same table and later on i say wow now i gotta ship products okay well i got store products okay i got warehouses i got you know warehouse inventory and you can kind of imagine how i'm kind of building the application i can be adding access patterns and i'm not redesigning anything as you do this this is just the evolution of of your business right so um next couple access patterns we're going to add are going to be about orders uh you know orders have invoices orders have shipments they have payments uh you know so this is where things start to get a little bit more complex right we have multiple types of entities multiple relationships across entities and i have multiple access patterns for all of these entities within you know these you know within this hierarchy right it's not just getting all the invoices for a payment but i mean i might say you know get me all the invoices and as you can see down below we have others get me all the invoices uh for the invoice id get me all the invoices for a given order id you know there's to multiple dimensions you know there's there's lots of access patterns that we're going to be executing across these entities so it does help us to kind of spread these entities out into their own little items and start to index these things in interesting ways and so let's go back to our table and let's import the next version of our model was that six yeah so we added a couple more products here we've added some more warehouses here in in version six this is really just we've added some more customers we've made the data look a little bit more you know realistic and i i know some people do this in their modelers some people you know in this particular case it's great because it helps us demonstrate things i find when i'm demonstrating when i build my models they're simple i don't really put multiple instances of things unless i absolutely need to the but this is great for demonstration purposes to show these full-blown models let's go ahead and import uh i think is here's where we're going to get into orders where do we get into orders why am i not seeing orders yes there's an order so now here's an order partition orders start with o and they are ordered by a customer right so that's the first thing we're going to do is we're going to say orders now have a partition key of o hash they have a cust they have a customer id as the sort key and then they order products right so that's the simplest form of an order let's go ahead and you know create an order you can almost look at this like a shopping cart when you first come in you're going to create this cart partition you're going to start to you know first when you create the card partition it's going to have some you know top level item that describes who created the cart and then you know what items have been added to the cart and you know if you think about this workflow it's actually important because you know when even when you build a shopping cart customers don't build a shopping cart by selecting 10 items and saying add to cart right they they build the shopping cart by adding one item at a time and this is one of the things that some of our service teams noted some of them didn't have a problem with that because the eye of the carts they were creating were all relatively small but and the buckets they were creating the customers were creating were were small but other services had problems with that because the items that are being added contain metadata that you know wasn't large but was relatively significant the number of items being added to these carts they started to exceed one wcu so you know a lot of their customers and on average they were exceeding you know one wcu for their rights because the more items you add the bigger the object gets if you put it all in one object so sometimes in this case it makes it for you know multiple reasons to just add these product items to an individual cart or to an individual order because not only do i want to be able to get all the products for an order but i might want to get all the orders that contained a given product and it's a lot easier to do that if i can index the individual items within you know these partitions right and one of the things about dynamodb is it doesn't you can't nest uh index nested properties of json objects you can't index arrays and honestly it's not very efficient to do those things and so we should try to avoid building big complex documents for many reasons uh you know but indexing deeply nested attributes and arrays is not necessarily something that is an efficient way to manage uh you know your data uh you do kind of if applications use the data in bits you should probably model the data in bits right and most of your applications are going to use your data in chunks they're not going to use big giant you know documents right all right so let's go back and say so now we're going to go ahead and add a few more things into the data model here let's see where are we yes okay so now invoices and payments get added to the system as well right so uh here we have uh products that are being ordered uh we have invoices that are being made uh invoices might end up being decorated with some sort of payment because if you look at my workflow i don't have anything that says select by payment id right uh most of my workflows around payments or get all the payments for a given invoice and we don't have any other i don't think we have any other access for payments so as we add payments to our orders they're going to actually get edit they're going to get entered on the invoice item right and we'll see that as we start to extend the data model as a matter of fact let me load one more right so now we have shipments shipments are coming in and we've got the um pretty complex partition space here now right so we have uh the order we have customer object within the order we have invoices we have products that have been ordered and we have shipments that are being made against the uh uh against the order right so all these things are happening but whatever what else happens on shipments right and we have shipment items also come into shipments so let's load that in there did we get that i keep forgetting which number i'm on i don't think i got my shipment items in that one yes so here we have in this particular one now let me go back to our i've gotten way ahead of myself and i apologize let's go back to our powerpoint [Music] wrong transition so if we look at the queries what happened right so if we look at the queries that we're running here we've got multiple you know queries that are executing in and what we want to do is execute the query to get all the orders of these objects within all these objects within the order right so the data structure and i'll go back to the data structure we can look at that i apologize this is my first time through this content simone actually does a much better job of getting through this section i should have done better at this transition but the object the the the object here is to be able to get all the the elements of a given order right so if we have a uh you know these access patterns against the order get the order for by order id we can query the table with pk equals order id it brings back all of that data right the the customer object who ordered it the invoice that was sent the products that were shipped or the products that were ordered the shipments that were made uh you know it's gonna have the uh you know that's basically the uh the pattern that we're making here against order against all those particular items in the order partition uh so these are the queries that we would execute to get by order id obviously query by order of id to get the invoices it's again begins with queries that are going to give us subsets of those items within those particular objects and some of our patterns are not going to be interested in the entire collection of objects right some of our patterns are only going to be interested in subsets of those objects and you know again depending on how much data lives in these objects we could start to exceed the the single rcu that you know or you know some of these objects may be big objects that live inside of here we don't necessarily want to see all of that data so it's better for us to put to be more selective when we pull the data in and out of these partitions because some of these patterns don't necessarily need it and that was one of the things the biggest lessons learned as we went through this type of process of modeling all the data into a single table at amazon was that it's not so you know that you know every single pattern doesn't need all of the data and by doing this it's driving a lot of inefficiency in the system uh and so we wanted to get the data spread out but then when we spread the data out into multiple tables then the queries across those multiple tables to try and traverse the relational hierarchies become extremely expensive so you know again replacing these uh these joins with partitions right partitions that live either on the table you know or on the index right so uh you know so this is a quick look at the patterns that we have on the table right if we talk about when you actually structure a table you're going to uh go ahead and build some set of patterns that live on the table uh and then there's going to be some set of patterns that are going to require us to build indexes for uh but this is really kind of the the the first set of patterns that we're going to build are the ones that actually live on the table and so what we're doing is creating our our core partitions in this case we have customer partitions product partitions warehouse partitions and order partitions okay and those partitions create you know contain multiple items the customer partitions and the warehouse partitions they only contain one item the item that describes the customer the item that describes the warehouse you know the order partition contains many items right the products that were ordered the customer that ordered it the you know the invoices that were sent the shipments that were made the product partitions contain multiple items they contain the item that describes the product and we're going to use that when the customer builds the shopping cart they contain you know inventory items where does it live across my network of warehouses you know i don't necessarily want to see all of that data every time i look up a product so again always reasons why we want to push this data out into you know individual items individual objects certainly we want to start to look at where does it make sense to push it together right we don't necessarily always want to get too religious about you know maintaining all these individual objects sometimes if i'm always querying the stuff like we said when we went payments payments don't necessarily be their own items because there's no access pattern for get me payments by payment id uh there is really no payment id there's only payment details and so that's where we start to you know here's the invoice the invoice was sent the payment is the details in which how it was paid and that all happens you know when i check out so there's no real need to you know have payments be their own you know uh process now i may in the future require to you know hey get me all the invoices that had a gift card and so in that particular case there might be the need in this case in that case there might be the need to go ahead and remediate some set some chunk of this data to be able to introduce the new you know payment items that's certainly a process that occurs every now and then there's it's not so different than the altered table process in a relational database when you add a new column that doesn't have a default value right maybe i need to execute some sort of stored procedure you know add the column execute a stored procedure generate the the the value that's supposed to exist in these tables and you know call it done we're going to declare some maintenance window we're going to do this process this is no different than what you're going to see in a relational database so but in this particular workflow we've decided the payments don't necessarily need to live in their own items because they're never looked up individually right so as we go forward in the model now we're going to actually start looking at how are we going to get the data that is a uh is indexed right and so that's what we're going to go back and actually show you here some of the gsi work that we're going to end up doing to let me go ahead and get to the right data model here before i go back that was so this is it i believe sorry if i get caught up to where i am in the presentation with my data model before i flip back and show you what's going on here so let's go ahead and go back we can take a look now at our order partition if we look at the design of the table now it's much more complex right we've added uh you know many many attributes the ones that we've just recently added are these gsi attributes right because in our access pattern and let's go take a quick look back at that the access patterns we're looking at are saying things like okay get all the orders for a given product id even during a given date range i i don't have a partition on the table it's going to give me that i have things like select customer by id select product by id but i don't have you know get me all the products that the customer ordered right and get all the products in a given order and that's how we've kind of set up the table so in order to be able to produce a different or alternate you know grouping of these objects to be that support my access pattern i'm going to have to create an index and so what we're going to do there and let's go back to our data model is we're going to go ahead and add these attributes this gsi1 pk gsi1 sk these attributes are going to be our index attributes uh all these other attributes exist across our entity objects they define our entity objects and uh you know when we look at the actual table we also have the table we have this new gsi the gsi is going to be you know partitioned on those gsi pk gsisk right so let's go ahead and look at our model and see what ends up happening now when i look at the objects now so think about what our access pattern was our access pattern was get me all the products that were ordered in a given date range right so you know the last 30 days could be you know whatever during this week so if i decorate the product items that exist in all of the orders with these attributes gsi1pk gsi1sk and i add the product id right so that that becomes the partition the new partition key of my gsi and i add the date of the order right to these individual product items then when i look at gsi1 and i query gsi1 by product id guess what i get a date range list of product orders of orders for that product and i can go ahead and then go back and retrieve those orders if i need to if i need to project some order information onto these items so that i don't want to go round trip back i can use index projections to maybe project some order details on to all of the items on the order because i know when i go to query this you know access pattern that by product id i know that i want to get some top level information about the order maybe you know who the customer was for example so i might decorate these things with a customer id or customer information customer contact information maybe it's my marketing team saying who ordered the product in the last 30 days i'm gonna go send them you know a follow-up you know discount email so you know to go get more right here's 10 off if you order in the next seven days or something like that right so again you know depending on the patterns that you have in your application you know we can structure the data to be able to access in multiple ways and one way is get me all the products for a given order and that's your typical hey this order needs to be processed okay let me go get all the products for this order let me go figure out where they all live i'm going to go query the you know system for you know warehouse items to look for where does the where do these things live what warehouses okay where does it make sense to fulfill this order from let me go sit let me go ahead and send shipments okay and all and so on and so forth so the as you can see we can build an extremely complex application here with many many access patterns so again get all the orders to a given date range query gsi1 by product id you know where the sort key is between you know date one and day two uh we can now look at okay let's go to our next set of access patterns get invoices for a given invoice id get all payments for a given invoice id right so since payments are part of the actual invoice item these two access patterns are generally speaking they're the kind of the same we decided that payments don't necessarily need to live you know as their own items we could certainly break those out into their own items in which case instead of just having a single item for an invoice inside of a partition right here we have invoice five four four you know i could also say okay let me add a payment item which could be you know hash p1 p2 p3 and then have many items on the end of this uh that would describe payments index those individually but since you know for our particular access pattern you know what we're actually looking for is just get me the invoices by id really for both these access patterns let's get invoice by id because the invoices contain the payments so let's go ahead and go back to our workbench see what that data is going to look like i believe that is number 11 hopefully yes so as you can see we added this detail attribute onto the invoice so when a customer basically goes to check out they create an invoice when they pay they add this detail here's the payments some of this payment was a gift card some of it was mastercard you know whatever you can put those payment details in there you know if you want there could be maybe a background process that runs on these invoice items whenever i make a payment you can have some top level summary items on the table hey how much gift card you know redemption was there in this time period you could you know pick these objects up from streams and lambda and you could create new summary items on the table that say okay you know these items were paid distribution of x right by for top level products right so you could then create a whole new set of access patterns off data doesn't even exist yet right using a change data capture process that allows you to add new items into these partitions and query them in interesting ways right um so let's see let's go so does it get so we go back to gsi one now we can see if we query gsi1 by product id we get a list of product uh orders by product by date range for a given product if we query by invoice id what do we get we get the actual invoice itself and all the payments that were made against that invoice right so that's uh that satisfies those two access patterns that we just looked at in our excel here as we go through we query gsi1 by invoice id we get the information basically again the same query for both this is really one access pattern it's kind of split against two but you know in one case we're going to get an invoice maybe there hasn't been any payments and i need to add the payments so let me get the invoice and the other is going to say you know there's an accounting process that's more interested in seeing the payment data all right so next is to go get the shipment detail for a given shipment go get shipments for a given warehouse you know what was the information that was put in on these shipments let me go ahead and import one more data model go back to workbench and so now we can start to look at these shipments okay shipments as shipments are created shipments have products okay so we can now start to look at the relationship between uh you know when we go into an online into the order right i can get all the products that were ordered and all the products that were shipped uh so this is two different two different workflows that happen shipments can occur from multiple warehouses to fulfill the order so you know as we start to you know look at order status we can you know go get all the items from the order that start with you know ship uh or sh that'll give me all the shipments and all the shipment items i can iterate through that result set and determine you know where things came from uh you know status of individual shipments uh you know update individual shipment status again another example of a workflow where you know maybe i don't necessarily want to go ahead and you know read the whole order to be able to update shipment status maybe in this case what i can do is i can query uh you know the gsi1 i can query gsi1 by shipment id and i can get all of the individual shipment items for a given shipment right and get the description of a given shipment the top level item i can update its status i have individual items i can update any status it needs to be updated on those individual items um yeah because i have the top level order items that are showing me the uh you know what what was actually shipped and that can be updated individually now what we've done now is we also added a second gsi right because we have another access pattern that says you know give me all the shipments for a given warehouse and if you look at the table what ended up happening here is that shipments kind of ran out of gsi space right we were indexing everything just fine gsi1 gsi1 eventually we had more access patterns than you know two types of then two right you know i had to access a shipment in more than two ways i accessed it by order id i accessed it by uh shipment id uh i also need to access it by warehouse id so this is what ends up happening on your tables as we start to add new entities we're going to start to potentially and new access patterns we might potentially need to introdu introduce new gsis and sometimes when we do this again this might be a situation where these attributes don't necessarily exist yet on these items but maybe they do and you know we might need to do some modification of the data uh you know i think if you've watched my office hours for the last couple weeks you're kind of getting the idea that it doesn't take very long to scan a dynamodb table i mean i i work with big big chunks of data granted it's millions of rows and sometimes we're talking about billions of items but you know even billions of items can be churned through pretty quick if you need to do some you know data modification to index things in in new ways right so and we can always add new uh indexes to support additional access patterns on existing objects just like this so we added that gsi2 now when we query gsi2 by warehouse id we're going to be see the shipments for a given warehouse now i i might say that maybe the use case for this might be get me all the shipments for a given warehouse and and i might say okay that might be interesting and but i would expect this data to have some kind of life cycle to it right it's not going to live on the table forever because that query is just going to get bigger and bigger and bigger over time right there's no way to you know to limit the result set from this particular query because i'm saying give me everything for this warehouse is really what i'm saying and if the shipments for that warehouse grow over time and they're never cleaned off at the table that could become problematic and so maybe this is like a daily process where i say hey get me all the shipments for this warehouse and then i'm going to go through them all and say okay clear clear clear clear and and wipe them off the table or something like that and leave the shipments that are still in process or something but or i want to see a ttl you know so like maybe after 30 days you know the data just goes off the table because otherwise that that queries can get bigger and bigger the alternative here is to put a date range condition inside of that gsi2 sk if we put a date range condition on those shipment items then when i query this i could give it say hey give me everything in the last 30 days you know give me everything in the last 24 hours and that's going to be maybe a more meaningful query right so that that'd be you know one potential improvement uh you know on the data model so you know again let's go back take a quick look at what those queries might look like for those access patterns uh and again it's query gsi1 you know pk equals shipment id oops pk equals shipment id uh sk equal shipment id that's going to give you the shipment detail uh give me the shipments for a given warehouse yeah i might want to be able to get like we said when we query gsi1 and i query by shipment id i might want to see more than just a shipment id shipment detail i might want to see the products that were shipped on it as well in which case uh that pattern might say look a little bit more like uh you know where pk equals shipment id instead of adding that sort key condition right because if i add the store key condition i'm only going to get a single item now that could be your access pattern might say hey i want to get shipments by shipment id and i don't want to see the products i might have one pattern that says that i might have another pattern that says you know i need to see the whole shipment and it could very well be that the product items carry more metadata than you want to return when i'm accessing the shipment details so you might have two different access patterns where hey get me the details starts with p you know if you look back at our data model here let's go flip back to that right if i look at you know select gsi by shipment idea starts with p hash gives me all the products starts with sh hash gives me the uh just the ship and item so again in this particular example these items are small you're not seeing a lot of metadata it probably doesn't matter but you know in in the real world these items get big right so you know have all kinds of information about you know uh that you don't necessarily want to return right so let's be let's make sure that when we look at the data and we build our data models that we're really only working with the data and that's probably the biggest lesson here that you're really only working with the data that you want to use you know at any given time all right let's go back to our next set of patterns here which is get inventory for all products for a given warehouse uh get all invoices for a given customer id get all products products ordered by given customer id during a given date range all right so we're getting pretty much the end of the model here this is going to let's make sure i got everything here i don't think so all right let's go back here all right so most of this is going to be about decorating items on the customer's order right because we have uh you know of course get all the products we're giving warehouse that's that's that's not going to be you know decorating the order that's decorating the the warehouse items right all the inventory items but uh invoice for customer id during a date range giving products for a customer id during date range again both kind of indexing the items within you know the order partitions and let's go ahead and take a look at what that looks like so inside the order partitions we have um invoices invoices were already being indexed on gsi gsi1 on the invoice id so we need to add that gsi2 gsi2 is going to be on customer id again by date range if we go look at gsi2 what we can see is if we query by customer id we're going to get date range list of invoices right for a given customer when you look at the invoice status payment status details all that stuff for a given customer if i want to look at uh [Music] products ordered by a customer for a given date range again i'm looking at those at those order items and i'm just i'm already i'm already indexing them by product id on gsi one because you know they exist on the table under the order so that's probably my more common access pattern right uh and so what i'm going to do is then add that date range condition is gsi2 sort key and you know use the customer id as the partition key and when i query gsi2 by customer id again we'll see not just the orders that the invoices right we'll see the payments now if i have the need to add uh filter conditions right yeah that's where we're starting to go here is because now there's two ways to go about this i could be selective and i could prefix all these dates with the type and that way i could go into the customer's partition and i could say give me customers data for invoices in the date range where i could say between you know i hash date one and i hash date two right that'll give me all the invoices for a customer if i prefix these keys here with the actual entity type right but that means that i can't get like a life cycle of the customer view if i come into the table because it's faceted right i've created you know date sorted lists of items within the customer's partition that could be useful for some access patterns but you know if i have an access panel it says give me all the activity for a customer over the last you know six months 12 years or you know 12 months whatever then i don't necessarily want to create those facets and the reality is you know customers don't make a lot of orders right so they're not going to have a lot of products that they've ordered i mean even a busy busy customer sure you're going to have some corner cases that have large large numbers of orders and we can talk about how to deal with those but for your average online shop the people interacting with your website they're not going to be placing orders with thousands of items thousands of times a day even if you're getting thousands of orders a day so you know you're you're actually okay with filter conditions where i say query where customer id equals x uh within the last six months filter on uh pk starts with you know uh or sk starts with i right i hash that gives me all the invoices the customers received in the last 12 months right so there's ways to be able to selectively parse through the data now the read is going to be a little fat because you're going to go read a bunch of items that you didn't necessarily return but that's okay because the data potentially isn't being read that that that frequently or the collection of items in the partition are relatively small so the filter doesn't hurt you too bad but you know again lots of different ways to be able to carve through the data and see what that looks like so in this particular example if we go back and look at those queries what we can see is that we have a partition again query free products of all given warehouse id oh i don't think i did show you the products in a warehouse let's go back and look at that so if i look at products in a warehouse i'm inventorying on the actual product warehouse items right and since warehouses are already indexed on gsi2 it doesn't really matter that they're not that the warehouse items aren't on gsi one looks like this warehouse item is actually missing uh some attributes we can go ahead and update that real quick let's do that warehouse item warehouse item there you are what product is that that is yeah that's probably the same product and so we can save that and so now if i look at my aggregate view i'm going to see that all my warehouse items are indexed on gsi2 and that's what we're looking for this is actually the wrong warehouse id because he's in so that would give me two warehouse items in the same see i i knew i'd find a way to get you so you guys could watch me stumble around editing data there you go that's okay and so now let's go back look at those product partitions we can see that you know this particular product has two items that exist across two different warehouses uh and have two and two different same inventory in both but again against two different warehouses if i query on gsi two by warehouse id that's what i'll get i'll get the warehouse items the shipments that came out of those warehouses right all the shipments by warehouse was one of our other use cases and i'll get all the products that are in inventory in a given warehouse so if i want to get a given warehouse inventory i can query gsi2 by warehouse id it gives me all the product inventory for that given warehouse and you know so on and so forth right so we have multiple access patterns that are executing against all of our gsis at this point this is one of the common patterns that we like to implement in uh dynamodb and all nosql databases what we call index overloading you know again the only reason that you know if you look back at the access patterns the ones we designed for the table that's about half of them the other half were index patterns but we only have two indexes and that's because we're actually indexing those items in multiple ways depending on the item type and we're reusing the indexes this is called index overloading this is one of the more valuable aspects of single table design and so it's one of the reasons why cost drives so low in single table design when you start to again let's go back to talk about the access patterns of the company that i talked about when we started all of this discussion uh you know they were creating 17 tables within multiple indexes on each table it was exploding out to 60 or 70 objects they needed to create in dynamodb just for a developer sandbox right so just the mechanics of it was killing them uh you know when they try to develop at scale i mean the team had like 125 developers when you look at you know having to do that and and actually look at the inefficiency of the access patterns across those tables it just becomes tremendous and and so these design models these data models start to become you know really core to the cost profile of your application and if you want to scale your systems and do it efficiently then we need to start to think about our data this way and honestly it's not a a so difficult really as we kind of walk through we create that entity model and you know and we just kind of go from there in the in in the beginning so to speak right there is the entity relationship diagram and and that's what we're actually going to build everything off of and there's no you know uh there's no difference in in modeling where the difference comes in it's like how much data am i going to denormalize how much you know of the data am i going to decorate the child objects with from the parent so that you know when i make the queries for the indexed objects i actually get the i get the results that i need without having to go back to the table and make additional queries and that's okay you know that's that's exactly you know uh what the process is about and and honestly as you start to get through you know putting these data models together and creating these aggregates on these different indexes it'll become very apparent to you which ones you want to optimize by denormalizing which ones you want to say you know what that's okay because it's an infrequent pattern and i don't want to pay the cost of denormalizing and you know it's going to be trade-offs so you know hopefully this has been a good exercise i'm sorry i got a little lost in the middle there i think we recovered the um you know this is again a really representative of the exercise that i go through with customers and it's one of the reasons why i kind of wanted to dive into this because it kind of shows the evolution uh in a much better way than you know trying to walk through all of this and show you lambda and show you aggregations and top level reporting all at the same time right this is more of a focused you know exploration of how do we actually model the data in an osql database and and really you know represents the kind of design that we've implemented thousands of times you know here at amazon and you know across the customer base and so hopefully it'll help you guys and one of the things we're going to be doing is next week we're going to be showing you a an implementation based on this model with some real data and it's going to be using aws amplify with direct lambda resolvers right we've got a lot of customers that talk to us and say hey you know what i want to do a single table design but doesn't that you know make my amplify blow up uh and uh nader davit is going to be you know coming on and helping me with this and he's working on that demo right now furiously and you know that'll be a great session so you know please come back dial in for that we're going to be again using this model we're going to show you those 16 well not all 16 but as many as we can actually get implemented and we're going to show you that in aws amplify demo you know next friday same time 12 30 uh central daylight time and uh thanks a lot everybody if there's any questions feel free to fire away i haven't been checking the uh uh sidebar here but hopefully the moderators have been busy if there's any questions you guys want me to answer fire away all right so i'm not seeing anything from the moderators so don't can't see the chat channel so hopefully you guys got everything out of this you guys needed thank you so much again for your time next week dial in we'll show you that to amplify with direct lambda resolvers and that's it from today from dynamodb office hours and thanks again for our moderators thank you salmone for providing this great content and again we'll see you all next week thanks a lot talk to you later bye [Music] you
Info
Channel: Serverless Land
Views: 19,607
Rating: 4.9630485 out of 5
Keywords: serverless, dynamodb, aws, nosql, database, single table
Id: KYy8X8t4MB8
Channel Id: undefined
Length: 58min 29sec (3509 seconds)
Published: Tue Sep 15 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.