Welcome everybody. Thank you
for taking the time out of your day to come see this session on Data
Modelling with Amazon DynamoDB. My name's Rick Houlihan. I'm a senior practice manager for AWS, for the NoSQL services. I'm here to introduce to you today a very special guest: Alex DeBrie.
Alex is a senior engineer at Stedi, but also probably more well-known
in the community for being the author
of The DynamoDB Book. I can't thank Alex enough for taking care of this piece of work for me. The people have been asking for it for years, and Alex took it upon himself to put that thing together. I recommend this to everybody. It is a collection of the best practices and design patterns that I talk about and have espoused for years, and Alex did a fantastic job putting these things together. Here he is today,
we're going to talk. He has a two-part session;
this is part one of two. Welcome Alex and thank you. Thank you, Rick. Welcome everyone to Data Modelling
with DynamoDB. I'm Alex DeBrie,
I'll be your guide today, and we're going to talk about
how to design a nice, clean, efficient
data model with DynamoDB. This is a two-part talk,
so this is part one. Make sure you come back for Part 2. We're going to talk about
three things in this first talk. First, we're going to start off
with Amazon DynamoDB basics. We're going to talk about vocabulary,
terminology, some key concepts just to set the foundation for you,
so you know what we're talking about. Then, we're going to move into
what I call SQL versus NoSQL. We're going to see why DynamoDB and these NoSQL databases were designed the way they were, what they're trying to accomplish, and the design decisions they had to make to accomplish those goals, and then what the implications are for our data modelling. Then, just to bring it
back down to earth we're going to talk about
one-to-many relationships and see what that looks like
in a database like DynamoDB. Again, make sure
you come back for Part 2. We're going to look at additional Data Modelling Strategies
in DynamoDB. Who am I? I'm Alex DeBrie.
I'm the author of The DynamoDB Book, which is just a comprehensive guide
to Data Modelling with DynamoDB. It has a lot of different
strategies and concepts as well as some full
walkthrough examples. Make sure you check that out
if you're interested in DynamoDB. I'm also the creator
of DynamoDBGuide.com which is just a free resource, if you're looking to get familiar
with the DynamoDB API, go check that out.
I'm an AWS Data Hero due to my work
with DynamoDB, and then during my day job
I'm an engineer at Stedi. So, if you want to work with me,
come check out Stedi. So, let's go start here
and talk about DynamoDB basics, and I just want to start off
with some vocabulary to get us on the same page. There are four key concepts
I want to get started with. That's table, item, primary key,
and attributes. I want to look at this
in the context of an example. So, imagine you have
a user service in your application where you're storing users
that are signed up. It might look something like this, so here are four different records
in our users service, I've got myself Alex DeBrie
as well as a few folks from Amazon AWS like Jeff Bezos,
Jeff Barr, Werner Vogels. If you look at
all this data together, that's going to be called
a table in DynamoDB. So, that would be similar
in some ways to a table in a relational database,
but also different as well. Now if you want to look at an
individual record like I have here, that Alex DeBrie record,
that's going to be called an item. So, an item is going to be similar
to a row in a relational database, or a document
in something like MongoDB. Now, when you create
your table you need to specify what's called a primary key, and every item you put in that table
needs to include your primary key. Each item in that table
needs to be uniquely identified by that primary key. So, you can see on this table here,
we have that primary key there, over on the left side,
which is usernames. So, a username is going to uniquely
identify each user in our table, because you don't want to have
users with the same username. In addition to that primary key,
you can also have attributes, which we have over on the right. These are additional bits of data
you can include in your items; they are going to be similar
to columns in a relational database, but the big difference
here with DynamoDB is, you don't need to declare
those columns or those attributes ahead of time like you would
in a relational database. So, people call DynamoDB schemaless, and that's true in the sense that DynamoDB itself is not going to enforce your schema. The database isn't going to do it like it would in the relational world. So, what you need to do is make sure you're enforcing that schema and having a schema in your application code, so you know what you're writing to and reading from DynamoDB. Last thing I want to point out
with attributes, if you look at these, each attribute
is going to have a type. That can include
simple types like strings, like I have here
for first name and last name. You can have strings and numbers,
but you can also have complex types. So, if you look
at this interest attribute, I have an array of items there, and I can have
multiple items in there. You can also have a map,
if you want to have complex objects, or you can have sets if you want
to have a collection of unique items. So, you can have
these complex attributes, and DynamoDB supports them really well.
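To make that concrete, here's a minimal sketch of writing one of those items with Python and boto3. This wasn't shown in the talk; the table name and attribute values are just illustrative:

```python
import boto3

# A made-up "Users" table with "Username" as its simple primary key.
table = boto3.resource("dynamodb").Table("Users")

table.put_item(
    Item={
        "Username": "alexdebrie",                     # primary key
        "FirstName": "Alex",                          # simple string attributes
        "LastName": "DeBrie",
        "Interests": ["DynamoDB", "Serverless"],      # list attribute
        "Address": {"City": "Omaha", "State": "NE"},  # map attribute
        "Roles": {"author", "engineer"},              # Python set -> DynamoDB string set
    }
)
```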
We talked a little bit about the primary key; I want to go back to that a little bit more, just because it's very critical to how you model your data in DynamoDB. We'll see that a little later
on here, but there are two types
of primary keys when you're working
with DynamoDB. The first type is what's called
a simple primary key and that has just
a partition key; and the second kind is
a composite primary key which is made up of two elements
a partition key and a sort key. So, let's look
at examples of those. You know that example we already looked at, that users table. That was an example of a simple primary key; it just had one element, that username, that made up the primary key and uniquely identified each item. We can also look at an example of a composite primary key, like we have here. So, this is a table where you include actors and actresses and the movie roles
that they've been in, and you can see we have a composite
primary key that has two elements. First, you have the partition
key which is actor, and then second you have
this sort key which is movie. One thing I want to call out here
you might notice that there are few items here
that have the same partition key. We have two items that have
Tom Hanks as the actor, and I mentioned that primary key
needs to uniquely identify each item. When you're using
that composite primary key, it's the combination
of those two elements; the partition key and a sort key
that uniquely identify the items. So, you can have multiple items
with the same partition key, and you actually
will very often. Likewise, you see a couple items
that have the same sort key, a couple items
with Toy Story there, but since they have
different partition keys it's going to still
uniquely identify each item. Most of the time, if you're doing
a more complex application, you're going to be using
that composite primary key, which gives you more complex access patterns as compared to that simple primary key, which is mostly just a key-value store.
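As a sketch of what that looks like in code (this wasn't shown in the talk; the table name and billing mode are assumptions), creating a table with a composite primary key only requires declaring the key attributes:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="MovieRoles",
    KeySchema=[
        {"AttributeName": "Actor", "KeyType": "HASH"},   # partition key
        {"AttributeName": "Movie", "KeyType": "RANGE"},  # sort key
    ],
    # Only the key attributes are declared up front; everything else
    # on an item is schemaless.
    AttributeDefinitions=[
        {"AttributeName": "Actor", "AttributeType": "S"},
        {"AttributeName": "Movie", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```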
So, now I want to move on to what I call SQL versus NoSQL, because I know
a lot of you have a background in relational SQL databases. I want to think about what problems
NoSQL is trying to solve, and because of that
what decisions did they have to make? What are the implications
for our data modelling? So, let's get started
with what problems NoSQL is trying to solve. I want to show you that
with a chart here. On that X axis, I'm just
going to show data size. You can see as you go
further out on that X axis, the data size is getting
bigger and bigger from 1 gig all the way
up to a TB and beyond. Then, on that Y axis
you can see performance. It starts with blazing fast
and then goes to regular fast and starts to get sluggish,
and then painful. If you're working with
a traditional relational database, something like MySQL, you might see
a curve sort of like this, where it starts off blazing fast and
sort of slowly gets slower over time until it gets up into
that painful range. When you launch your application
for the first time, or using it in test,
you're loving it. You think it's great because
all the data fits in memory and everything returns really quickly, but sometime six months, a year, two years down the road, you're going to have a lot of data in there, and now it's going to get sluggish. You need to investigate
what's going on. "Hey, why is this getting slower?
Do I need to add more indexes? Do I need to do
some denormalization? Do I need to make some other changes
to my data model?" At some point it might
get so painful, so slow that you need
to re-architect. Because you use some features
that worked at 1 gig and 10 gigs but just don't work
at one terabyte and 10 terabytes you need
to refactor how you do that. So, that's the typical
performance curve that you might see with
a traditional relational database. DynamoDB on the other hand is going to have
a performance curve like this. It's going to be very flat. It's going to be the same exact
performance in your test environment or on the first day you launch,
as it is 10 years down the road when you have 10 terabytes of data
and still churning through. It's going to give you the exact
same consistent performance. That's really what DynamoDB
is aiming to do there. With that in mind,
I think it's worth thinking about what I call DynamoDB's
guiding principle. This isn't in the DynamoDB
docs anywhere, or in their marketing materials. But I think it's
the unstated assumption of what guides product
decisions around DynamoDB, in terms of what features they add and, more importantly, what features they don't have. That principle is "don't allow
operations that won't scale." We do not want to let you
do something that works at 1 gig
and 10 gigs of data but isn't going to work down the road at 10 terabytes of data. Given that, let's talk
about SQL to NoSQL. How does DynamoDB make sure you can't run operations that don't scale? First thing is, the primary key is very important. That's going to drive how you do
almost all your query patterns, and we looked earlier on about
how we have these actors and movie roles in this table.
Almost all your access patterns are going to be based off
that primary key. So, it's very easy for me
to go to this table and look up
Natalie Portman in Black Swan. It's very easy to say: "hey, give me all
the Tom Hanks' movies." But it's very difficult
to query off the attributes. We don't worry of those attributes, so I couldn't say "hey, give me
all the movies in the year 2000" or "give me all the German movies"
right? That's not going to work. So, very important to use
those primary keys correctly. Second thing, you need
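A minimal sketch of those two easy lookups with boto3, assuming the hypothetical "MovieRoles" table from above:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MovieRoles")

# Look up one item by its full primary key (partition key + sort key).
black_swan = table.get_item(
    Key={"Actor": "Natalie Portman", "Movie": "Black Swan"}
).get("Item")

# Fetch every item sharing a partition key.
tom_hanks_movies = table.query(
    KeyConditionExpression=Key("Actor").eq("Tom Hanks")
)["Items"]
```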
The second thing you need to know about DynamoDB is that there are no joins, and this is something that feels weird if you're coming from a relational database, where you're used to joining your data. How do I do this? Really, a lot of people, when they're working with Dynamo, ask: why these primary key rules? Why no joins? Why is that going to work? To understand that, I want to give you a little background on how Dynamo works under the hood. Because I used to think Dynamo was this super-fast computer up in the cloud, and somehow they had made just a faster computer. That's not what's going on. It's just basic computer science
under the hood. So, what we think of as DynamoDB,
that DynamoDB front-end, it's actually going to be
splitting your data into these different partitions
behind the scenes. They have these different storage partitions, and for about every 10 gigs of data you have in your table, they're going to split it off
into another partition. So, imagine here maybe we have 15,
20 gigs of data in our table; we've got these two
different partitions. Now, when this request
comes into DynamoDB, let's say I want to add a new item. So, I make that PutItem call with DynamoDB; I'm going to insert "Tom Hanks" in "Big" into my DynamoDB table. That DynamoDB front-end, which is called the request router, what's it going to do? Right away, it's going to hash that actor value, which is your partition key. It's going to hash that partition key value and figure out "which partition does this belong to?" It's going to say: "this item belongs to partition one," and it's going to write it to partition one, and then return to you. Then, as your data grows to 30 gigs, you add another partition, or maybe you need to add six more partitions, or maybe you need to add 1,000 more partitions. The important part is
that initial step right there: that's a constant-time operation. It's always just a lookup in a hash map, very quick, very efficient. So, even if your table gets to 10 terabytes, the very first thing that's happening is narrowing it down to this 10-gig chunk and figuring out which partition it needs to go to. So, that's really great, and again, you don't need to worry about these partitions; DynamoDB is doing it for you. But I think it helps you build that mental model of what's happening with DynamoDB.
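Just to build intuition, here's a toy sketch of that constant-time routing step. To be clear, this is not DynamoDB's actual implementation (the real service uses consistent hashing and replication); it only illustrates why the lookup doesn't slow down as data grows:

```python
# A toy illustration of the request-router idea, NOT DynamoDB's real code.
import hashlib

NUM_PARTITIONS = 2  # imagine ~15-20 gigs of data split at ~10 gigs each

def route(partition_key: str) -> int:
    """Hash the partition key value and map it to a partition in O(1)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

# "Tom Hanks" always hashes to the same partition, no matter how
# many items the table holds.
print(route("Tom Hanks"))
```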
So, now we know about those partitions, we know why we can't do joins, and we know about these primary keys. What are the implications
for our data modelling? They've made these design decisions; how do we use DynamoDB in our application? So, the first implication is, you need to know
your access patterns up-front. This is going to be different
than a relational database, where you sort of model your data
in this abstract way. You put each entity
in a separate table and you model relationships
between them. Then, you think about
your access patterns and say: "what queries do I need?"
Which indexes do I need to add? That's not the case with DynamoDB. With Dynamo, you're going
to design your table for your access patterns
to make these efficient lookups. So, that's the first big one: you need to know
your access patterns up-front. The second thing is, you need
to use secondary indexes, and this helps you get around the
importance of those primary keys. So, we saw just a bit ago how you need to query with that primary key. It's very easy to look up movies by their actor name. But what if I do want to query off one of those attributes? What if I want to have a different access pattern that says, "hey, give me all of Tom Hanks' movies after the year 2000?" How do I accomplish
that sort of access pattern? What you can do is, you can use
something called secondary indexes. This basically allows you
to declare an additional primary key
on your table and DynamoDB is going to handle
duplicating that data into that secondary index
with that new primary key to enable those additional
access patterns. So, these items I have here,
I could have a secondary index with the partition key of actor
and the sort key of year; it's going to rearrange it like this, and now you can see it's got that partition key of actor and sort key of year,
very efficient for me to go and look up Tom Hanks' movies
by a particular year. So, secondary index
is very important. Two things you want to note there. Number one, the data
from your base table is going to get copied
into that secondary index with that new primary key, and DynamoDB is going to handle
all that replication for you. So, you don't have to manage
these different items. You just write it one time and it's going to handle
replicating it out into your indexes. Second thing you need to know is, you can only use these read-based
operations on your secondary indexes. You can't do any writes
on that secondary index. All the writes need to go
through your base table. So, you can read with secondary indexes, but no writes. That's the second implication: use those secondary indexes, which helps with the importance of primary keys.
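As a sketch, querying such a secondary index might look like this with boto3. The index name "ActorYearIndex" is made up, and it assumes a GSI with actor as the partition key and a numeric year as the sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MovieRoles")

# Query the secondary index by passing IndexName.
response = table.query(
    IndexName="ActorYearIndex",  # hypothetical GSI: Actor (HASH), Year (RANGE)
    KeyConditionExpression=Key("Actor").eq("Tom Hanks") & Key("Year").gt(2000),
)
recent_movies = response["Items"]
```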
The third implication you need to know about, and this is the most controversial one, is that you're going to put all your entities into one table and use generic primary keys. This is the single-table design
that you've heard about, you've heard Rick talk about,
it's very interesting, and I think it throws a lot of people. So, let's just walk through
a simple example to see what that looks like and see
what we're talking about here. The example we're going to do, this is going to be
a SaaS application. Imagine you're building
SaaS application; we're going to model out
two types of entities. First, we're going
to have organizations or the people that sign up
for your SaaS application, pay your bill, and then within
that organization there are users that act on behalf
of the organization to actually use
your SaaS application. Let's see how we've modelled
these two different types of entities
in a single DynamoDB table. We'll start with
a couple of organizations, and here we have two items
in our DynamoDB Table, two organizations; we have Berkshire Hathaway,
and we have Facebook. A couple of things
I want to call out here, if you look
at the primary key, notice that the primary key names
are very generic. It's just "PK" for partition key,
and "SK" for short key. I'm not using something like
organization name or username and things like that, because if I have multiple
different item types in my table, they're not all going to share
the same attribute name. So, my organization is not
going to have a username, my users are not going to
have an organization name, so you need to have
these generic names here. Another thing you need to look at,
just look at the pattern and values for these primary key values.
They're kind of weird. It's in all caps, it's also got
this pattern, where I start with ORG# and then the organization name and I'm doing that for both
the partition key and the sort key. What you're going to do is, you're going to create
these little templates and say "hey, this is what an organization
items pattern looks like." It's going to help you
arrange those items. It is going to help you
tell one item type from another
in your DynamoDB table. Now, we know the basics. We've got our two organization
items in there. Let's add a few user items,
so here we go. We have now five items
in our table. We still have our
two organization items, but we've also added three user items
in our table outlined in red. So, for Berkshire we've got Charlie
Munger and Warren Buffett, and then for Facebook
we've got Sheryl Sandberg. A couple of things
I want to point out here, just look at
the primary key pattern, they're slightly different
on that user item. So, the user sort key
is going to be USER# and then the username
to help you indicate that this is a user item.
Another thing I want to notice is that the attributes
are going to be different between these different
types of entities. So, if we look
at that organization entity, it's got attributes
like organization name, subscription level, things
that matter for that organization. Whereas the user is going to have
attributes like username and role, things that matter for that user. They can be distinct, and that's where that schemaless nature of DynamoDB really helps you, because you don't need to declare all these attributes and sort of enforce a schema on all these different types of items. So, this is the basics of single-table design; it's where people get thrown a little bit.
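Since DynamoDB won't enforce those key templates for you, one common approach is to centralize them in your application code. A minimal sketch, with a made-up table name and illustrative attributes:

```python
import boto3

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

def org_item(org_name, subscription_level):
    """Organization items use ORG#<OrgName> for both PK and SK."""
    return {
        "PK": f"ORG#{org_name.upper()}",
        "SK": f"ORG#{org_name.upper()}",
        "OrgName": org_name,
        "SubscriptionLevel": subscription_level,
    }

def user_item(org_name, username, role):
    """User items share the organization's PK; the SK is USER#<Username>."""
    return {
        "PK": f"ORG#{org_name.upper()}",
        "SK": f"USER#{username.upper()}",
        "Username": username,
        "Role": role,
    }

table.put_item(Item=org_item("Berkshire", "Enterprise"))
table.put_item(Item=user_item("Berkshire", "WarrenBuffett", "Admin"))
```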
So, I want to bring it back down to earth a little bit and talk about
one-to-many relationships. Because I'm guessing
most of you have modelled out one-to-many relationships
in applications before. I want to walk through how you might
do that in DynamoDB, what that looks like,
just to give you a feel for how it's different
than relational database. So, let's talk about
one-to-many relationships. First, what is
a one-to-many relationship? Well, this is anytime you have
some sort of parent entity with a lot of related
sub-entities, or you might have an owner entity
that has all these related entities. So, examples of this might be an office with employees; an office might have multiple
employees that work in that office. In an e-commerce store you might have
a customer with multiple orders. One customer makes many orders, at least you hope, if you're running an e-commerce store. And then in your SaaS application you have the organization
and many users in that organization, like we just talked about
in our example. So, with one-to-many relationship there's a key problem
you need to solve, and that's usually, how do I get
information about my parent item when I'm fetching my related items? If I'm fetching an order, how do I also get information about that customer at the same time? If I'm fetching a user, how do I get information
about its organization as well? If you're using
a relational database, you do that by normalizing your data
and using joins at query time. But we already talked about
how Dynamo doesn't have joins. So, in Dynamo you need to do
that a little bit differently, a couple different
strategies here. One is to denormalize your data
which could sound like a dirty word for those of you coming
from a relational database. Second strategy
is to pre-join your data. So, we're going to look
through three different what I call strategies,
one-to-many relationship strategies. These are just different approaches
to modelling out these one-to-many relationships,
depending on your needs. So, let's get started
with the first strategy. First strategy is
a denormalization strategy, denormalization,
plus a complex attribute. So, I want to show this
by way of an example, when we do this let's go back
to that SaaS application that we've already used, where we have organizations,
and we have users. We're going to add
one more entity type in here. We're going to say: "when an organization signs up
for that SaaS application, they have to pay for it,
choose a subscription." It will allow them to register
multiple payment methods, and this allows you know
at the end of the month if one of those payment methods fail,
you can go to that back up one, and they don't lose access
to your service. So, one-to-many payment methods,
on this application. So, how would we model
that in DynamoDB? What we're going to do here, I've just added
on these organization items, this payment method attribute.
We've got that outlined in red there; notice how that payment method is a complex attribute. It's this map, and it's got
a default payment method and a backup payment method, and each method has some rich
information like that type, and number, and all sorts of stuff.
So, it's this complex object which is something you would never do
in a relational database, because one of the principles of normalization is to break each column value
down to an atomic value. So, you wouldn't have a complex
attribute like you do here, but we're going to do
that here in DynamoDB. This is a good strategy
when two things are true. Number one, you don't have
an access pattern on that related item directly. So, in this case,
you know that related item is the payment method, we're never going to say "hey,
I have this credit card number. Can you go find which organization
it belongs to?" You are never going to have
that access pattern. The only access pattern
we're going to have around payment methods is,
"hey, it's the end of the month. We need to charge this organization,
go look up the organization, find all the payment methods
related to it and work through
until one of them succeeds." So, we don't have an access pattern
on that related item directly. You also want to make sure you have a limited number
of related items. This is because DynamoDB
has an item size limit of 400 KB. You don't want to exceed that by putting too much information
on that parent item. Just to go back to one of those
one-to-many relationships we showed earlier, imagine you had an e-commerce store
where a customer makes orders, you could sort of denormalize this
with the complex attribute, and put every order
on that customer item itself. But then, at some point
that customer item would get so big that you couldn't add more information to it. If they want to make
their 30th or 40th order, you'd have to say sorry,
we can't take your order because we modelled
their data incorrectly, which is not something
you want to do. So, you know, in this case
it's very reasonable to limit the number of related items.
You're not going to allow them to have 10,000 different payment
methods, you can say: "you can have two, or three,
or five, but not 10,000." So, that works in this situation. So, that's our first strategy,
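Here's a sketch of writing that organization item with its complex attribute. The names are illustrative, and the point is that the whole item must stay under that 400 KB limit:

```python
import boto3

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

# The payment methods live in a complex map attribute on the parent item.
# That's fine here because we never look an organization up *by* payment
# method, and we cap how many methods can be registered.
table.put_item(
    Item={
        "PK": "ORG#BERKSHIRE",
        "SK": "ORG#BERKSHIRE",
        "OrgName": "Berkshire Hathaway",
        "SubscriptionLevel": "Enterprise",
        "PaymentMethods": {
            "Default": {"Type": "CreditCard", "Last4": "1111"},
            "Backup": {"Type": "ACH", "Last4": "2222"},
        },
    }
)
```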
So, that's our first strategy; let's go into the second strategy. This is another
denormalization strategy. This is denormalization
plus duplication. So, let's look
at what it looks like here. Here, we have a one-to-many
relationship with authors and books. An author might write multiple books,
so this could be for a library, for a bookstore,
or something like that. If you recall going back
to the beginning of this section, we talked about how the key problem
with one-to-many relationships is, how do I get information
about that parent item, when I'm fetching this related item? So, if I look up a book
by Stephen King, or a book by JK Rowling, how do I get information about
that parent at the same time? What we've done here is, we've just duplicated some of
that data onto the item itself. So, you can see outlined
in red there, on each of these book items I've just copied the author's birth date onto those items. This sort of duplication, again, is something you would never do
in a relational database. When you learn about normalization,
don't repeat yourself. But I think it's helpful
to think about why denormalization is helpful here,
and why do you want to do that? The big problem
is around data integrity. If you have the same piece of data represented across a lot of different rows in your relational database, and you ever need to update that piece of data, you need to run all over your database finding all those pieces of information and updating them, so that you don't have data integrity issues. So, that's the big thing we're fighting against there. So, when you use this, you want to make sure you don't run into that issue. This is good in one
of the two following cases; number one, if that
duplicated data is immutable. In this case Stephen King's birthday
is not going to change, it's the same today
as it was yesterday, as it will be tomorrow,
as it will be 2 years from now. So, you don't need to worry about
that data changing; it's immutable, so you don't have to worry about
data integrity issues as much. Even if your data is not immutable, this could work if the data
doesn't change often or if it's not replicated much. If it's only replicated across two or three items, maybe it's fine to go out and search out those items and update them as needed. Where you're going
to get into trouble is when this data
can change fairly frequently, maybe daily, or hourly,
something like that, and it's replicated across
hundreds or thousands of items, because now, you're going to be
hopping all over your database, updating all those items, and maybe having data integrity issues. So, think about that when you're using denormalization and duplication.
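Concretely, a book item carrying that duplicated, immutable data might be written like this (a sketch; the table and attribute names are made up):

```python
import boto3

table = boto3.resource("dynamodb").Table("Books")  # hypothetical table name

# The author's birth date is copied onto each book item, so fetching a
# book also returns that parent information. That's safe here because a
# birth date is immutable: we'll never have to chase down every copy.
table.put_item(
    Item={
        "Author": "Stephen King",
        "BookTitle": "It",
        "AuthorBirthDate": "1947-09-21",
        "Year": 1986,
    }
)
```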
So, those first two strategies, both denormalization strategies,
can work in certain situations, but sometimes they won't work. That's where we might fall back
to this third strategy, which is a composite
primary key and a query. So, we talked about
composite primary key earlier, we're going to show how we model
that data in our SaaS application to really work with this
one-to-many relationship. So, going back to that SaaS
application we have our five items, we have two organizations,
and three users in there. One thing you'll notice
is that three of those items have the same partition key.
You can see ORG#BERKSHIRE, we see a couple
different items in there. There's a query operation
in DynamoDB that makes it very efficient
to get all items with the same partition key. So, we could very quickly
get all those items, and the reason
that's so efficient, I want to go back to that chart
we talked about earlier, with those different partitions and how the very
first thing DynamoDB is doing, when you are sending a request,
is checking that partition key and figuring out which partition it needs to go to. Because of that,
it's very efficient to read items out
of a single partition. So, all items with
a single partition key, that's going to be called
an item collection, that's going to be very important
as you're modelling with DynamoDB to sort of build
the right item collections to handle your access patterns. One thing I want to call out here is, notice that we have different entity types in this single item collection. We have an organization item, and we also have our two user items in this single item collection.
The reason we might do this is, imagine you had
an access pattern and said: "hey, list all members
for an organization." When you're doing that,
you need to fetch all the members, but you also need to fetch
that organization item, so you can enrich those members
with some of that information. By putting them
in the same item collection, co-locating them together like that, it's a single DynamoDB query, a very efficient read, that's going to scale as you get up to terabytes and terabytes of data; it's going to work really well. So, with that composite
primary key and query, what we're doing is,
we're joining our data, but we're pre-joining it
into an item collection. We're joining it at write time rather than at read time like you might in a relational database.
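A sketch of that pre-joined read, assuming the single-table layout from earlier: one Query returns the organization item and its users together, and the application tells them apart by the sort key pattern:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

# One Query fetches the whole item collection: the organization item
# plus all of its user items, pre-joined at write time.
items = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#BERKSHIRE")
)["Items"]

org = next(i for i in items if i["SK"].startswith("ORG#"))
users = [i for i in items if i["SK"].startswith("USER#")]
```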
So, those are the three one-to-many relationship strategies that I use a lot: denormalization when you can, and then that composite primary key and query. There are a few other ones
in the book as well if you want to check that out. So, just to summarize
what we talked about here, we went through
three main categories. First of all, we started off
with DynamoDB basics, and just talked about vocabulary,
terminology, key concepts just to get us
on the same foundation, so we know what we're talking about with DynamoDB. Then we looked into SQL
versus NoSQL. For those of you coming
from a relational database, what is
NoSQL trying to achieve? What are the design decisions it made to achieve those goals? Then what are the implications for you modelling that data? Finally, to bring it
back down to earth, we talked about
one-to-many relationships and what that looks like in DynamoDB, just to give you something
more concrete there. Again, this is a two-part talk; so, keep watching Data Modelling
with DynamoDB Part 2, we're going to do more
Data Modelling Strategies, we're going to talk about
filtering and sorting and all sorts of fun stuff.
So, make sure you check that out. Then, also later on, Rick Houlihan,
who introduced me at the beginning, does his Advanced Design Patterns with DynamoDB, which is also a two-part session. Those are among the best sessions every year, the most watched on YouTube, and hard sessions to get into. So, make sure you watch those as well; they're great stuff. Once again, thank you. I'm Alex DeBrie,
thanks for watching this talk.