Welcome everybody. Thank you
for taking the time out of your day to come see this session on Data
Modelling with Amazon DynamoDB. My name's Rick Houlihan. I'm a senior practice manager for AWS, for the NoSQL services. I'm here to introduce to you today a very special guest: Alex DeBrie.
Alex is a senior engineer at Stedi, but also probably more well-known
in the community for being the author
of The DynamoDB Book. I can't thank Alex enough for taking care of this piece of work for me. The people have been asking for it for years, and Alex took it upon himself to put that thing together. I recommend this to everybody. It is a collection of the best practices and design patterns that I talk about and have espoused for years, and Alex did a fantastic job putting these things together. Here he is today,
we're going to talk. He has a two-part session;
this is part one of two. Welcome Alex and thank you. Thank you, Rick. Welcome everyone to Data Modelling
with DynamoDB. I'm Alex DeBrie,
I'll be your guide today, and we're going to talk about
how to design a nice, clean, efficient
data model with DynamoDB. This is a two-part talk,
so this is part one. Make sure you come back for Part 2. We're going to talk about
three things in this first talk. First, we're going to start off
with Amazon DynamoDB basics. We're going to talk about vocabulary,
terminology, some key concepts just to set the foundation for you,
so you know what we're talking about. Then, we're going to move into
what I call SQL versus NoSQL. We're going to see why DynamoDB and these NoSQL databases were designed the way they were, what they're trying to accomplish, and the design decisions they had to make to accomplish those goals, and then what the implications are for our data modelling. Then, just to bring it
back down to earth we're going to talk about
one-to-many relationships and see what that looks like
in a database like DynamoDB. Again, make sure
you come back for Part 2. We're going to look at additional Data Modelling Strategies
in DynamoDB. Who am I? I'm Alex DeBrie.
I'm the author of The DynamoDB Book, which is just a comprehensive guide
to Data Modelling with DynamoDB. It has a lot of different
strategies and concepts as well as some full
walkthrough examples. Make sure you check that out
if you're interested in DynamoDB. I'm also the creator
of DynamoDBGuide.com which is just a free resource, if you're looking to get familiar
with the DynamoDB API, go check that out.
I'm an AWS Data Hero due to my work
with DynamoDB, and then during my day job
I'm an engineer at Stedi. So, if you want to work with me,
come check out Stedi. So, let's go start here
and talk about DynamoDB basics, and I just want to start off
with some vocabulary to get us on the same page. There are four key concepts
I want to get started with. That's table, item, primary key,
and attributes. I want to look at this
in the context of an example. So, imagine you have
a user service in your application where you're storing users
that are signed up. It might look something like this, so here are four different records
in our users service, I've got myself Alex DeBrie
as well as a few folks from Amazon AWS like Jeff Bezos,
Jeff Barr, Werner Vogels. If you look at
all this data together, that's going to be called
a table in DynamoDB. So, that would be similar
in some ways to a table in a relational database,
but also different as well. Now if you want to look at an
individual record like I have here, that Alex DeBrie record,
that's going to be called an item. So, an item is going to be similar
to a row in a relational database, or a document
in something like MongoDB. Now, when you create
your table you need to specify what's called a primary key, and every item you put in that table
needs to include your primary key. Each item in that table
needs to be uniquely identified by that primary key. So, you can see on this table here,
we have that primary key there, over on the left side,
which is usernames. So, a username is going to uniquely
identify each user in our table, because you don't want to have
users with the same username. In addition to that primary key,
you can also have attributes, which we have over on the right. These are additional bits of data
you can include in your items; they are going to be similar
to columns in a relational database, but the big difference
here with DynamoDB is, you don't need to declare
those columns or those attributes ahead of time like you would
in a relational database. So, people call DynamoDB schemaless, and that's true in the sense that DynamoDB itself is not going to enforce your schema. The database isn't going to do it like it would in the relational world. So, what you need to do is make sure you're enforcing that schema and having a schema in your application code, so you know what you're writing to and reading from DynamoDB. Last thing I want to point out
with attributes, if you look at these, each attribute
is going to have a type. That can include
simple types like strings, like I have here
for first name and last name. You can have strings and numbers,
but you can also have complex types. So, if you look
at this interest attribute, I have an array of items there, and I can have
multiple items in there. You can also have a map,
if you want to have complex objects, or you can have sets if you want
to have a collection of unique items. So, you can have
these complex attributes, and DynamoDB supports them really well.
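To make that concrete, here's a minimal sketch of writing one of those items with Python and boto3. This wasn't shown in the talk; the table name and attribute values are just illustrative:

```python
import boto3

# A made-up "Users" table with "Username" as its simple primary key.
table = boto3.resource("dynamodb").Table("Users")

table.put_item(
    Item={
        "Username": "alexdebrie",                     # primary key
        "FirstName": "Alex",                          # simple string attributes
        "LastName": "DeBrie",
        "Interests": ["DynamoDB", "Serverless"],      # list attribute
        "Address": {"City": "Omaha", "State": "NE"},  # map attribute
        "Roles": {"author", "engineer"},              # Python set -> DynamoDB string set
    }
)
```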
We talked a little bit about the primary key; I want to go back to that a little bit more, just because it's very critical to how you model your data in DynamoDB. We'll see that a little later
on here, but there are two types
of primary keys when you're working
with DynamoDB. The first type is what's called
a simple primary key and that has just
a partition key; and the second kind is
a composite primary key which is made up of two elements
a partition key and a sort key. So, let's look
at examples of those. You know that example we already looked at, that users table. That was an example of a simple primary key; it just had one element, that username, that made up the primary key and uniquely identified each item. We can also look at an example of a composite primary key, like we have here. So, this is a table where you include actors and actresses and the movie roles
that they've been in, and you can see we have a composite
primary key that has two elements. First, you have the partition
key which is actor, and then second you have
this sort key which is movie. One thing I want to call out here
you might notice that there are few items here
that have the same partition key. We have two items that have
Tom Hanks as the actor, and I mentioned that primary key
needs to uniquely identify each item. When you're using
that composite primary key, it's the combination
of those two elements; the partition key and a sort key
that uniquely identify the items. So, you can have multiple items
with the same partition key, and you actually
will very often. Likewise, you see a couple items
that have the same sort key, a couple items
with Toy Story there, but since they have
different partition keys it's going to still
uniquely identify each item. Most of the time, if you're doing
a more complex application, you're going to be using
that composite primary key, which gives you more complex access patterns as compared to that simple primary key, which is mostly just a key-value store.
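As a sketch of what that looks like in code (this wasn't shown in the talk; the table name and billing mode are assumptions), creating a table with a composite primary key only requires declaring the key attributes:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="MovieRoles",
    KeySchema=[
        {"AttributeName": "Actor", "KeyType": "HASH"},   # partition key
        {"AttributeName": "Movie", "KeyType": "RANGE"},  # sort key
    ],
    # Only the key attributes are declared up front; everything else
    # on an item is schemaless.
    AttributeDefinitions=[
        {"AttributeName": "Actor", "AttributeType": "S"},
        {"AttributeName": "Movie", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```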
So, now I want to move on to what I call SQL versus NoSQL, because I know
a lot of you have a background in relational SQL databases. I want to think about what problems
NoSQL is trying to solve, and because of that
what decisions did they have to make? What are the implications
for our data modelling? So, let's get started
with what problems NoSQL is trying to solve. I want to show you that
with a chart here. On that X axis, I'm just
going to show data size. You can see as you go
further out on that X axis, the data size is getting
bigger and bigger from 1 gig all the way
up to a TB and beyond. Then, on that Y axis
you can see performance. It starts with blazing fast
and then goes to regular fast and starts to get sluggish,
and then painful. If you're working with
a traditional relational database, something like MySQL, you might see
a curve sort of like this, where it starts off blazing fast and
sort of slowly gets slower over time until it gets up into
that painful range. When you launch your application
for the first time, or using it in test,
you're loving it. You think it's great because
all the data fits in memory and everything returns really quickly, but sometime six months, a year, two years down the road, you're going to have a lot of data in there, and now it's going to get sluggish. You need to investigate
what's going on. "Hey, why is this getting slower?
Do I need to add more indexes? Do I need to do
some denormalization? Do I need to make some other changes
to my data model?" At some point it might
get so painful, so slow that you need
to re-architect. Because you use some features
that worked at 1 gig and 10 gigs but just don't work
at one terabyte and 10 terabytes you need
to refactor how you do that. So, that's the typical
performance curve that you might see with
a traditional relational database. DynamoDB on the other hand is going to have
a performance curve like this. It's going to be very flat. It's going to be the same exact
performance in your test environment or on the first day you launch,
as it is 10 years down the road when you have 10 terabytes of data
and still churning through. It's going to give you the exact
same consistent performance. That's really what DynamoDB
is aiming to do there. With that in mind,
I think it's worth thinking about what I call DynamoDB's
guiding principle. This isn't in the DynamoDB
docs anywhere, or in their marketing materials. But I think it's
the unstated assumption of what guides product
decisions around DynamoDB, in terms of what features they add and, more importantly, what features they don't have. That principle is "don't allow
operations that won't scale." We do not want to let you
do something that works at 1 gig
and 10 gigs of data but isn't going to work down the road at 10 terabytes of data. Given that, let's talk
about SQL to NoSQL. How does DynamoDB make sure you can't run operations that don't scale? First thing is, the primary key is very important. That's going to drive how you do
almost all your query patterns, and we looked earlier on about
how we have these actors and movie roles in this table.
Almost all your access patterns are going to be based off
that primary key. So, it's very easy for me
to go to this table and look up
Natalie Portman in Black Swan. It's very easy to say: "hey, give me all
the Tom Hanks' movies." But it's very difficult
to query off the attributes. We don't worry of those attributes, so I couldn't say "hey, give me
all the movies in the year 2000" or "give me all the German movies"
right? That's not going to work. So, very important to use
those primary keys correctly. Second thing, you need
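A minimal sketch of those two easy lookups with boto3, assuming the hypothetical "MovieRoles" table from above:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MovieRoles")

# Look up one item by its full primary key (partition key + sort key).
black_swan = table.get_item(
    Key={"Actor": "Natalie Portman", "Movie": "Black Swan"}
).get("Item")

# Fetch every item sharing a partition key.
tom_hanks_movies = table.query(
    KeyConditionExpression=Key("Actor").eq("Tom Hanks")
)["Items"]
```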
The second thing you need to know about DynamoDB is that there are no joins, and this is something that feels weird if you're coming from a relational database, where you're used to joining your data. How do I do this? Really, a lot of people, when they're working with Dynamo, ask: why these primary key rules? Why no joins? Why is that going to work? To understand that, I want to give you a little background on how Dynamo works under the hood. Because I used to think Dynamo was this super-fast computer up in the cloud, and somehow they had made just a faster computer. That's not what's going on. It's just basic computer science
under the hood. So, what we think of as DynamoDB,
that DynamoDB front-end, it's actually going to be
splitting your data into these different partitions
behind the scenes. They have these different storage partitions, and for about every 10 gigs of data you have in your table, they're going to split it off
into another partition. So, imagine here maybe we have 15,
20 gigs of data in our table; we've got these two
different partitions. Now, when this request
comes into DynamoDB, let's say I want to add a new item. So, I make that PutItem call with DynamoDB; I'm going to insert "Tom Hanks" in "Big" into my DynamoDB table. That DynamoDB front-end, which is called the request router, what's it going to do? Right away, it's going to hash that actor value, which is your partition key. It's going to hash that partition key value and figure out "which partition does this belong to?" It's going to say: "this item belongs to partition one," and it's going to write it to partition one, and then return to you. Then, as your data grows to 30 gigs, you add another partition, or maybe you need to add six more partitions, or maybe you need to add 1,000 more partitions. The important part is
that initial step right there: that's a constant-time operation. It's always just a lookup in a hash map, very quick, very efficient. So, even if your table gets to 10 terabytes, the very first thing that's happening is narrowing it down to this 10-gig chunk and figuring out which partition it needs to go to. So, that's really great, and again, you don't need to worry about these partitions; DynamoDB is doing it for you. But I think it helps you build that mental model of what's happening with DynamoDB.
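Just to build intuition, here's a toy sketch of that constant-time routing step. To be clear, this is not DynamoDB's actual implementation (the real service uses consistent hashing and replication); it only illustrates why the lookup doesn't slow down as data grows:

```python
# A toy illustration of the request-router idea, NOT DynamoDB's real code.
import hashlib

NUM_PARTITIONS = 2  # imagine ~15-20 gigs of data split at ~10 gigs each

def route(partition_key: str) -> int:
    """Hash the partition key value and map it to a partition in O(1)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

# "Tom Hanks" always hashes to the same partition, no matter how
# many items the table holds.
print(route("Tom Hanks"))
```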
So, now we know about those partitions, we know why we can't do joins, and we know about these primary keys. What are the implications
for our data modelling? They've made these design decisions; how do we use DynamoDB in our application? So, the first implication is, you need to know
your access patterns up-front. This is going to be different
than a relational database, where you sort of model your data
in this abstract way. You put each entity
in a separate table and you model relationships
between them. Then, you think about
your access patterns and say: "what queries do I need?"
Which indexes do I need to add? That's not the case with DynamoDB. With Dynamo, you're going
to design your table for your access patterns
to make these efficient lookups. So, that's the first big one: you need to know
your access patterns up-front. The second thing is, you need
to use secondary indexes, and this helps you get around the
importance of those primary keys. So, we saw just a bit ago how you need to query with that primary key. It's very easy to look up movies by their actor name. But what if I do want to query off one of those attributes? What if I want to have a different access pattern that says, "hey, give me all of Tom Hanks' movies after the year 2000?" How do I accomplish
that sort of access pattern? What you can do is, you can use
something called secondary indexes. This basically allows you
to declare an additional primary key
on your table and DynamoDB is going to handle
duplicating that data into that secondary index
with that new primary key to enable those additional
access patterns. So, these items I have here,
I could have a secondary index with the partition key of actor
and the sort key of year; it's going to rearrange it like this, and now you can see it's got that partition key of actor and sort key of year,
very efficient for me to go and look up Tom Hanks' movies
by a particular year. So, secondary index
is very important. Two things you want to note there. Number one, the data
from your base table is going to get copied
into that secondary index with that new primary key, and DynamoDB is going to handle
all that replication for you. So, you don't have to manage
these different items. You just write it one time and it's going to handle
replicating it out into your indexes. Second thing you need to know is, you can only use these read-based
operations on your secondary indexes. You can't do any writes
on that secondary index. All the writes need to go
through your base table. So, you can read with secondary indexes, but no writes. That's the second implication: use those secondary indexes, which helps with the importance of primary keys.
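As a sketch, querying such a secondary index might look like this with boto3. The index name "ActorYearIndex" is made up, and it assumes a GSI with actor as the partition key and a numeric year as the sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MovieRoles")

# Query the secondary index by passing IndexName.
response = table.query(
    IndexName="ActorYearIndex",  # hypothetical GSI: Actor (HASH), Year (RANGE)
    KeyConditionExpression=Key("Actor").eq("Tom Hanks") & Key("Year").gt(2000),
)
recent_movies = response["Items"]
```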
The third implication you need to know about, and this is the most controversial one, is that you're going to put all your entities into one table and use generic primary keys. This is the single-table design
that you've heard about, you've heard Rick talk about,
it's very interesting, and I think it throws a lot of people. So, let's just walk through
a simple example to see what that looks like and see
what we're talking about here. The example we're going to do, this is going to be
a SaaS application. Imagine you're building
SaaS application; we're going to model out
two types of entities. First, we're going
to have organizations or the people that sign up
for your SaaS application, pay your bill, and then within
that organization there are users that act on behalf
of the organization to actually use
your SaaS application. Let's see how we've modelled
these two different types of entities
in a single DynamoDB table. We'll start with
a couple of organizations, and here we have two items
in our DynamoDB Table, two organizations; we have Berkshire Hathaway,
and we have Facebook. A couple of things
I want to call out here, if you look
at the primary key, notice that the primary key names
are very generic. It's just "PK" for partition key,
and "SK" for short key. I'm not using something like
organization name or username and things like that, because if I have multiple
different item types in my table, they're not all going to share
the same attribute name. So, my organization is not
going to have a username, my users are not going to
have an organization name, so you need to have
these generic names here. Another thing you need to look at,
just look at the pattern and values for these primary key values.
They're kind of weird. It's in all caps, it's also got
this pattern, where I start with ORG# and then the organization name and I'm doing that for both
the partition key and the sort key. What you're going to do is, you're going to create
these little templates and say "hey, this is what an organization
items pattern looks like." It's going to help you
arrange those items. It is going to help you
tell one item type from another
in your DynamoDB table. Now, we know the basics. We've got our two organization
items in there. Let's add a few user items,
so here we go. We have now five items
in our table. We still have our
two organization items, but we've also added three user items
in our table outlined in red. So, for Berkshire we've got Charlie
Munger and Warren Buffett, and then for Facebook
we've got Sheryl Sandberg. A couple of things
I want to point out here, just look at
the primary key pattern, they're slightly different
on that user item. So, the user sort key
is going to be USER# and then the username
to help you indicate that this is a user item.
Another thing I want to notice is that the attributes
are going to be different between these different
types of entities. So, if we look
at that organization entity, it's got attributes
like organization name, subscription level, things
that matter for that organization. Whereas the user is going to have
attributes like username and role, things that matter for that user. They can be distinct, and that's where that schemaless nature of DynamoDB really helps you, because you don't need to declare all these attributes and sort of enforce a schema on all these different types of items. So, this is the basics of single-table design; it's where people get thrown a little bit.
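Since DynamoDB won't enforce those key templates for you, one common approach is to centralize them in your application code. A minimal sketch, with a made-up table name and illustrative attributes:

```python
import boto3

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

def org_item(org_name, subscription_level):
    """Organization items use ORG#<OrgName> for both PK and SK."""
    return {
        "PK": f"ORG#{org_name.upper()}",
        "SK": f"ORG#{org_name.upper()}",
        "OrgName": org_name,
        "SubscriptionLevel": subscription_level,
    }

def user_item(org_name, username, role):
    """User items share the organization's PK; the SK is USER#<Username>."""
    return {
        "PK": f"ORG#{org_name.upper()}",
        "SK": f"USER#{username.upper()}",
        "Username": username,
        "Role": role,
    }

table.put_item(Item=org_item("Berkshire", "Enterprise"))
table.put_item(Item=user_item("Berkshire", "WarrenBuffett", "Admin"))
```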
So, I want to bring it back down to earth a little bit and talk about
one-to-many relationships. Because I'm guessing
most of you have modelled out one-to-many relationships
in applications before. I want to walk through how you might
do that in DynamoDB, what that looks like,
just to give you a feel for how it's different
than relational database. So, let's talk about
one-to-many relationships. First, what is
a one-to-many relationship? Well, this is anytime you have
some sort of parent entity with a lot of related
sub-entities, or you might have an owner entity
that has all these related entities. So, examples of this might be an office with employees; an office might have multiple
employees that work in that office. In an e-commerce store you might have
a customer with multiple orders. One customer makes many orders, at least you hope, if you're running an e-commerce store. And then in your SaaS application you have the organization
and many users in that organization, like we just talked about
in our example. So, with one-to-many relationship there's a key problem
you need to solve, and that's usually, how do I get
information about my parent item when I'm fetching my related items? If I'm fetching an order, how do I also get information about that customer at the same time? If I'm fetching a user, how do I get information
about its organization as well? If you're using
a relational database, you do that by normalizing your data
and using joins at query time. But we already talked about
how Dynamo doesn't have joins. So, in Dynamo you need to do
that a little bit differently, a couple different
strategies here. One is to denormalize your data
which could sound like a dirty word for those of you coming
from a relational database. Second strategy
is to pre-join your data. So, we're going to look
through three different what I call strategies,
one-to-many relationship strategies. These are just different approaches
to modelling out these one-to-many relationships,
depending on your needs. So, let's get started
with the first strategy. First strategy is
a denormalization strategy, denormalization,
plus a complex attribute. So, I want to show this
by way of an example, when we do this let's go back
to that SaaS application that we've already used, where we have organizations,
and we have users. We're going to add
one more entity type in here. We're going to say: "when an organization signs up
for that SaaS application, they have to pay for it,
choose a subscription." It will allow them to register
multiple payment methods, and this allows you know
at the end of the month if one of those payment methods fail,
you can go to that back up one, and they don't lose access
to your service. So, one-to-many payment methods,
on this application. So, how would we model
that in DynamoDB? What we're going to do here, I've just added
on these organization items, this payment method attribute.
We've got that outlined in red there; notice how that payment method is a complex attribute. It's this map, and it's got
a default payment method and a backup payment method, and each method has some rich
information like that type, and number, and all sorts of stuff.
So, it's this complex object which is something you would never do
in a relational database, because one of the principles of normalization is to break each column value
down to an atomic value. So, you wouldn't have a complex
attribute like you do here, but we're going to do
that here in DynamoDB. This is a good strategy
when two things are true. Number one, you don't have
an access pattern on that related item directly. So, in this case,
you know that related item is the payment method, we're never going to say "hey,
I have this credit card number. Can you go find which organization
it belongs to?" You are never going to have
that access pattern. The only access pattern
we're going to have around payment methods is,
"hey, it's the end of the month. We need to charge this organization,
go look up the organization, find all the payment methods
related to it and work through
until one of them succeeds." So, we don't have an access pattern
on that related item directly. You also want to make sure you have a limited number
of related items. This is because DynamoDB
has an item size limit of 400 KB. You don't want to exceed that by putting too much information
on that parent item. Just to go back to one of those
one-to-many relationships we showed earlier, imagine you had an e-commerce store
where a customer makes orders, you could sort of denormalize this
with the complex attribute, and put every order
on that customer item itself. But then, at some point
that customer item would get so big that you couldn't add more information to it. If they want to make
their 30th or 40th order, you'd have to say sorry,
we can't take your order because we modelled
their data incorrectly, which is not something
you want to do. So, you know, in this case
it's very reasonable to limit the number of related items.
You're not going to allow them to have 10,000 different payment
methods, you can say: "you can have two, or three,
or five, but not 10,000." So, that works in this situation. So, that's our first strategy,
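Here's a sketch of writing that organization item with its complex attribute. The names are illustrative, and the point is that the whole item must stay under that 400 KB limit:

```python
import boto3

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

# The payment methods live in a complex map attribute on the parent item.
# That's fine here because we never look an organization up *by* payment
# method, and we cap how many methods can be registered.
table.put_item(
    Item={
        "PK": "ORG#BERKSHIRE",
        "SK": "ORG#BERKSHIRE",
        "OrgName": "Berkshire Hathaway",
        "SubscriptionLevel": "Enterprise",
        "PaymentMethods": {
            "Default": {"Type": "CreditCard", "Last4": "1111"},
            "Backup": {"Type": "ACH", "Last4": "2222"},
        },
    }
)
```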
So, that's our first strategy; let's go into the second strategy. This is another
denormalization strategy. This is denormalization
plus duplication. So, let's look
at what it looks like here. Here, we have a one-to-many
relationship with authors and books. An author might write multiple books,
so this could be for a library, for a bookstore,
or something like that. If you recall going back
to the beginning of this section, we talked about how the key problem
with one-to-many relationships is, how do I get information
about that parent item, when I'm fetching this related item? So, if I look up a book
by Stephen King, or a book by JK Rowling, how do I get information about
that parent at the same time? What we've done here is, we've just duplicated some of
that data onto the item itself. So, you can see outlined
in red there, on each of these book items I've just copied the author's birth date onto those items. This sort of duplication, again, is something you would never do
in a relational database. When you learn about normalization,
don't repeat yourself. But I think it's helpful
to think about why denormalization is helpful here,
and why do you want to do that? The big problem
is around data integrity. If you have the same piece of data represented across a lot of different rows in your relational database, and you ever need to update that piece of data, you need to run all over your database finding all those pieces of information and updating them, so that you don't have data integrity issues. So, that's the big thing we're fighting against there. So, when you use this, you want to make sure you don't run into that issue. This is good in one
of the two following cases; number one, if that
duplicated data is immutable. In this case Stephen King's birthday
is not going to change, it's the same today
as it was yesterday, as it will be tomorrow,
as it will be 2 years from now. So, you don't need to worry about
that data changing; it's immutable, so you don't have to worry about
data integrity issues as much. Even if your data is not immutable, this could work if the data
doesn't change often or if it's not replicated much. If it's only replicated across two or three items, maybe it's fine to go out and search out those items and update them as needed. Where you're going
to get into trouble is when this data
can change fairly frequently, maybe daily, or hourly,
something like that, and it's replicated across
hundreds or thousands of items, because now, you're going to be
hopping all over your database, updating all those items, and maybe having data integrity issues. So, think about that when you're using denormalization and duplication.
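Concretely, a book item carrying that duplicated, immutable data might be written like this (a sketch; the table and attribute names are made up):

```python
import boto3

table = boto3.resource("dynamodb").Table("Books")  # hypothetical table name

# The author's birth date is copied onto each book item, so fetching a
# book also returns that parent information. That's safe here because a
# birth date is immutable: we'll never have to chase down every copy.
table.put_item(
    Item={
        "Author": "Stephen King",
        "BookTitle": "It",
        "AuthorBirthDate": "1947-09-21",
        "Year": 1986,
    }
)
```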
So, those first two strategies, both denormalization strategies,
can work in certain situations, but sometimes they won't work. That's where we might fall back
to this third strategy, which is a composite
primary key and a query. So, we talked about
composite primary key earlier, we're going to show how we model
that data in our SaaS application to really work with this
one-to-many relationship. So, going back to that SaaS
application we have our five items, we have two organizations,
and three users in there. One thing you'll notice
is that three of those items have the same partition key.
You can see ORG#BERKSHIRE, we see a couple
different items in there. There's a query operation
in DynamoDB that makes it very efficient
to get all items with the same partition key. So, we could very quickly
get all those items, and the reason
that's so efficient, I want to go back to that chart
we talked about earlier, with those different partitions and how the very
first thing DynamoDB is doing, when you are sending a request,
is checking that partition key and figuring out which partition it needs to go to. Because of that,
it's very efficient to read items out
of a single partition. So, all items with
a single partition key, that's going to be called
an item collection, that's going to be very important
as you're modelling with DynamoDB to sort of build
the right item collections to handle your access patterns. One thing I want to call out here is, notice that we have different entity types in this single item collection. We have an organization item, and we also have our two user items in this single item collection.
The reason we might do this is, imagine you had
an access pattern and said: "hey, list all members
for an organization." When you're doing that,
you need to fetch all the members, but you also need to fetch
that organization item, so you can enrich those members
with some of that information. By putting them
in the same item collection, co-locating them together like that, it's a single DynamoDB query, a very efficient read, that's going to scale as you get up to terabytes and terabytes of data; it's going to work really well. So, with that composite
primary key and query, what we're doing is,
we're joining our data, but we're pre-joining it
into an item collection. We're joining it at write time rather than at read time like you might in a relational database.
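A sketch of that pre-joined read, assuming the single-table layout from earlier: one Query returns the organization item and its users together, and the application tells them apart by the sort key pattern:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SaaSApp")  # hypothetical table name

# One Query fetches the whole item collection: the organization item
# plus all of its user items, pre-joined at write time.
items = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#BERKSHIRE")
)["Items"]

org = next(i for i in items if i["SK"].startswith("ORG#"))
users = [i for i in items if i["SK"].startswith("USER#")]
```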
So, those are the three one-to-many relationship strategies that I use a lot: denormalization when you can, and then that composite primary key and query. There are a few other ones
in the book as well if you want to check that out. So, just to summarize
what we talked about here, we went through
three main categories. First of all, we started off
with DynamoDB basics, and just talked about vocabulary,
terminology, key concepts just to get us
on the same foundation, so we know what we're talking about with DynamoDB. Then we looked into SQL
versus NoSQL. For those of you coming
from a relational database, what is
NoSQL trying to achieve? What are the design decisions it made to achieve those goals? Then what are the implications for you modelling that data? Finally, to bring it
back down to earth, we talked about
one-to-many relationships and what that looks like in DynamoDB, just to give you something
more concrete there. Again, this is a two-part talk; so, keep watching Data Modelling
with DynamoDB Part 2, we're going to do more
Data Modelling Strategies, we're going to talk about
filtering and sorting and all sorts of fun stuff.
So, make sure you check that out. Then, also later on, Rick Houlihan,
who introduced me at the beginning, does his Advanced Design Patterns with DynamoDB, which is also a two-part session. Those are among the best sessions every year, the most watched on YouTube, and hard sessions to get into. So, make sure you watch those as well; they're great stuff. Once again, thank you. I'm Alex DeBrie,
thanks for watching this talk.