Migrating from relational databases to DynamoDB with Rick Houlihan

Captions
Hey everybody, thanks for joining me again this week for another session of office hours on DynamoDB. I'm Rick Houlihan, as hopefully most of you know by now. Today we're here to talk about a subject that's near and dear to my heart, which is moving from a relational database and that kind of mindset. How do you address the issues? What are the common problems people run into? What are some of the lessons we learned at Amazon? A lot of times when I talk to people about NoSQL and relational database modeling, people lose track of the fact that all applications use data pretty much the same way, and we start to make general assumptions about how you can do things. Very few people have actually tried to solve these problems at scale, which is where it matters, and the people who have, well, we've figured some things out. That's a lot of what we're going to talk about today.

The first thing I want to do is walk through a little content to level set. Some of you might find this content familiar, others maybe not, but the idea is to get at what we're trying to do when we model data in NoSQL, and what the biggest differences are between modeling for NoSQL databases and modeling for relational databases. I've given this content before, so some of you may have seen it, but it's worthwhile hanging out, because I'm going to talk to some of those lessons learned at Amazon. It really comes down to data modeling for DynamoDB, and for NoSQL in general. There seems to be a common concept out there that NoSQL databases are about key-value access, or that DynamoDB is a key-value database. I hear people say this all the time, and really nothing could be further from the truth. Most applications don't just use key-value access patterns. That's a big one, a very valuable one, used in a large percentage of applications, but it's certainly not the only way to work with your data. In many ways, if you build your data as a key-value store or an object store of big, giant hierarchical objects and large documents, you're actually going to end up hurting yourself, and we'll talk a little bit about why.

I'm not going to get into much of an overview of DynamoDB; we're going to jump right into data modeling and what it really means for relational databases versus NoSQL. The data we're working with is always relational. It doesn't matter what type of application we're building: we're going to have some sort of relational entity structure that we're modeling and need to take care of. There's a perception evolving out there that relational databases look at data in bits while NoSQL databases look at data as a whole, when the reality is that your applications need pieces of these hierarchies, pieces of these relational data structures.
These entities exist no matter what type of application you're building. Social networking, IT monitoring, document management, it doesn't matter: we're building a repository of data that's collected, aggregated, and retrieved in bits and pieces, and that's really the most efficient way to work with the data. It's not very efficient to work with data in big, large blobs. If you think about why, it's because the application usually needs only part of the hierarchy. Maybe I need to get one item from an order, and there could be hundreds of other items on that order. Putting all those items on the same document might not be great, because I have to read all of the items to get the one. There are lots of situations like that; we ran into problems like that a lot at Amazon.

In the past we've used the relational structure, that standardized third normal form view: a one-to-one relationship between the products table and the top-level product tables (books, albums, videos), one-to-many between albums and tracks, many-to-many between videos and actors. If you look at how we might want to query that, which we will in a few minutes, it can be complex to get a full list of products. There are many joins that have to occur, and those are costly, especially when you start to look at things at scale. If we take that key-value approach and collapse the rows of those tables into hierarchical objects, or views of the data, we get a simpler pattern: if I need all the products for a given catalog, I can just scan the collection and retrieve them. I don't have to issue multiple queries, and I don't have to set the CPU on fire joining across large tables.

The key here is that we're going to create a table, that table is going to have a large number of items, those items are going to have lots of attributes, and they're really going to be the rows from all those tables. This is how we want to look at architecting NoSQL: how does the application actually work with the data? If your application works with a large object, then by all means store the data as a large object. But if your application works with smaller objects, then store the data as smaller objects and group them on meaningful dimensions, because that's how we're actually going to use the database.

In DynamoDB, and in all NoSQL databases, every object we insert into the table has to have a unique attribute that identifies it: the unique key. If you define a partition-key-only table, that's the key-value pattern. This exists in a document database, in a wide column store like DynamoDB, anywhere you've got a NoSQL database. It really says: if this object has this ID, give me the attributes associated with it. This is where I might create some big hierarchy of data representing all the rows from those tables, and that's a very common way to try to model data in NoSQL.
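To make that concrete, here's a minimal sketch of the key-value pattern in Python with boto3. The table name, key name, and ID format are illustrative, not from the talk:

```python
import boto3

# Partition-key-only table: the pure key-value pattern.
ddb = boto3.resource("dynamodb")
table = ddb.Table("ProductCatalog")  # hypothetical table name

# "If this object has this ID, give me its attributes."
resp = table.get_item(Key={"pk": "PRODUCT#123"})
item = resp.get("Item")
```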
Sometimes it works, sometimes it doesn't. Where it really gets interesting, though, is when you start to look at how you can apply indexes in NoSQL, since I'm putting all the objects on the same table. In DynamoDB, the table itself is really the first index, one that you manually maintain: I create partitions, and when I add the sort key I can uniquely identify items within those partitions and enable richer query capabilities. The partition key might be a customer ID; inside that customer's partition there could be many objects sorted on a date range: orders, payments, invoices, you name it. I'm pushing a lot of objects in there that are associated with the customer, and then I can execute complex query operations against those range keys, or sort keys, to retrieve the subsets of data that are interesting. Then I'm going to decorate these objects with additional partition keys and sort keys for other indexes, to regroup them. This is where it gets really interesting, because essentially what I'm doing now is joining: I'm collecting those objects. When people say NoSQL databases don't support joins, that's correct, we don't. But we don't really need to, because if we're storing all the rows on the same table, the index becomes the join, more or less. When I query an index on the attributes decorated across objects that are related to each other, guess what happens? I get back a collection of rows that are related to each other, and that's what happens when you join across tables on a particular dimension. We'll demonstrate that when we get into the actual modeling: I took the MySQL sample employee database, which has a few million rows, four or five million across many tables, and we're going to migrate all of it into DynamoDB. I didn't get a chance to define dozens of access patterns, but I've got four or five we can look at and compare query by query, and we'll see what the data looks like when we model it this way.

That's what we're trying to do: create groups of objects, grouped on dimensions either on the primary table or on the index, in patterns that are interesting to the application. When we break down the application's access patterns, that's really the question: what are the bits of information out of this entity relationship that I'm querying against this data model? What are the bits and pieces my application needs? It gets really important when you look at scale, because if you're reading and writing data you don't need, that's wasted bandwidth, wasted processing time, wasted data across the network, and it gets extremely expensive. I've seen customers who think they're doing themselves favors by collapsing their relational models into big documents, and in the end they're just biting themselves in the butt, because accessing those objects is killing them. All they're really doing is incrementing counters every now and then, and these documents might be megabytes in size.
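Here's what "the index becomes the join" looks like as a single query, again sketched with boto3; the table name, key names, customer ID, and date range are assumptions for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("CustomerData")  # hypothetical

# One partition holds the customer's orders, payments, and invoices,
# sorted on date -- so one query returns the related "rows" together,
# no join required.
resp = table.query(
    KeyConditionExpression=(
        Key("pk").eq("CUSTOMER#1234")
        & Key("sk").between("2020-01-01", "2020-03-31")
    )
)
for item in resp["Items"]:
    print(item["sk"], item.get("type"))
```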
Think about what that costs you. If your high-velocity workload is an update pattern and all you're doing is updating counters, you should be moving four to eight bytes of data across the wire. Instead I'm moving megabytes. You can burn your storage arrays, you can burn your CPU. The key thing about NoSQL databases, the thing you need to get your head around, is the idea of indexes: indexes create groups of objects that are interesting to your application. In DynamoDB it's one of the most valuable features. There's an actual contract between the GSI and the table as far as replication: if the GSI starts to lag behind the table, it can't lag too far; 30 seconds is the maximum buffer allowed between the table and the GSI, and a GSI lagging that much would be incredibly poorly provisioned and poorly designed. We'll talk a little about that: sometimes you can generate hot keys on the index that cause it to lag behind the table. If my indexes are properly configured and my data model is correct, my GSI replication lag should be sub-10 milliseconds. That's actually what I'll get into next week, I think: a deep dive on GSI replication lag, sharding patterns, how to select the correct shard keys to guarantee an even distribution of traffic across your table space, things like that, because as I went through this exercise I found some issues I hadn't actually seen before and had to mitigate.

Anyway, the idea is that you want to create groups of items that are relevant to your application, and when you access those items on these indexes, you only want to get the data you need. If you're smart, you can use indexes to project a subset of the item attributes to satisfy a specific pattern. If I have large items, one way to support access patterns that only need bits and pieces of them is to create indexes with projections that don't carry the big data. If I only need the big data when I fetch the item itself, fine: I have a key-value pattern to get the big blob, and I have another pattern that says "get me all the blobs that contain X," and for that second access pattern a secondary index might be good, because you don't want to fetch all those blobs, you only want pointers to them. That's what projections are for. On an index in DynamoDB, you define the partition key and sort key attributes, and then also what gets sent to the index. You can send the keys only, which says "give me a pure pointer to the items on the table that match this condition," in which case when somebody selects one, you go get that item. Or you can include some subset of the attributes from all the items that match the partition and sort conditions, in which case you pay less storage, and you pay less WCU when you update the table, because the index write isn't carrying the entire item. Or you could say "when I query the index I need the whole object anyway, project it all." That's up to you.
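Those projection choices are declared when you define the GSI. A sketch with made-up table, index, and attribute names; the three ProjectionType values are the real DynamoDB options:

```python
import boto3

client = boto3.client("dynamodb")

# Hypothetical table with one GSI that projects a pointer plus a summary.
client.create_table(
    TableName="BlobStore",
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "pk1", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[{
        "IndexName": "GSI1",
        "KeySchema": [{"AttributeName": "pk1", "KeyType": "HASH"}],
        # KEYS_ONLY = pure pointer; INCLUDE = pointer plus named attributes;
        # ALL = the whole item replicated into the index.
        "Projection": {
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": ["title", "summary"],
        },
    }],
)
```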
It's a good way to dial the amount of resources you're burning up or down based on your access patterns. The key concept in all of this is something we call partition overloading, and we'll show how it works in a relational context in a few minutes. The idea is that you create partitions, the partition gets loaded with items, and those items have lots of different sort key types, with conditions I can execute to get subsets of them. In this example we've got a customer partition: customers make orders, orders have items, and all of this is date-ranged except for the customer's record. When I send down a query that says "get me everything for the customer," in this case (it's an older chart) a last-30-days type query, it returns all the orders and order items, and it also returns the customer's metadata. If this were a dashboard login, say: the customer logs in, what's my current view, what's my activity stream, and I want to populate some top-level KPIs, then this would be a really good query. This would maybe be my primary partition on the table, because every time someone logs in, that's what they do: query my PK where the SK is greater than, say, 90 days ago, and I get everything needed to load my dashboard. You can have top-level metrics on the customer's item and different types of items in the partition. Anything whose sort key starts with an ASCII letter sorts larger than a date string, so this date-range query, which is really a text-based comparison, also returns anything with a text-based sort key. These are the types of tricks we use to create faceted sorts, begins-with queries, "everything on a given day" queries, and so on, inside those partitions. That's the idea: we create groups of objects, query them on the sort key attributes, maybe apply some additional filter conditions, and in general produce the same types of results, as we'll see in a few minutes, that you would get from your relational database.
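A sketch of that dashboard query, relying on the fact that sort keys starting with an ASCII letter compare greater than any ISO date string; table, key names, and IDs are illustrative:

```python
from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("CustomerData")  # hypothetical

# Last 90 days of orders/items PLUS the customer metadata item, because a
# sort key like "CUSTOMER#..." sorts above any "2020-..." date string.
since = (datetime.utcnow() - timedelta(days=90)).strftime("%Y-%m-%d")
resp = table.query(
    KeyConditionExpression=Key("pk").eq("CUSTOMER#1234") & Key("sk").gt(since)
)
```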
All right, so let's look at how we join data in a relational database and what those joins cost, because this is where the efficiency of the NoSQL database at scale really starts to shine. In the product catalog example we talked about, products joining to the books table is a one-to-one join; that's not very expensive. But the time complexity increases as more tables are brought in. I see a lot of blog posts saying, look, relational databases scale: I can join these two simple tables with a one-to-many relationship. Fine. Now join across multiple tables, grouped on dimensions from multiple tables, and try to give me cohesive lists of data. I'll show you use cases where that breaks down really fast, because the time complexity as you add tables is unavoidable. These are multiple index scans: you have to iterate through the result set from the outer tables to join the inner tables, and these are all nested-loop joins, assuming the indexes are correct and sorted on the join dimensions so the queries execute efficiently. If you don't have an optimized index structure across those tables, which is something you'd want to fix if you found it, it can get really expensive. And even fully optimized, as you increase the number of tables, and some relational models get pretty extensive, it gets expensive.

When you model those joins in NoSQL, you essentially create partitions and load them with objects. In this particular example we take all of those rows, the exact same data, and convert it into a single table: a partition key and a sort key on each row, sorting them into buckets, with query conditions to retrieve collections of items. Get the book where the title equals X; get the album by title; get the movie by title. In this case, movies and actors together on the same table kind of form a graph: there are movie nodes, actor nodes, and inside the movie partition, edges sorted on actor name that define how each actor relates to the movie: what role did he play? I've done a little denormalization there, because when somebody pulls up a movie summary, think IMDb, it shows who directed it, some summary information about the movie, the actors in the primary roles, and a little summary information about those actors. So I project into those edges the information that's relatively immutable: it doesn't change, and it accelerates the query, because I don't have to go back to the actor's partition for that data. When I query the movie off the table where the title equals X, I get all of it. So this is a directed graph. I don't project the actor's bio into the edges, because the bio gets updated frequently; if you want the actor's details, you click the actor's name and we go get them. But when I click the actor's name, what do I want to see? What other movies he's been in, and this data structure doesn't show me that. In a directed graph, one node may know about the connection while the other doesn't, and people do want the full many-to-many relationship. That's where we look at how we do it in DynamoDB: you swap the PK and the SK, you flip them around. It doesn't strictly have to be the partition key and sort key; in some use cases I might have a source ID and a target ID and just create a GSI on target ID and source ID. But in this case, flipping the PK and SK on the GSI shows us the other half of the many-to-many relationship, and it enables a bunch of access patterns: query by author name to get all the books for an author; query by song title to get all the albums and the details of the track recordings on each; query by actor's name to see the actor's bio, all his details, and the summary information from the movies he's been in, what roles he played, because those edges now come back on the GSI; query by director's name; query by artist to get the albums the artist produced; and so on and so forth.
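The flipped-key lookup as a sketch, assuming a GSI named GSI1 whose partition key is the table's sort key and whose sort key is the table's partition key (all names illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ProductCatalog")  # hypothetical

# Table side: pk = movie, sk = actor (edge items live in the movie
# partition). GSI1 flips the keys, so the same edges answer "what
# movies has this actor been in?" -- the other half of the many-to-many.
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("sk").eq("ACTOR#Tom Hanks"),
)
# Returns the actor item plus one edge per movie, each carrying the
# immutable role/summary data denormalized into the edge.
```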
This is how we create data structures that are, I guess you'd say, fluid, or efficient, when you start to query: I don't have to fetch everything associated with the hierarchy to view related data; I get only the pieces I need, the pieces that are actually related to each other. That's really the message we're trying to get across with relational modeling in DynamoDB.

From here, let's have some sandbox fun with the NoSQL Workbench, which is obviously my favorite tool. I use it a lot; on the engagements I go to these days, it's rare that I don't end up pulling it out. If you want to get serious about DynamoDB and about data modeling for NoSQL, I suggest getting familiar with tools like this, to teach you how to manipulate data when it's a collection of objects on one table, because that's really what this is all about. So let's create a sample model for doing some exercises. Oh, and by the way, I always forget to mention this: for people watching on my channel, the DynamoDB guy channel, if you come over to the AWS channel, twitch.tv/aws, we have moderators who are there to answer your questions and can even flag me.

In fact, I notice one now: what's the advantage of DynamoDB over MongoDB? Honestly, both are scalable, performant NoSQL databases, and the data models I'm showing you today apply to MongoDB as well. You'll see a lot of people present MongoDB's rich document model as the huge advantage, but you know what? It's an indexed object store, and that's what all NoSQL databases are. If we put all the objects into one collection, decorate them with attributes, and index them, we can create interesting groupings of objects. That's what we do with MongoDB, that's what we do with DynamoDB, that's what we do with Cassandra; it is really the same process. Where DynamoDB absolutely shines over almost all NoSQL databases is performance at scale. Nothing is built like this. We have tables approaching, and maybe by now past, because I heard this months ago, a hundred petabytes. Who hears about that? Who does that? Nobody, because before they do, they come talk to us. The reality is nobody wants to run these things themselves at scale; it's tremendously difficult and a huge infrastructure investment, and even the largest businesses in the world understand that.
Samsung's Galaxy sync service retired a team of about thirty people that was managing a full-time NOC for their Cassandra cluster of over 550 nodes. They even brought in an actual cloud provider outfit to try to restart that internal service, and they partially did, but then they realized: oh yeah, that's right, that's why we left. Honestly, everybody I know who's running MongoDB at scale is looking to get off it. So if you want to know the biggest advantage of DynamoDB: consumption-based pricing and performance at scale. We have a free tier, and you're likely to spend 10 to 15 percent of what you spend on your MongoDB bill. The biggest advantage is probably your wallet. Okay, sorry, that was me on my soapbox.

Next question: can the solutions architect talk about the current state of the hot partition issue when reading and writing large numbers of items? Yeah, we're actually going to talk about that as we get into moving from relational to non-relational; we're going to walk through some code that shows you how to execute this stuff at scale. What's interesting about the sample database I chose today is that it has five million records, because it holds years of history for a company's employee database. That's data that would normally trickle into the system, but I'm slamming it into the table, so it creates a lot of hot keys on aggregations like "employees by department": there are only about ten departments in the company, but something like 2.8 million salary records. So I've had to do some artificial write sharding that I normally wouldn't do in a database like this, because this data would build up over time, or I might run a longer ETL process for the history and let the key pressure on those indexes settle down. But I'll show you how to make it happen, and the query performance is pretty impressive; generally speaking, it matches what you get from a local MySQL instance. Although I shouldn't say that, because I haven't had a chance to optimize those MySQL queries yet, and I'm sure there's some indexing I could do to accelerate them. In any case, the numbers don't look too bad. We'll talk more about that.

So let's get on to the data modeling. Here I'm just doing some generics: we're going to talk about what it looks like to model certain types of relationships in DynamoDB, one-to-one, one-to-many, many-to-many, and how we accommodate them. I always like to have a type attribute, because it tells me what I'm looking at on my table, and let's go ahead and create a GSI, because I'm sure we'll want one for some of our exercises. Generic table, generic keys, generic GSI: this will be GSI1, with generic key attributes pk1 and sk1. Okay, now we've got a table; let's add some rows.

The first relationship we'll look at is the one-to-one. What would be a good one-to-one? Maybe products and inventory. Say I have a use case that says I need to select inventory locations, bins, by bin ID, and I need to select the inventory bin for a product. Think of a small operation, a small warehouse, where every bin holds one product.
Let's just say that, so it's a one-to-one relationship: we have a product ID, we have a bin ID, and at some point we want to get the bin for a product. So I'll give the product an A record with type "product," and on the index a partition key of the bin ID the product belongs to and a sort key of the product ID. This might be a naive way to do it, and I'll show you in a second why it might not be a great idea, but if I can spell, we can do this. Then the other item is the record for the bin, with the bin ID as its PK and the product ID alongside. So I have three access patterns here: get product by product ID, get bin by bin ID, and get bin for product. Sounds pretty straightforward. If I create this structure on the table and create my objects like this, I can save it and look at my aggregate view: select by product ID, select by bin ID, select bin for product. Oops, what did I do there? I did something wrong; I want this to be the product ID, because I want to get the inventory for a product. I had it backwards. There we go: select product by product ID, select bin by bin ID, and get the inventory for a product off the index.

Now, you could ask: why wouldn't I just go to the table in the first place with my product ID? And as a matter of fact, that's the mistake here: I'm storing the product in two places. If my use cases are get the product by product ID, get the bin by bin ID, and get the bin by product ID, then I don't want to index the product; there's no need. Unless, when I get the bin by product ID, I also want to see the extended product data, in which case maybe I'd denormalize some of the product data into the bin, since it's a one-to-one relationship. If these items are small, I might consider concatenating them. But maybe they're not; maybe the product carries a large amount of data, and the bin has just the inventory information, which is all I really care about when I access the bin. I don't care about the product description or anything else sitting in the product object for that particular workflow. And that's really the key: what is the workflow? One workflow says, hey, I want to populate the website. When I populate the website, I select by my product ID, and I need to see inventory, so I do want both of these objects in the same partition. So instead of putting each object in its own partition, a smarter way, and you could argue with me whether it is or not, because there are always multiple ways to do anything, is to put that bin inside the product's partition on the table, sorted on the bin ID.
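Roughly what that layout looks like when you write it; a sketch with hypothetical table, key, and attribute names. The point is that only the bin item carries the GSI key, so the heavy product item stays off the index:

```python
import boto3

table = boto3.resource("dynamodb").Table("Inventory")  # hypothetical

with table.batch_writer() as batch:
    # Product "A record": heavy attributes, lives only on the table.
    batch.put_item(Item={
        "pk": "PRODUCT#123", "sk": "PRODUCT#123",
        "type": "product", "description": "long marketing copy...",
    })
    # Bin item in the same partition, sorted on bin ID. Only this item
    # carries pk1, so "get bin by bin ID" works off the GSI without
    # replicating the product's data.
    batch.put_item(Item={
        "pk": "PRODUCT#123", "sk": "BIN#A-07",
        "type": "bin", "quantity": 42, "pk1": "BIN#A-07",
    })
```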
Now look at what happens: I don't need to put the product on the GSI at all, I can satisfy all of my workflows, and I've taken half my data or more off my GSI, because the product item is probably where the heavy data is. When I select by product ID on the table, I get the product and the bin for the product, so I can display all of that information on the product page. For my fulfillment center or warehouse workflow, I really just want the bin by ID, and I can get that, because there's some maintenance workflow that needs it. Those were the access patterns, remember: get the bin by product ID, get the product by product ID, and get the bin by bin ID. In this schema I've taken maybe 75 percent of the write load off the system. If I'd done it the other way, storing these items separately on the table and aggregating them on the GSI, I'd have satisfied all three workflows, but less efficiently.

This is part of the lessons we learned at AWS: as we went through these detailed workflows at scale, what you really start to notice is, hey, I can trim the fat. And you can trim a lot of fat by reducing the objects to what the workflow needs, and targeting the objects you retrieve to the workflow that actually uses them. Don't get data you don't need; don't store data you don't need; make sure you're exercising the system as efficiently as possible. This really is no different from the exercise we used to go through with DBAs to optimize our cruddy implementations as developers. Sooner or later you scale that relational database and go, ouch, it hurts. I remember when I was at a company called Xenos, we had all kinds of issues with our MySQL database implementations: they were falling over, they weren't scaling. It was back in the early days of my NoSQL career, and we started a project which ended up replacing the MySQL back end, for the right reasons, but we also brought in some DBAs who optimized the system and gave us a lot of runway on the existing system, just by restructuring our queries to be more performant. There are lots of ways to scale any system. A relational database is not a magic wand; as a developer you can do terrible things with SQL, believe me, I've done it all, and I've done it all with NoSQL too.
This is why, when you start talking to customers about their implementations, it's really about driving efficiency: where are you spending your dollars? I don't want you spending dollars storing data you don't need, reading objects that are too big, burning WCUs you don't need to burn. I want you to get exactly what you need, when you need it, and that's what you do when you get down into the data model.

All right, let's talk about some of the other constructs: one-to-many and many-to-many. It's easy to imagine that products have multiple bins, not just one, and that bins hold multiple products. How would we model that? Actually, let's do both at the same time. Let's look at how we might change this data structure if bins held multiple products: the partition key is the bin ID, and the sort key within the bin is the product ID; then we might have another product inside that bin, product ID 1. This is a bin, and since we've got "select bin by bin ID" on the table, what I need on the GSI is "select bin by product ID," and it would be nice to know the bin's location. So the location might be my index sort key, which is more relevant to a warehousing workflow: give me all the bins that hold a product, and where they are, so I can generate a pick list. That's probably a more relevant key structure for our bins in the warehouse. We'd add another product, product ID 1, with an A record of type product. And do we want those product items on the index? Yes, we need them on the index now, because for populating my product view I would potentially need my inventory data. Maybe that's the question for the customer: when you select the product, do you need to see the inventory? If you don't, we keep those items off the GSI; if you do, we put them on. It all depends on what the query looks like, and that's one of the things we'll do in a second when I get into the actual implementation with a relational database and some real data.

Let's look at what we've got going on now. I've got a bin ID with many products, and potentially many bins with many products, and you can see how you can just extend this model. When I go to the GSI and query by product ID 1, I get the product information and the locations where there's a bin, with quantities, which is the information I'd need for the workflow of figuring out the best locations to add to a pick list as the worker moves through the warehouse. The next step is the many-to-many, and that's simple enough: products can exist in many bins, so we add bin ID 1, and bin ID 1 has some product ID in it; this is again a bin, with a location. And if we save that, okay.
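Boiled down, the many-to-many becomes one "edge" item per product/bin pairing. This hypothetical item shape (my attribute names, not the Workbench demo's) shows both sides at once:

```python
# One edge item per product/bin pairing. The table groups products under
# the bin; GSI1 flips the keys and sorts on location, which serves the
# warehouse pick-list workflow.
edge = {
    "pk": "BIN#A-07",        # table: all products in this bin
    "sk": "PRODUCT#123",
    "pk1": "PRODUCT#123",    # GSI1: all bins holding this product...
    "sk1": "LOC#aisle-4",    # ...sorted by location for pick lists
    "quantity": 12,
    "type": "bin",
}
```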
So if we look at our view here: the bin has many products, and if we look at the other side of the relationship, this product now has multiple bins associated with it. That's how we go about making this happen.

A question from the chat: is it good practice to have likes as an attribute on the product row and comments as maps? Again, it depends on the size of the objects. This is basically the forum-thread example, and in a high-volume scenario, maybe something like the AWS support forums, probably not: a lot of those threads get long and the content gets detailed, so we'd want to store them as hierarchies of items within partitions. Create a forum partition: the partition key could be a leading key of the forum ID concatenated with the topic, and the sort key could be the subject of a given thread. Underneath the subject hierarchy you could have subject#response1, subject#response2, and even deeper, subject#response1#comment1, subject#response1#comment2, and so on. You can generate a hierarchy of items inside a partition for a subject in a forum that's easily selected using a begins-with sort key condition: give me everything in this forum, give me everything under this response, and so on. You could even create that hierarchy on an index and prefix those sort keys with some sort of date string, so people can look at the new items first; you always want to see what's new, the things you haven't seen yet.
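A sketch of peeling one level of that hierarchy with a begins-with sort key condition; the forum name, key names, and ID formats are invented for the example:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Forums")  # hypothetical

# Sort key encodes the tree:
#   "SUBJECT#how-do-i-shard"                         the thread itself
#   "SUBJECT#how-do-i-shard#RESP#0001"               a response
#   "SUBJECT#how-do-i-shard#RESP#0001#COMMENT#0001"  a comment on it
# begins_with() selects any subtree in one query.
resp = table.query(
    KeyConditionExpression=(
        Key("pk").eq("FORUM#dynamodb")
        & Key("sk").begins_with("SUBJECT#how-do-i-shard#RESP#0001")
    )
)
```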
Anyway, we're drifting off topic; I want to make sure we get to the subject I was actually preparing for, so let me go there now. It's great to talk about these synthetic use cases, and it's nice talking about modeling data, but it's actually good to see it work. So, sample data: I thought I was going to build a database myself, but then I figured there had to be something out there, and sure enough, there's an employee sample database. MySQL, Oracle, you name it, they publish sample databases you can build out and deploy, and I deployed this one. It has four or five million records across its tables; it's an employee history database that tracks which departments people worked for, what roles they had, what their salary history was, and so on. It's a relatively complex database, and it's nicely representative of the types of services we had running in the thousands at Amazon: little services with relatively complex data structures across entities and really quirky little access patterns. Give me the employee's history: I want his entire history, every role he had, his salary history, all the departments he's been in, the titles he's held, who his managers were. Some of these queries are actually pretty complex from a relational standpoint, and when you look at how we model them in NoSQL databases, it's pretty interesting, because it makes them pretty simple. For workflows where there's no easy join path to get all these tables date-sorted into one list, you almost always have to run some sort of stored procedure or multiple selects; generating time-ordered lists across multiple tables is not something you can do easily. But in NoSQL databases, funnily enough, you can, relatively easily, and we'll show you some of that.

Let's flip over to the access patterns we've defined against this employee store. Again, this is sample data from the MySQL site; you can download it yourself. The one thing I'd tell you is that the data has some weird quirks; it's not totally clean, and there will be some discrepancies between the results you see. I'll explain why: there are some data-cleanliness issues in the collection I had to account for. I didn't have time to do dozens of access patterns, so I came up with four. One: get all current employees for a department. The data has a convention where a record with a to-date of 9999-01-01 is the current one: the current title, the current role, the current pay; if you break down the data model, anything with that open-ended to-date means "this is what I am now." As we migrate the data, we'll use these as keys on the table. Two: get the current employee information, the current title, current salary, current role, and current metadata; join across all those tables and bring back everything matching his current state. Three: get the current managers for a given department. Four: get employees by title for a given department: give me all the senior engineers for this department, and so on.

I wrote some pretty simple code, which I'll walk through, to do this work. Configuring the client is pretty straightforward if you've watched my sessions in the past. The biggest message, again, is the connection timeout: don't run with the defaults, like 60 seconds. That's an HTTP connect timeout, so if your SYN packet gets lost in the network, that process dies for 60 seconds. Don't do that. If your Node process or Lambda function is running in the same region as the table you're hitting, you should easily be able to run at 500 milliseconds; that's what I do, and I know SAs who run it lower. Initialize your connections, set the target table, and connect to the MySQL instance.
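The demo code itself is Java, but here's a hedged Python/boto3 analogue of that client tuning; the 500 ms figure is from the talk, the rest is assumption:

```python
import boto3
from botocore.config import Config

# Fail fast on a lost SYN instead of stalling the worker for the
# default timeout, and let the SDK retry.
cfg = Config(
    connect_timeout=0.5,   # seconds, per the ~500 ms guidance above
    read_timeout=0.5,
    retries={"max_attempts": 10, "mode": "standard"},
)
ddb = boto3.client("dynamodb", config=cfg)
```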
Now we're going to do some table migration, and this is where all the work actually happens, so let's look at that method. Down here we're migrating a table at a time, and we're transforming the data model a little as we go, because I want to rotate the data structures as we put everything into one table. I'll be generating some keys that help with our queries; again, that's why it's important to understand the access patterns. Honestly, it's not so different from the relational world: the DBA comes in and asks, well, what queries are you running? He doesn't wave a magic wand over the box and say "you're magically optimized." You have to actually do this work, and that's the process here during the migration.

The first thing we grab is the employees table, because some of the data in those employee items gets denormalized across the other items, as you'll see. We load the employees into the table, partitioned on the employee number; every employee gets an A record describing the employee. Same thing with departments: departments have a number and a name, which become the partition and sort key. Interestingly, there are only nine departments in this model for 300,000 employees, so I thought, oh my god, the key pressure this creates is enormous: you've got 2.8 million salary records dumping into 300,000 employees, aggregating up into nine departments. So when we had the question about key pressure: this data load really put a lot of pressure on those keys on the GSIs, and I had to account for it. As you go through the tables it's pretty straightforward. A manager wants to get indexed, so we trap the current item; you'll see this a lot in the migration: if this is a current item, we tag it with some indexed attributes. In this case, if it's a current manager, we index the item on the department number with a prefixed sort key, so we can use a begins-with "M" condition to pull the managers out and satisfy "get managers for department": query the department partition, begins-with M, and we get the managers for that department.
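That manager lookup, sketched with boto3. I'm assuming the generic pk1/sk1 index key names from the Workbench demo and ignoring the write sharding that gets added below:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical

# Current managers carry an index sort key prefixed with "M", so a
# begins-with condition on the department partition returns just them.
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=(
        Key("pk1").eq("DEPT#d005") & Key("sk1").begins_with("M")
    ),
)
```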
Department employees is where the data anomalies really start showing up. I don't know what they did to generate their sample data, but there are a lot of leftover records. At first I thought maybe they were terminated employees, because some employees have no 9999-01-01 records at all, and as I scrubbed through the data, that actually is normal. But there are also employees with two date ranges that collide. Usually that meant someone forgot to update the 9999-01-01 record, but sometimes it didn't, and it just got ugly, so I made some default assumptions. That means not all of the items made it across, and the result counts differ a little; I didn't get a chance to dive into every anomaly. This table wasn't too bad: three or four hundred thousand records, maybe twenty or thirty anomalies. But in the history tables, the titles and the salaries, there were hundreds, even thousands, of bad records, and I didn't want to write code to clean them, so we just accepted it.

As we found the current items, we again decorated them: if this is a current team member, index on the department number and sort on the full name, so we can do "get people with this name by department," whatever. And we write-shard the GSI, because we're bulk loading and creating artificial key pressure: this is not data that would normally slam into the system; employee metadata is slow-changing, but we're fast-loading it. We do the write sharding on the index partition keys. If you look at where I shard the employee load, it goes across 25 partitions: pk1 is our partition key on GSI1, and I'm suffixing the department number with a shard of 0 to 24, because I need the throughput to get this load in in a reasonable amount of time. The next record type is the same kind of thing: capture the current title, annotate it, index it. That's what you'll see a lot in these imports from a relational database: we import tables row by row, morphing them into a data structure that's indexable inside a single table, an indexed object store. Again, I go through the same process with MongoDB; there is no difference. As a matter of fact, you want to keep your items small; that's a little trick people don't really appreciate unless their access pattern is truly key-value. A lot of times you do have that pattern, like a session store, where a blob of information is associated with a user and you load the blob when they log in; no problem. But that's not how most apps work. Most apps need bits and pieces of data, which is why it's useful to structure data this way, and most apps also have common, well-understood access patterns, so we can. So: trap the current records, catch the data anomalies and throw them out.
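The write-sharding pattern itself, as a sketch: a random shard suffix on write to spread nine hot department keys across 25 index partitions, and a scatter-gather on read. The shard count of 25 and the key names follow the description above; everything else is illustrative:

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical
SHARDS = 25

def sharded_pk1(dept_no: str) -> str:
    # On write: append a random shard suffix so one department's bulk
    # load spreads across 25 index partitions instead of overheating one.
    return f"{dept_no}#{random.randint(0, SHARDS - 1)}"

def employees_for_department(dept_no: str) -> list:
    # On read: scatter-gather across all shards and merge the results.
    items = []
    for shard in range(SHARDS):
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("pk1").eq(f"{dept_no}#{shard}"),
        )
        items.extend(resp["Items"])
    return items
```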
Then it's basically just running some queries. We connect to the MySQL database and run the four queries we defined. I expect MySQL to be extremely performant; as a matter of fact, some of these queries execute slower than I'd expect, and I think it's because they're not optimized: the ORDER BY probably needs an index, and a few other things need tuning to get maximum performance. But the MySQL database is running locally on a pretty monstrous box, something like an 8 or 16 XL with 36 cores, and there's no network hop, so I'd expect it to be really performant. We run against MySQL and time those queries, then run against DynamoDB and time those, and show you the results: the exact same four access patterns against the same data.

So let's exercise the system. First, the DynamoDB table, which I built earlier, because I always like to spin these things up at large capacity numbers and then immediately switch to on-demand; that pre-warms the table, so to speak. Both my table and my index were created at 150,000 WCUs; I'm not doing a lot of reading off these tables, so I only ran about a thousand RCUs. They have something like 150 to 160 partitions each, which gives me the ability to throw a lot of data at them, especially since I'm using that write-sharding pattern to spread data across the GSI. And there should be no items on these tables yet. Okay, let me log into my system. I uploaded the jar earlier and the code is up there, so I'm basically going to execute it and it should load the tables. Is there anything else I need to do? I always say that, and then it throws an exception. I don't think so; I think we're good.

So that's grabbing the employees table, and it's updating the other ones. You see only nine items in departments; that's where the real problem comes from, because the other tables have hundreds of thousands of items, and salaries is the big one, with close to three million. You can see the performance is pretty good: 2.7 seconds for 300,000 items, so I'm hitting that 100,000-to-150,000 WCU range. The salaries table will take about 19 or 20 seconds with its couple million items; there you go, 19.5 seconds. So we imported something like 3.9 million items into the table, and we can take a quick look at what they look like.

The one query I didn't get a chance to code up is the one where DynamoDB really shines, but I'll run it from the console: the employee's full history. Let's get everything for employee 88958 and see what happens. There it is: this guy's entire history with one query, one round trip. I didn't have to touch multiple tables. I've got his salary history for the entire time he's worked at the company, every title he ever had, every department he ever worked in; his entire picture, the single view of the customer, so to speak. I'm not a SQL guru, but joining across all these tables, trying to group items together with the differing dates on all these records, there's just no way to do it gracefully. I tried. Maybe somebody who knows more SQL can tell me where I'm wrong, but from my perspective there's no way to do this without multiple queries in a relational database. I think this is an access pattern that's kind of unique to the way we're indexing objects: since they're all indexed in the same table inside a NoSQL database, it's very easy to sort them on the date dimension and get that single view of the customer, everything that happened to him through his entire history with the organization.
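That console query, expressed in code; a sketch with hypothetical table and key names:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Employees")  # hypothetical

# One round trip returns the employee's A record plus every salary,
# title, and department item, date-sorted, because they all share the
# employee-number partition.
resp = table.query(KeyConditionExpression=Key("pk").eq("EMP#88958"))
for item in resp["Items"]:
    print(item["sk"], item.get("type"))
```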
That's a very powerful view, and it was a query pattern I didn't get a chance to execute earlier. So if we look at the indexes and how they're laid out, let me do an index scan: the index is sorted on PK1, and basically I've spread all of the current employees across 25 shard keys, because we have hundreds of thousands of employees across only nine departments. With so few departments, and with 300,000 employees being bulk loaded, I had to do some write sharding across those keys to make sure we didn't overheat any one partition. The items are also keyed on title, because there's another workflow that says "give me all the senior engineers in a given department." So let's see what it looks like when we actually run those query patterns.

Am I not displaying my screen? Oops, I am not; I'm sorry, I got better at that and then I didn't. The more interesting pattern is when we look at the table and run a query for employee 88958. This is what I was talking about earlier that you didn't get to see, which I apologize for: with one query I was able to reach across what would be multiple tables and grab all of the information from salaries, all the information from titles, and all the department information for a given employee, and get his entire life cycle. I can't do that so easily in the relational world.

OK, I just got a "reconnection successful" message, so hopefully the stream just came back online; we might have lost you for a minute there. Can I get a verification from my moderators? It said reconnection successful; are we still streaming? Good, there we go. Awesome. Sorry I lost you guys for a second.

We've still got my console up with the queries we just ran. Looking at the output from the MySQL queries first: the first was to get the employee's current information across all of his title data and whatnot, and that took 208 milliseconds and retrieved 61,386 items. This next one was all the current employees in a given department; I would expect it to be a little faster in MySQL, and I think it's slow because the index is missing on the ORDER BY. These other queries were very fast, as I would expect: low single-digit milliseconds. The last query is where we start to get into some complexity: we're joining across four or five tables, and they're indexed as far as I can tell, so that's definitely starting to show what happens when you increase the number of tables in your joins, especially when you're joining into a very large table.

Executing in DynamoDB, it's the same kind of thing. The numbers don't line up exactly; we have that dirty-data problem, because I created more records when I bumped those employees into their current roles and whatnot. We got 62,000 items in 263 milliseconds, compared to 208 milliseconds in MySQL. I think I can optimize this query somewhat, though I don't know if it's going to get down into single-digit milliseconds.
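A rough illustration of the write sharding he describes, again a sketch rather than the demo's actual code: the shard count of 25 matches the talk, but the GSI name `GSI1`, its key attribute `GSI1PK`, and the item shape are assumptions.

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

SHARDS = 25  # matches the "25 shards" mentioned in the talk
table = boto3.resource("dynamodb").Table("Employees")  # hypothetical name

def put_current_employee(item: dict) -> None:
    # With only nine departments, writing the GSI key as the bare department
    # would funnel a 300K-item bulk load onto a handful of partitions.
    # Appending a random shard suffix spreads those writes out.
    shard = random.randrange(SHARDS)
    item["GSI1PK"] = f"{item['department']}#{shard}"
    table.put_item(Item=item)

def current_employees(department: str) -> list:
    # The trade-off: reads become a scatter-gather across all shard keys.
    items = []
    for shard in range(SHARDS):
        resp = table.query(
            IndexName="GSI1",  # hypothetical index name
            KeyConditionExpression=Key("GSI1PK").eq(f"{department}#{shard}"),
        )
        items.extend(resp["Items"])
    return items
```

The design choice here is throughput over read simplicity: the shard suffix keeps any single GSI partition from overheating during the bulk load, at the cost of fanning out N small queries when reading the department back.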
Anyway, DynamoDB's performance was fairly consistent, which is interesting because I have a 10 millisecond sleep loop in my code: the query that shows as sub-10 milliseconds here is probably really just over 10 milliseconds. And the last query took 121 milliseconds to retrieve 10,000 items. That's pretty consistent performance compared to what you'd see even with a NoSQL database running locally: we can push data into DynamoDB and we can read it quickly, and that's the key takeaway here.

So if you want to think about what matters when migrating relational databases, it's really about your access patterns. It's about understanding how you're using your data, and understanding that the most efficient structure for your data is one that mirrors how your application actually uses it. If anything, that was the biggest lesson we took away from the exercise of moving all of those relational services to NoSQL at Amazon: the only way we made it work was to take this design approach seriously. If you don't, you're just going to end up burning dollars. You may think your performance is fine, and maybe it is, but you're still spending more than you need to spend to support the workload you have deployed. That's the key message to take away. There's no getting away from relational data; it doesn't matter what type of application you're running, you need it, so learn how to model it correctly in NoSQL databases.

That's what I've got for you today. Thank you so much for your time. Since we lost some time, we'll take a few minutes here: if you have any questions, throw them out there and the moderators will let me know. I'll wait a minute to see if we get anything. All right, it looks like you guys are happy; hopefully you learned a lot. It was fun doing it. Next week I'm going to go deep on sharding and partitioning: how to create a good, even spread of partitions across a key space, GSI replication lag and how to avoid it, and what you can do to optimize your workloads at scale. Feel free to submit requests for future subjects; tweet them at me, or at the DynamoDB Twitter account. We want to bring content that you want, and the important thing for us is that you're getting something out of these sessions, so let me know if you want this to continue and what you want to see. All right, thanks a lot, I appreciate it, and I'll talk to you later. [Music]
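On the 10 millisecond sleep loop he mentions above: a minimal sketch of that client-side pacing idea, assuming boto3's batch writer and hypothetical names; his actual Java loader may pace writes differently.

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("Salaries")  # hypothetical name

def paced_load(items, batch_size=25, pause=0.01):
    # batch_writer flushes a BatchWriteItem request once it accumulates
    # 25 items; sleeping ~10 ms at each flush point smooths the write rate
    # into a steady stream instead of an unthrottled burst.
    with table.batch_writer() as writer:
        for i, item in enumerate(items):
            writer.put_item(Item=item)
            if (i + 1) % batch_size == 0:
                time.sleep(pause)
```

The same pacing is why his query timings read a few milliseconds high: the sleep is counted inside the measured loop, so a sub-10 ms query reports as just over 10 ms.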
Info
Channel: Serverless Land
Views: 2,040
Keywords: serverless, dynamodb, aws, nosql, database
Id: EvB7REsf0ic
Length: 62min 56sec (3776 seconds)
Published: Tue Sep 15 2020