O'Reilly Webcast: MongoDB Schema Design: How to Think Non-Relational

Captions
hi, and thank you all for attending. Today we're going to be talking about schema design for MongoDB and, more generally, how to think about modeling data for non-relational data stores. Before we begin I'd like to push a poll to the audience to get an idea of everybody's experience level with MongoDB, so I'm going to push the poll — please feel free to check off the box that indicates how much experience you have. Hopefully this will help me skip over introductory material if you're all really advanced, or focus on more of the basics if most of you are not super familiar with MongoDB. So, thanks for joining again. My name is Jared and I work at 10gen; 10gen is the initiator of the MongoDB project, and I spend most of my time working with the developer community and some of our customers and partners, helping them adopt and use MongoDB to the best of its ability and have successful deployments. We find that when you're modeling data in MongoDB, things work pretty differently than in a relational data store, so it's important to really understand the core data model of MongoDB, and in general to look at how you can take best advantage of NoSQL data stores. Once again, if you want to follow me on Twitter, I'm @forjared — I have lots of entertaining tweets to send you, so please follow me. So what are we going to cover today? We're going to cover a bit of an introduction, looking at how we use documents in the database and how this differs from using a relational database. We're going to talk about how you evolve your schema in a MongoDB data store — MongoDB has a completely dynamic schema, which means you don't need to define what columns exist in a database before you start coding; you can just add them on demand as your application calls for it. We're going to go through queries and indexes and talk about some of the more advanced features of how you can combine multiple nested attributes inside of
documents, and some of the cool capabilities that come from that. Towards the end of the presentation we're also going to go through some common patterns: we find that common object-oriented design patterns and data modeling problems tend to have slightly different solutions in MongoDB, so we'll go through some of those solutions and look at how people are using MongoDB to do this in their applications. So let's look at ways to model data. Data comes in all kinds of different forms, and typically when we think about modeling data we're coming from one of several different backgrounds. We're either coming from a relational background, where we think about data as rows and columns and relationships and key constraints, or — as a lot of us do — from more of an object-oriented background, where we think about classes and attributes and aggregations and associations. And if you talk to a data architect, typically the first thing they're going to talk about is normalization, right? In a relational database everything is very flat — you tend to model things sort of like you would lay them out in a spreadsheet — so in order to get the most mileage out of a relational database we want to normalize our data, to separate all of the different attributes into different tables, and that enables us to query effectively on these different things. However, what we've found over the past few years is that as sites like Twitter and Facebook go to actually scale, people tend to denormalize their data models. The reason for this is that it makes it very difficult to scale your application when you need to use lots of joins and transactions in your database; anybody who's had to try to scale a database beyond the hundreds of millions or billions of rows probably has some experience with the complexity of managing joins and transactions at scale. So when we denormalize our data model, what happens is that instead
of thinking about all the different rows and columns, we tend to think more on an entity level. So for something like modeling a blog engine, rather than separating out all the different things into different tables, we can aggregate like information together into documents. We call this a denormalized data model, and this is basically how we think about modeling data in MongoDB: you tend to think about the high-level entities — the first-class citizens in your application — and how they're going to be represented and stored in the database. First of all, if you're coming from a relational background, there's some terminology to get straight. In an RDBMS you have the notion of a table; MongoDB has the notion of a collection, which is very similar. A table is a set of rows, while in MongoDB a collection is a set of JSON documents. Just like a relational database, MongoDB supports indexes, so we have the same idea of indexes as you do in your RDBMS. And where in an RDBMS you might use a join to collect information from multiple tables, in MongoDB we tend to address this by embedding and linking information: rather than joining data together across multiple tables, we can actually stick that data right inside the parent object it belongs to. That's going to be a lot of the focus of the rest of this talk — how that actually works, what some of the limitations are, and some of the optimizations you can make to make your application highly scalable. So when we think about modeling data in MongoDB, we want to think about a few things. We want to think about how we're going to manipulate the data: what kind of queries are we going to be running over it, what kind of indexes do we need on the data, what kind of updates do we need to be able to perform to objects. MongoDB also includes capabilities for doing things like MapReduce, which you might be familiar with if you've ever played around with Hadoop; MongoDB makes it really easy to use
MapReduce because you can send JavaScript MapReduce functions right to the database — you don't need to maintain a separate Hadoop cluster in order to do that. In addition to the way we're going to manipulate the data, we want to think about the access patterns of our data. So, for example, for a given object, what's the read/write ratio? Is this an object that we're going to be updating a lot of the time, or is it something that's written once and mostly read after that? And we want to think about the life cycle of our objects and how they want to live. Obviously this is not terribly different from an RDBMS, but one of the things that tends to be a little different is that MongoDB does not support joins. Now, the reason we don't do joins is that joins are one of the biggest things that makes it difficult to scale a database horizontally. We find in a relational database that if we want to make our database work on multiple computers, rather than buying a bigger computer, joins tend to slow things down. The reason is that in a horizontally scaled database, when I perform a join I typically need to talk to an arbitrary set of database nodes, rather than being able to process a query on a single node — so by eliminating joins, MongoDB makes it way easier to scale horizontally. We sort of make up for the lack of joins with a richer data model, and that's again a lot of what we're going to be talking about today. So let's take the example of modeling, say, a bookstore. I've got a bunch of books and I want to store these inside MongoDB. First off, let me point out that MongoDB is a JSON database; it essentially uses JavaScript as its native language, so when you log into the database shell — analogous to, say, the SQL shell you might drop into in MySQL — everything is in JavaScript. All the examples on these slides are actually JavaScript code; you can cut and paste these examples into your MongoDB shell and they
should work just as you see them here. So here I'm creating a book object that has a bunch of attribute-value pairs: we have an author, a date, text, tags. Comparing this to a relational database, most of this looks pretty similar — the attributes here are similar to columns in a relational data store. The only difference is that instead of declaring the columns on the table, I'm just declaring them directly inside of the document. Now I can say db.books.save(book) to save this book object that I've just created. What's notable here is what I didn't do: I did not tell the database that I have a table called books. Simply mentioning it from my application causes the database to actually create a new collection called books and save this new object inside of it. I also, like I said, did not tell the database what fields are going to exist in this collection; instead, every book can have its own set of fields. The other thing you'll notice that's pretty different from what you can do in a relational database is the tags field, which is an array of strings — here you see it's a comic and it's adventure. Once I've inserted that document into the data store, I can actually do queries on it: I can say db.books.find() and get my book back. This is a lot like saying SELECT * FROM books in a relational database. You'll see the object looks a little bit different than it did when I inserted it. First of all, we have the _id field — this is essentially the primary key of the document in the database. You can specify whatever kind of value you want for _id; it might be the title of the book or its ISBN, for example, but typically people just leave this attribute out and let the database allocate one for you. We have a thing called an ObjectId, which is a globally unique identifier designed to work really well inside the database. You'll also see that the date field is expanded.
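To make the shape of this data concrete, here's a plain-Node sketch of the book document just described. There's no live MongoDB here — an in-memory array stands in for the books collection, so the save and find stand-ins below illustrate the idea rather than the real driver API:

```javascript
// In-memory stand-in for the "books" collection — nothing about its
// fields is declared ahead of time; each document carries its own schema.
const books = [];

const book = {
  author: 'Hergé',
  date: new Date(),
  text: 'Destination Moon',
  tags: ['comic', 'adventure'], // an array stored natively in the document
};

// "Saving" just appends, much like db.books.save(book) implicitly
// creating the collection on first use.
books.push(book);

// Naive stand-in for db.books.find(): return every document.
function findAll(collection) {
  return collection.slice();
}

console.log(findAll(books).length); // 1
```

The point of the sketch is that the document itself is the unit of structure — no table definition, no column declarations.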
In the previous slide I just said new Date(), and here we see it's expanded to the actual date. That's actually a little bit of a misnomer: MongoDB internally stores dates as integer fields, so they're much more compact than they look here, but in the shell, when I look at a date, it's going to expand it and make it human-readable for me. Now, once I've inserted some documents into the collection — MongoDB supports indexes just like a relational database. For example, if I want to create an index on the author field, I can say db.books.ensureIndex({author: 1}). This creates a B-tree index on the author attribute of the documents in my books collection. Again, this is a lot like a relational database: if you don't have any indexes and you perform a query, you're going to do a table scan, where the database looks through every object and compares it with your query; but if I do have an index that supports the query, it's going to perform much more quickly, especially on large datasets. So here I say db.books.find({author: 'Hergé'}) — forgive my poor French pronunciation — and now I get my book back. The parameters of the find clause are a lot like the WHERE clause in a SQL query: I can specify attributes of my documents that I want to match against, and the database is going to go find the documents that match that query. Now, MongoDB internally has a query planner and a query optimizer that figure out the best way to plan and execute a query, and I can access this from the shell using the explain command. Again, this is very similar to an RDBMS, where I might EXPLAIN a SQL query to find out how the database is processing it. So here, from the shell, if I say db.books.find({author: 'Hergé'}).explain(), I see the explain plan for this query. This is going to tell me what kind of cursor it's using — here I can see it's a B-tree cursor, which means it's using an index. I can see nscanned —
this is telling me how many comparisons the database needed to do in order to find a match for my query. I can see the number of milliseconds it took to run the query, and some other more advanced stuff like index bounds, which we'd get into in the detailed indexing talk, if you want to attend that webinar whenever it happens next. Up until now the indexes behave pretty much like you'd expect from a relational database; now we're going to do a neat trick. If you remember, earlier our book object had a tags value, which was an array, and now I'm defining an index on that tags array: I'm saying db.books.ensureIndex({tags: 1}). What's interesting here is that in a relational database, if I wanted to store a set of tags on an object, typically I would set up a second table — say a tags table — with a foreign key that refers to an element in my books table, and if I have five tags I might have five rows in that table that refer back to the book. But in MongoDB I can store arrays natively inside the parent book document; there's no need to create a second collection. And indexes are smart — they work with these data types. When I define what we call a multikey index — an index on an array field — it actually takes all of the individual values inside the array and puts them inside the index. This means I can now do queries like db.books.find({tags: 'comic'}), and this is going to find all of the books in my collection that have a comic tag inside that tags array. You'll note I'm not matching the entire value of the array; I'm just matching an individual element inside of it. This is really neat for doing things like tags or keyword searches: I can do them natively inside the document — I didn't need to do any joins, I didn't need to set up a second table, I can store all that information right inside the document.
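The multikey matching just described — a single query value matching any element of an array field — can be sketched in plain JavaScript. This models the query semantics only, not MongoDB's actual B-tree internals:

```javascript
// A query value matches a field if the field equals it, or — the multikey
// case — if the field is an array containing it.
function matchesField(docValue, queryValue) {
  if (Array.isArray(docValue)) return docValue.includes(queryValue);
  return docValue === queryValue;
}

const books = [
  { text: 'Destination Moon', tags: ['comic', 'adventure'] },
  { text: 'Some Novel', tags: ['fiction'] },
];

// Stand-in for db.books.find({tags: 'comic'}): match against any element.
const comics = books.filter((b) => matchesField(b.tags, 'comic'));
console.log(comics.map((b) => b.text)); // [ 'Destination Moon' ]
```

No second table, no join — the tag values live inside the parent document, and the query matches individual elements.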
Now, on to MongoDB's query language. Up until now we've just been doing simple queries, matching on a single attribute, but we actually have a full query language where you can combine multiple fields. I can match, for example, on author and publication date; I can do range comparisons — find me all the books published between this date and that date; I can do logical queries like $in and $nin to do set comparisons; I can check the type of things — find me all the books where the title is a string, which might be a useful query for you. We also have a set of update operators that allow you to perform atomic updates to a document, and we're going to go through an example of exactly how you do this in a moment. So, now we have our book inside the books collection and we want to add comments. You'll notice that there was no comments field inside the original book document, but here I have a snippet of JavaScript where I create a new comment: it's a comment from Kyle, it's got a date for when the comment was written, he says it's a great book, and he's got five votes for this comment. Now I can use the update command to actually push this comment onto a book: I say db.books.update, and the update command has two clauses inside of it. The first is a query clause — I'm saying {text: 'Destination Moon'} — and this acts like the WHERE clause of your query; it tells us which book we want to update. The second part is the actual update command, and I'm using two update operators inside this single operation. First, a $push command: this pushes a value onto an array. So when I say $push: {comments: new_comment}, this tells the database that on this document with the title Destination Moon, I want there to be a comments attribute that is an array, and I want you to push this new comment onto that array. If the array doesn't exist, MongoDB is pretty smart — it'll add the array. I didn't need to tell the database that there was a new field; it simply adds it into
the document. I'm also using the $inc command — $inc is increment — and I'm saying I want to increment the comments_count field by one. Once again, comments_count did not exist in the original document, so this command is going to create that attribute, initialize it to zero, and then increment it by one. So now when I go back and look at my document, I'll see that I've extended it: we have the original document modeling my book at the top — the author, date, text, and tags — but now we also have a comments field, which is an array that has my comment as a nested document stored inside of it, and I have my comments_count field, which counts the number of comments inside my document. What's really interesting here, when you think about it from a data modeling perspective, is that in a relational database I would have a number of tables at this point: a table modeling my books, a separate table for tags, a separate table for comments, probably another table for authors — and if I wanted to do any operations on this data, I'd really need to do joins and multi-statement transactions to work with this model. But in MongoDB I can store all of this inside a single document, because after all, most of the time my application is probably dealing with all of this data at the same time. And by doing that I get a lot of optimization out of the database: since all this data is stored together, it's very fast to retrieve — I don't need to join data together that's stored at different locations on disk — and it's very easy to deal with, because this is the unit of information my application is typically dealing with. Now, once you start adding things like nested documents and arrays, the syntax starts to get a little interesting. We have this thing we call the dot operator, which allows me to reach into documents and address fields that are stored internally inside of them.
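Stepping back to that update for a moment: the behavior of $push and $inc on missing fields can be sketched with a toy applier in plain JavaScript. Only these two operators are modeled here; the real update command supports many more:

```javascript
// Toy update applier. Like MongoDB, $push creates the array if it's
// missing, and $inc treats a missing counter as 0 before incrementing.
function applyUpdate(doc, update) {
  for (const [field, value] of Object.entries(update.$push || {})) {
    if (!Array.isArray(doc[field])) doc[field] = [];
    doc[field].push(value);
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount;
  }
  return doc;
}

// The book starts with no comments field and no comments_count field.
const book = { text: 'Destination Moon', author: 'Hergé' };
const newComment = { author: 'Kyle', date: new Date(), text: 'great book', votes: 5 };

applyUpdate(book, { $push: { comments: newComment }, $inc: { comments_count: 1 } });
console.log(book.comments.length, book.comments_count); // 1 1
```

Both fields appear on demand — the same dynamic-schema behavior the real update exhibits.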
So, earlier we were doing a query where we said find me all the books by this author — but what if I want to find all of the books that Kyle has commented on? Here I'm going to create an index on comments.author. The dot notation allows me to reach into objects: comments, remember, was an array field that stored objects internally, and each of those objects had an author field — the author of that comment. So by ensuring an index on comments.author, I'm telling the database to go through all the book documents, and for every book document, look at all of the comments and create an index on the author field of all the comments. That now lets me do queries like db.books.find({'comments.author': 'Kyle'}), and what I'm going to get back is all of the books that Kyle has added a comment on. You can also do things like ensuring an index on the number of votes — an index on the votes field — and that lets me do a query like: find me all the books that have greater than fifty votes on one of their comments. So this is a really powerful technique. It lets me store a lot of structured information, all grouped together by the high-level entity that I'm actually thinking about in my code, and I didn't need to go through a process of normalizing this data — I can store it all in a single document. Typically, like I said, if I'm building a new competitor to amazon.com and I'm selling books, a book is the natural unit my application is working with, so it makes sense to have all this information grouped together. Now, if you want your database to go fast, these rich documents are the way to do it. As I mentioned, there are a couple of things that make this really powerful. First of all, all this information is stored together on disk.
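The dot-notation semantics just described can be sketched as a small recursive matcher in plain JavaScript — again, an illustration of the query behavior, not of how the index is stored:

```javascript
// 'comments.author' reaches into the comments array and matches if any
// element's author field satisfies the predicate.
function matchesPath(doc, path, predicate) {
  const [head, ...rest] = path.split('.');
  const value = doc[head];
  if (rest.length === 0) {
    const vals = Array.isArray(value) ? value : [value];
    return vals.some(predicate);
  }
  const subdocs = Array.isArray(value) ? value : [value];
  return subdocs.some((sub) => sub && matchesPath(sub, rest.join('.'), predicate));
}

const books = [
  { text: 'Destination Moon', comments: [{ author: 'Kyle', votes: 5 }] },
  { text: 'Some Novel', comments: [{ author: 'Ana', votes: 60 }] },
];

// db.books.find({'comments.author': 'Kyle'})
const kyleBooks = books.filter((b) => matchesPath(b, 'comments.author', (v) => v === 'Kyle'));
// db.books.find({'comments.votes': {$gt: 50}})
const popular = books.filter((b) => matchesPath(b, 'comments.votes', (v) => v > 50));
console.log(kyleBooks.length, popular.length); // 1 1
```

One query reaches through the array of embedded comment documents without any join.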
Normally, in a relational database, if I had six different tables that I needed to join together, all that information would be spread out in different areas of memory and different areas of disk, so in order to fetch a whole book document back I'm asking my computer to do a lot of work: go to the books table and get the information about the book, then go to the authors table and get information about the author, then go to the comments table and get all the comments for the book. These are slow operations — accessing disk is slow — so the more data locality I can exploit, the faster my queries are going to run, simply because the disk drives in my computer have to do less work. The other thing that happens is that the query planning phase of the database is much simpler. In a relational database, when I have complex joins, the query planner has to figure out: for this query that's matching ten different attributes and joining six different tables, which table should I start looking at first? And oftentimes it'll pick the wrong one — there's a lot of science, and a lot of PhDs have been granted to people, for figuring out great ways to optimize SQL queries. But in MongoDB the query language is much simpler: you're always sending a query to a single collection, and that dramatically simplifies the query planning stage of the database, so there's a lot less overhead in processing a query before you actually start loading things off of disk. We like rich documents — they're fun to work with, and in many cases it's a lot easier to think about your book and all its comments as a single object than to try to completely normalize your data model. As a counterexample, we have things like this: this is the entity-relationship diagram for Magento, which is an open-source ecommerce platform. Magento is a really powerful
ecommerce platform, but it's really complicated — you can see there's a massive number of tables stored inside of Magento. What this means is that, first of all, if I'm trying to extend Magento to do something new, it's often very complicated, because there are lots of moving parts inside the database. Similarly, if I'm trying to fetch information or update things inside this database, it's going to be complicated, because I need to update lots of different tables and do lots of complex joins, and things are going to tend to slow down. By contrast, in MongoDB documents can be very rich. For example, here is a sample purchase order document. This may or may not be realistic for your application, but what's interesting is that I can store all the information for my purchase order in a single document: I've got an array of line items with the SKUs of the various products I'm going to buy, I've got the address where I want these items shipped, I've got the payment information with the credit card details and expiration date, I've got the subtotal of how much this whole shopping cart is going to cost — and all of this is stored in a single document. It's much easier to reason about this than a more complex relational model, for many types of problem domains. So now let's look at some of the common data patterns we see and how we actually go about modeling them in MongoDB. First: inheritance. We take it for granted — we use it all the time in any object-oriented programming language. Typically you're doing things like this: I've got an abstract shape class, and I say all shapes have an area, but my system is going to store three different kinds of shapes — circles, squares, and rectangles — and each of those shapes has different parameters that specify it. A circle stores its radius, a square stores the length of each edge, and a rectangle stores its length and
width. In an object-oriented language — modeling this in Java or Ruby — I might use a simple inheritance tree: have a shape class and then extend it with various subclasses to model the various attributes. But as we all know, when you try to map this to a relational database it gets complicated, and you probably end up with something like this. Single table inheritance is commonly how you solve this in an RDBMS: I have a table called shapes, and it's what we call a sparse table — I have all of the attributes for the base class and all of the subclasses modeled inside my single table, and for each row, depending on which subclass it actually is, only a subset of those fields are filled in. This has a few limitations. First, if I want to add a new subclass, I need to go modify my schema and add new columns for that subclass's attributes. It also means my table is very sparse: if I have hundreds of subclasses, each individual row is likely to have a very small subset of those columns with actual values. This can make it very difficult to query this data, and in many databases it can be very space-inefficient to store, because I actually need to use space on disk even for the columns I'm not using. So how do we go about modeling this in MongoDB? Well, because every document can have its own attributes, there's no need for this sort of sparse table model. When I say db.shapes.find() here, you'll see I've got a circle, a square, and a rectangle, and each document, depending on its type, has different attributes. MongoDB doesn't care what attributes you have inside the documents in a collection — it can query them all just fine. So here I've got three shapes in my database, and let's say I want to find all the shapes where the radius is greater than zero: I can say db.shapes.find({radius: {$gt: 0}}).
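The semantics of that query over a heterogeneous collection can be sketched in plain JavaScript — documents without the queried field simply don't match. The exact attribute names below are illustrative:

```javascript
// Three shape documents, each with its own attributes — no sparse table,
// no schema declaration.
const shapes = [
  { type: 'circle', area: 3.14, radius: 1 },
  { type: 'square', area: 4, d: 2 },
  { type: 'rect', area: 10, length: 5, width: 2 },
];

// Stand-in for db.shapes.find({radius: {$gt: 0}}): like $gt, documents
// that lack the field at all are excluded from consideration.
const withRadius = shapes.filter(
  (s) => s.radius !== undefined && s.radius > 0
);
console.log(withRadius.map((s) => s.type)); // [ 'circle' ]
```

Squares and rectangles fall out of the result automatically, exactly because they carry no radius attribute.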
Now, what's interesting here is that many of these documents don't even have a radius field. The way MongoDB handles this is that the query only considers documents that actually have a field called radius, so it automatically excludes squares and rectangles, because they don't have a radius attribute. I can also index on these fields: sparse indexes allow me to index on attributes that only a subset of my documents have. So here I'm defining db.shapes.ensureIndex({radius: 1}, {sparse: true}). If I exclude the sparse: true option, this creates an index on the radius attribute — that's pretty obvious — but the default behavior of ensureIndex is that if a document does not have the field, it gets inserted into the index as if it had a null value. So if I exclude sparse, my index would have all three shapes in it: one shape with a radius of one, and two shapes with a radius of null. When I use sparse, that tells the database to exclude from the index any documents that don't have a value for that field. That's going to dramatically compress my index, because if only 10% of my documents have a radius attribute, my index only needs to cover 10% of the documents in the collection. It can be a great space saving, and it's also going to make things a bit faster, because the index is smaller. Another common data pattern we tend to model is one-to-many relationships, and there's not necessarily a hard and fast rule about exactly how to model these. A one-to-many relationship might represent a weak association, or it might represent a composition or an aggregation, and depending on the semantics and ownership of the relationship, there are different ways you might want to model it. So, one way to model one-to-many relationships is to use an embedded array or array
keys. For example, if I have a user document and the user has a number of blog posts, I can store a blog posts attribute that has an array of all the IDs of the blog posts that user has created. There are some helpful things I can do with this: I can use the $slice operator to retrieve a subset of that array — $slice lets me say, find me the first five blog posts, basically the first five elements of the array. However, sometimes this can make queries hard: if I have everything stored in arrays like this, it's difficult, for example, to find the latest comments across all the documents I've commented on. Another way to do this is with an embedded tree: rather than having an array with references to other things, I can store nested documents that hold attributes and documents hierarchically — much like we talked about having the comments embedded inside the document, that can be modeled this way. So again, you want to think about this in terms of trying your data model out with several different patterns, and seeing what the query needs of your application are and how naturally that representation can be queried for your use case. Another way to do it is the more traditional way: normalizing your data. Going back to our comments example — in the earlier example we had our blog post, and comments were stored as an array of documents inside of it. Another way to do it is to have comments be its own collection; this is more like how you would model it in an RDBMS. Each comment might have a reference to the ID of the blog post it's from, and that's going to be very flexible to query. However, you're going to lose some of the performance benefits of data locality, because now, much like an RDBMS, when you go to fetch a blog post and all of its comments, you need to go to multiple
places on disk in order to fetch all that data. So again, there's a decision process you go through, and you've got a lot more tools inside of MongoDB for how to model these kinds of things: you can store things embedded inside the document, or you can normalize things — although keep in mind that you're going to need to do that join within your application; the database is not going to do it for you. As we look at the different ways of doing this, you want to think about the ownership pattern. If an author is an aggregation or a composition of a number of blog entries, then I might look at storing those as an embedded array or using array keys. If I have weaker associations, I might use an embedded tree — so instead of having an array of all my blog posts, I might have a document where the keys are the blog post names and the values are the blog posts themselves. Or I might go and fully normalize that model and store the things in different collections. So again, this is the array model of storing comments, and this would be a more normalized model: at the top we have our book document, and at the bottom we have a comment as its own document in a separate collection. Here we store a book_id, which is basically a foreign key reference into the books collection. This is going to give me the most freedom with respect to how I query my comments — it's a lot easier to do things like find me the last five comments that were posted on any blog or any book — but it's going to be the least performant of the models, because you're asking the database to do a lot more work: the data is spread out more on disk, and keep in mind you're going to need to do the joins on your own. So for example, if I want to find all of the comments on Destination Moon, I'm first going to query the book document to get the book and its _id, and then use the _id of the book to find all the comments that reference that book ID.
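That two-step lookup — the application-side join — can be sketched in plain JavaScript, with arrays standing in for the two collections. The _id and book_id values are made up for illustration:

```javascript
// Normalized model: books and comments live in separate "collections",
// and each comment carries a foreign-key-style book_id reference.
const books = [{ _id: 'book1', text: 'Destination Moon' }];
const comments = [
  { _id: 'c1', book_id: 'book1', author: 'Kyle', text: 'great book' },
  { _id: 'c2', book_id: 'book1', author: 'Ana', text: 'loved it' },
  { _id: 'c3', book_id: 'other', author: 'Bo', text: 'meh' },
];

// Step 1: stand-in for db.books.findOne({text: 'Destination Moon'})
const book = books.find((b) => b.text === 'Destination Moon');
// Step 2: stand-in for db.comments.find({book_id: book._id})
const bookComments = comments.filter((c) => c.book_id === book._id);

console.log(bookComments.length); // 2
```

Two round trips instead of one document fetch — that's the locality cost the talk is describing, traded for the flexibility of querying comments on their own.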
So when you're dealing with these hierarchical documents, you have two choices of how to represent them: one is referencing, the other is embedding. When I reference something, it's much like the foreign key model from the previous slide: within my documents I store the ObjectId, or some other key, that allows me to perform a second query into another collection to get the thing I'm looking for. Embedding says I'm just going to put the whole document inside of the outer document. When I embed things I tend to get more performance; when I reference things, it tends to give me a little more flexibility with my queries. It really depends on what you're looking for in your application, and the rule of thumb is that it's so easy to try out different models in MongoDB that you can rapidly prototype your data model with different approaches, to see how easy each is to query, how well it performs, and how efficient your indexes are for your data model. Now let's look at many-to-many relationships. In this example, let's say I've got products and categories: a product can have many categories, and a category can contain many products. Modeling this is a little trickier; you're obviously not necessarily going to be able to embed all the information to make this work. One way we can do this is to have a products collection, and for every product in it, store an embedded array of category IDs. This lets me reference the categories this product is in. Separately, I can have a categories collection, where each document represents a category, and every category has a list of the product IDs in that category. What's notable here is that in a relational database I would need a third table, a join table, to represent these relationships.
In MongoDB I can do this with just the two collections that already exist: I've got products and categories, and each one contains an array of references to the other. Now, depending on the update rate and the dynamism of this model, this might be a pain, because it means that whenever I add a new category or a new product, I potentially need to go update multiple documents. So depending on how static this type of relationship is, this may or may not be a good way to go. Incidentally, with this model I can define indexes on both products and categories, which makes it really easy to search in either direction. Another alternative is to store the relationship on only one side. In this model, we've got our products collection, and each product contains a list of the categories it is part of; then we have a separate categories collection that holds the information about each category. So I only store the association on one side, but it turns out I can still perform most of the queries you would expect off of this model. If I want to find all the products in a given category, I can say db.products.find() with the category ID I'm looking for, and that gives me all the products. If I want all the categories a given product is in, I can first query the product, say db.products.find() with its ID, which gives me the product object that internally contains the list of category IDs, and then I can say db.categories.find() where _id is $in the set of category IDs from the product. So now I'm leveraging the $in query to say: here's a set of category IDs, go fetch me all of these category documents. I can do this query in both directions: I can get all the products for a category, and I can get all the categories for a given product, and I only needed to store the association once.
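Here is a minimal sketch of the one-sided association in plain Python, with lists standing in for the two collections (the sample products and categories are made up for illustration):

```python
# One-sided many-to-many: each product stores its category ids;
# categories do not store product ids.
products = [
    {"_id": "p1", "name": "Leaf blower", "category_ids": ["garden", "power-tools"]},
    {"_id": "p2", "name": "Trowel", "category_ids": ["garden"]},
]
categories = [
    {"_id": "garden", "name": "Garden"},
    {"_id": "power-tools", "name": "Power tools"},
]

def products_in_category(cat_id):
    # Mirrors db.products.find({category_ids: cat_id}) on a multikey index.
    return [p for p in products if cat_id in p["category_ids"]]

def categories_of_product(product_id):
    # Mirrors the two-step query: fetch the product, then
    # db.categories.find({_id: {$in: product.category_ids}}).
    product = next(p for p in products if p["_id"] == product_id)
    wanted = set(product["category_ids"])
    return [c for c in categories if c["_id"] in wanted]
```

Both directions are answerable even though the association is stored only on the product side.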
I also only need a single index here, on the nested category IDs attribute, to make these queries fast. Another common problem is modeling trees. This tends to be tricky in any database, and one of the nice things about MongoDB is that because you can store hierarchical documents inside of a single document, it tends to be pretty easy to represent an entire tree in one document. For example, imagine I've got not just comments but threaded comments, so I can respond to an individual comment. I can store that as a big tree, where the entire document is the whole set of comment threads on a particular book: at the top level I've got all the top-level comments on the book, and underneath that I've got all the responses to each comment. So in this document example, I've got an array of comments, and within each comment an array of replies, and now I can store this whole tree of comments inside a single document. The pros: it's a single document, so it's going to be very high performance, and it's relatively intuitive, because it looks the way you would expect it to look when you're viewing the comment threads on a web page or inside your application. On the flip side, it's a little hard to search, and it's difficult to get partial results. If I want to find all of the comments from Kyle, I'm going to end up getting the entire comment thread, including all the comments where Kyle's was just one of them. Also, in MongoDB you need to be aware of the document limit: each document has a maximum size of 16 megabytes, so if you're going to model data this way, make sure you won't run up against that upper limit. Another way of storing trees is to use parent and child links. With parent links, each node is stored as a document that contains the ID of its parent.
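To make the "hard to get partial results" point concrete, here is a sketch in Python: a hypothetical book document with an embedded reply tree, and the recursive walk an application needs to pull out just one author's comments (the field names `comments`, `replies`, and `author` are assumptions in the spirit of the slide, not taken from it):

```python
# A whole comment thread embedded in one book document.
book = {
    "title": "Destination Moon",
    "comments": [
        {"author": "Kyle", "text": "Great read", "replies": [
            {"author": "Anna", "text": "Agreed", "replies": []},
        ]},
        {"author": "Anna", "text": "A classic", "replies": [
            {"author": "Kyle", "text": "Indeed", "replies": []},
        ]},
    ],
}

def comments_by(author, comments):
    """Recursively collect every comment by one author; note that the
    entire document had to be loaded and walked for this partial result."""
    found = []
    for c in comments:
        if c["author"] == author:
            found.append(c["text"])
        found.extend(comments_by(author, c["replies"]))
    return found

print(comments_by("Kyle", book["comments"]))  # ['Great read', 'Indeed']
```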
With child links, each node contains the IDs of its children, and with child links you can actually support more complex topologies, like graphs where a single child has multiple parents. Let's look at what that looks like. Here we use a model we call the array of ancestors. Each document is a node in my tree, and "ancestors" is an array: the path from the node back up to the root of the tree. "Parent" refers to the node immediately above it in the tree. So here I've got nodes A, B, C, D, E, F, and I can see the ancestor path and the immediate parent of each node. If I look at node F, its parent is E, and E's parent is A, hence F's ancestor path is A, E. I can see E in this collection as well: its ancestors are A, and its parent is A. Now, if you're going to be changing this tree a lot, this model can be a little more complex to update, because if I want to reorganize the middle of my tree, it's going to be hard to make those updates and maintain the ancestor arrays. However, if the tree is relatively static, I can query it very efficiently, and I can do a lot of very complex operations on the tree with relatively simple queries. For example, if I want to find all the descendants of B, I can say db.tree.find({ancestors: "B"}). Here we're leveraging the multikey index: find me any document that has B anywhere in its ancestors list, so even nodes ten levels down from B will be included in the results. If I want to find the direct descendants of B, I can say db.tree.find({parent: "B"}); the parent is just the immediate parent, so that gets me directly to the direct children of B. And if I want to find all the ancestors of F, I can go get the document for F, which includes its list of ancestors, and then fetch from that list all the individual documents for its ancestors.
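The array-of-ancestors queries can be sketched over in-memory documents like so (the slide only fixes F under E under A; the placement of B, C, and D below is assumed for illustration, and only the `ancestors` and `parent` fields matter):

```python
# Array-of-ancestors tree: each node stores its full ancestor path
# plus its immediate parent.
tree = [
    {"_id": "A", "ancestors": [],         "parent": None},
    {"_id": "B", "ancestors": ["A"],      "parent": "A"},
    {"_id": "C", "ancestors": ["A", "B"], "parent": "B"},
    {"_id": "D", "ancestors": ["A", "B"], "parent": "B"},
    {"_id": "E", "ancestors": ["A"],      "parent": "A"},
    {"_id": "F", "ancestors": ["A", "E"], "parent": "E"},
]

def descendants_of(node):
    # Mirrors db.tree.find({ancestors: node}) on a multikey index.
    return [n["_id"] for n in tree if node in n["ancestors"]]

def children_of(node):
    # Mirrors db.tree.find({parent: node}).
    return [n["_id"] for n in tree if n["parent"] == node]

def ancestors_of(node):
    # One fetch: the node carries its own ancestor path.
    return next(n for n in tree if n["_id"] == node)["ancestors"]
```

Each of these is a single simple query precisely because the path is precomputed and stored on every node.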
This is useful for storing things like taxonomies or hierarchical categorization, say in a product catalog. Another common pattern is a queue. You might, for example, be building a background job processor, where you want to keep track of all the jobs your system is processing and make sure that each job is only run once. What I might do is store a collection where each document represents a job. Inside this document I've got a status field, in_progress, which says whether or not this job is actually running; a priority, which says what the priority of this job is relative to everything else; and some message inside of there that tells me what this job actually is. Now, from inside my application, I want to find the highest-priority job and mark it in progress; this is the query that my worker tasks would run to allocate a job for themselves and start working. Here I'm going to use a command we haven't seen before: findAndModify. What findAndModify does is an atomic compare-and-swap operation. When I say db.jobs.findAndModify, I specify a query, where I'm looking for jobs that are not yet in progress; I specify a sort, where I'm sorting by priority, because I want the highest-priority job that has in_progress set to false; and then I specify an update, where I set in_progress to true and record a timestamp of when I started. What this is going to do is atomically find the highest-priority job that is not currently in progress, mark it in progress, and give it a timestamp. This ensures that only a single client of the database will get allocated this job, and I can manage my jobs fairly easily without having to introduce a separate message queue or other system to manage them.
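As a sketch of the claim semantics (not the server implementation), here is the findAndModify job-claim logic in Python, with a `threading.Lock` standing in for the atomicity that the database provides server-side; the job documents are invented for illustration:

```python
import threading

# In-memory job collection; in MongoDB this whole claim is one call:
#   db.jobs.findAndModify({
#     query:  {in_progress: false},
#     sort:   {priority: -1},
#     update: {$set: {in_progress: true, started: <now>}}})
jobs = [
    {"_id": 1, "priority": 5, "in_progress": False, "message": "send email"},
    {"_id": 2, "priority": 9, "in_progress": False, "message": "resize image"},
]
_lock = threading.Lock()  # stands in for the server-side atomicity

def claim_next_job():
    """Atomically pick the highest-priority unclaimed job and mark it."""
    with _lock:
        candidates = [j for j in jobs if not j["in_progress"]]
        if not candidates:
            return None
        job = max(candidates, key=lambda j: j["priority"])
        job["in_progress"] = True
        return job
```

Two workers racing on `claim_next_job` can never claim the same job, which is the guarantee findAndModify gives across database clients.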
So that brings us to the end of our presentation; in a moment I'll start answering some of the questions that have come in along the way. Obviously, modeling data for a document data store is pretty different than for a relational database, but a lot of the things you know from relational databases carry over directly. You can still do foreign-key relationships inside a document database; the only issue is that your application is going to need to fetch those foreign-key values itself, since the database doesn't do joins for you, but that doesn't mean you can't use them. Some of the main concepts: use the hierarchical nature to your advantage. You're going to get huge performance benefits from data locality, from storing information together so the database can access it more efficiently. There are a lot of data models that are actually much more natural to express in this kind of document model: things that have a hierarchical nature, which I can store inside a document, and things that have arrays and lists, which are more complicated to model in a relational database. And because it's a completely dynamic schema, I have a lot of freedom to extend the schema and have a high degree of variability in the actual fields stored inside the database. So hopefully this was useful: I went through the core concepts of modeling documents, how we query them, and a few examples of basic data models and how you can actually access those models with MongoDB. And now, with that, I'm going to jump over to the Q&A tab and see what we've got. One question: if you want to constrain what values can be used in tags for books, how could that be done? Well, things like constraints you're typically going to be enforcing at your application level. The database doesn't have any notion of constraints or restrictions on documents; there's no way I can tell a collection that field X can only have values of, say, one through five.
So if you want to enforce those kinds of constraints, typically you're going to need to devise a way to do it at your application level. Now, to some database architects that'll sound like blasphemy, but other software architects will say: in my Rails application, Active Record does a great job of enforcing validations on documents, or, with Spring Data, in my domain model it's very easy to express these constraints, and I'd rather have those constraints live in the code. That's sort of our philosophy: these constraints typically live in your code anyway, and you're not relying exclusively on the database to do this, so since you've already got the constraint model there, it's typically easier to leave it there. Another question: can you explain something about scaling of MongoDB? Yes. MongoDB scales horizontally with a technique called sharding. With sharding, basically what happens is you take a collection of documents and choose some field inside that collection as your shard key. For example, let's say we are sharding our books collection, and I shard it based on the title of the book. If I have five database instances, MongoDB is going to spread all the values of book title evenly across those five databases. As I start inserting books, it looks at the title field of each book and automatically sends that request to the database instance that holds that range of the data. What this does is give me a great degree of parallelism, because I can now add more servers to my database, and that spreads my data out over more and more compute and storage capacity. MongoDB is totally dynamic, so if I add a new server, it's automatically going to spread the data out to use that new server; I can also take servers out of commission, and it'll move those documents off of that server onto the remaining ones. So what this means is that, first of all, MongoDB is going to be very high performance on a single node, but once you exhaust the performance of a single node, you can start adding more servers to your cluster, and the database is going to scale pretty well just by adding more machines.
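Range-based sharding can be sketched as a tiny router: each document goes to whichever shard owns the range its shard-key value falls into (the split points below are invented for illustration; in the real system the balancer picks and rebalances chunk ranges automatically):

```python
import bisect

# Hypothetical split points dividing book titles into 5 ranges,
# one range per shard.
split_points = ["F", "K", "P", "U"]
shards = [f"shard{i}" for i in range(5)]

def shard_for(title):
    """Route a document to the shard owning its title's range."""
    return shards[bisect.bisect_right(split_points, title)]

print(shard_for("Destination Moon"))  # shard0 ("D..." sorts before "F")
```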
Here's a question: a document has a limit of 16 megabytes; the document corresponds to a row, and not the entire table, so a table can still have millions of documents, with each document up to 16 megabytes? Yes, you're correct. Each document is limited to 16 megabytes, but a collection basically has a 64-bit address space, so you can have however many 16-megabyte documents you can fit in a 64-bit address space, which is a lot. Next: how do you govern the different fields and make sure you don't have slightly different names for what should be the same field? Is there a schema analyzer? That's a great question. Again, as with the earlier question about constraints, typically you're enforcing this at your application level. Now, you can run into situations where you push a version of code that called a field "username" and then somebody changed the code to call it "user", and you might be forced to deal with situations like that. Typically we ask that you try to enforce these things in your application code, and the way most people implement their domain models makes that relatively easy. If you do get into that situation, there is a $rename operator that allows you to rename a field, so you can go find all the documents that said "username" and rename that field to "user". Next question: does MongoDB support multi-column indexes? Yes, absolutely. You can do compound indexes on multiple keys, and each individual key in a compound index can have a different sort order. You can do geospatial indexes; notably, Foursquare runs on MongoDB, and all their check-ins are indexed via MongoDB's geospatial indexing capability. You've got sparse indexes, covered indexes: pretty much most of the features you'd expect from a relational database on the indexing side, you can do in MongoDB.
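The username-to-user cleanup mentioned above can be sketched like this (in MongoDB itself it would be a single multi-document update using the `$rename` operator; the in-memory loop below, over invented documents, just illustrates the effect):

```python
# In MongoDB:
#   db.users.update({username: {$exists: true}},
#                   {$rename: {username: "user"}},
#                   {multi: true})
users = [
    {"_id": 1, "username": "ada"},
    {"_id": 2, "user": "grace"},  # already written by the newer code
]

for doc in users:
    if "username" in doc:
        # Move the value to the new field name and drop the old one.
        doc["user"] = doc.pop("username")
```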
Next question: is there a benefit to using DBRef instead of just specifying ObjectIds? A DBRef is basically a convention for storing a reference into another collection; it contains a collection name and an ObjectId, and there's really nothing in the database that's optimized to deal with them. It's just a convention that's useful in many of the drivers for resolving references and knowing which collection to go to. So really, there's not a huge benefit; some of the drivers interpret them nicely, but storing lists of ObjectIds works just fine. Next question: how does one refactor schemas? Say I started out embedding documents in articles, but a million articles down the road I find out that I need more flexibility. That's the tricky question. Basically, you're going to need to migrate the existing data into your new model, and one of the ways we've found works most effectively is to do an incremental migration of the data. To give you a case study of how somebody did this, and it worked really well: Shutterfly had about 20 terabytes of data in their content management system as metadata about photos. They needed to migrate that data from their Oracle database into MongoDB, so that effectively is a schema transition, and then once they had moved to MongoDB, they went through several further evolutions of their schema. What they did is implement an incremental transition strategy: when they did the initial transition from Oracle to MongoDB, they added code in their model layer that treated MongoDB basically as a read-through cache. When a query came in for an object, it would first go to MongoDB and try to get the object; if it's not there, it would go to Oracle, get the object, store it in MongoDB, and then return it.
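The read-through migration pattern can be sketched with two dicts standing in for the legacy store and the new one (the keys and document shapes are illustrative):

```python
# Stand-ins for the legacy store (Oracle) and the new store (MongoDB).
oracle = {"photo:1": {"id": 1, "caption": "Sunset"}}
mongo = {}

def get_object(key):
    """Serve from the new store; on a miss, copy from the old store.
    Data migrates incrementally as the application reads it."""
    doc = mongo.get(key)
    if doc is None:
        doc = oracle[key]   # fall back to the legacy system
        mongo[key] = doc    # write through to the new store
    return doc

get_object("photo:1")
print("photo:1" in mongo)  # True: migrated on first access
```

The same shape works for old-format versus new-format documents within one database: check the format at the point of access and rewrite the document then.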
What that does is incrementally convert the data from the Oracle schema to the MongoDB schema as the application runs. The same model can work for data entirely within MongoDB: I can fetch a document, have code in my model layer that checks whether this document is in the old format or the new format, and at that point of access transition it to the new schema. This is a really good way to go, because if I've got many, many terabytes of data, it's often going to be infeasible, in any kind of data management system, to batch convert all that data without any downtime, so we find that kind of incremental approach is the best way to go. Next question: how would MongoDB perform if your book had a million comments? That's a great question; it's one of the cases where you're trying to decide between embedding the comments inside the book and keeping the comments in an external collection. If you expect to have a ton of data, you obviously want to be aware of the sixteen-megabyte limit and store your data in a way that won't run up against it, so if you expect lots and lots of comments, you're probably going to want some strategy for separating those comments out into separate documents. Now, it's not just a binary decision; you don't have to either store all the comments in the book document or store each comment individually in its own document. There's actually a great presentation from our CTO, Eliot, about a hybrid approach, where I store a collection of comment documents, each of which holds, say, a hundred comments. So I'd have my book collection, and then this collection of clumps of comments, and instead of fetching one document per comment, I get batches of comments a hundred at a time. If you go to 10gen.com/presentations and look for the schema design at scale presentation, you will see that model described in detail.
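The hybrid bucketing idea can be sketched as follows, with a small bucket size so the behavior is visible (the talk suggests about a hundred comments per bucket; in MongoDB the append could be a single upsert using `$push` plus a count field, but that exact implementation is an assumption here):

```python
BUCKET_SIZE = 3  # ~100 in the talk; small here for illustration

# Buckets collection: each document holds up to BUCKET_SIZE comments
# for one book, instead of one document per comment.
buckets = []

def add_comment(book_id, comment):
    """Append to the newest non-full bucket for this book,
    or start a new bucket once the current one fills up."""
    for b in reversed(buckets):
        if b["book_id"] == book_id and len(b["comments"]) < BUCKET_SIZE:
            b["comments"].append(comment)
            return
    buckets.append({"book_id": book_id, "comments": [comment]})

for i in range(7):
    add_comment("destination-moon", f"comment {i}")

print([len(b["comments"]) for b in buckets])  # [3, 3, 1]
```

Reading a page of comments now fetches a few bucket documents rather than a hundred individual ones, while no single document grows without bound.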
Next question: does updating a single document lock the entire collection or database? MongoDB's locking strategy relies on a read-write lock on the mongod process, so when you perform an update, it actually locks the entire process. That kind of sounds scary; however, in practice it tends to work pretty well most of the time. Basically, the actual update operation tends to be very, very fast, because that update is happening in memory. The case where the update is going to be slow, and where that lock is going to bite you, is when the object you're trying to touch is not actually in memory and we need to go to disk. Older versions of MongoDB had an issue where, if I tried to update a document that's not currently in memory, it would take that write lock, go to disk, load the document into memory, and then perform the update; that disk access is obviously much, much slower than a memory access, so it would hold the lock for a long period of time. Since then, the last couple of versions of MongoDB have introduced a number of features, like yielding and the ability to predict page faults, that basically eliminate that lock contention in most of the cases where it would bite you. Next question: how do you find documents that don't have a radius field? There is an $exists query operator and a $not query operator, so you can say db.collection.find({radius: {$exists: false}}), and that finds all the documents that do not have a radius field. I notice we're coming up on the end of the hour, and there are still a couple of questions left; I will do my best to collect answers to these and email them out to the people who asked them. At this point I'd like to thank you all for attending. This has been a great webinar with some really great questions; hopefully it was useful to you, and I would encourage you to check out MongoDB.org, where you can get the latest version of the software and access all the documentation.
Also check out 10gen.com: 10gen provides commercial services around MongoDB, and there's a ton of presentations and talks up at 10gen.com/presentations. And please do attend our next conference when it comes to your area; check out 10gen.com/events for more information. Thank you very much.
Info
Channel: O'Reilly
Views: 96,160
Rating: 4.9541545 out of 5
Keywords: Yasmina Greco, Mongo DB, Jared Rosoff, O'Reilly Webcast, Strata, Strata Conf, MongoDB, 10gen, Jared Rosoff:, Schema, Design, O'Reilly, Tim O'Reilly, O'Reilly Media, OReilly, Tim OReilly, OReilly Media, OReillyMedia, o'reilly, books
Id: PIWVFUtBV1Q
Length: 57min 45sec (3465 seconds)
Published: Sun Mar 11 2012