33rd Degree 2014 - MongoDB Schema Design - Tugdual Grall

Captions

So let's start. Good afternoon everybody, and let's talk about document modeling for MongoDB. In fact it's a general-purpose discussion around documents, and a comparison with what you can do with relational database tables and records.

A quick introduction: I am Tug, working as a Technical Evangelist at MongoDB. I have a few years of experience using NoSQL and document databases, but I also spent nine years at Oracle back in the early 2000s, so my DNA initially was relational databases and SQL statements. I moved to document databases, and it's quite exciting.

I put together a small agenda for today: why documents; working with documents and the impact on the application developer, for example what is interesting when you have to update your database or change your schema inside your application; and a quick discussion of queries and the common patterns that you can see inside your application. It's just an introduction, but it will give you an idea of the differences between a relational database and documents, and of which kinds of applications or features you may want to look at when you build your application.

I'd like to start with why we are talking about document databases like MongoDB. One small interesting point is that relational databases were built in the 70s; like me, I guess, they are 40 years old. They were built especially with two things in mind. One is to represent the way people work, manipulating data like a simple spreadsheet. The other is to optimize the way you store data on disk, because back at that time hardware was very, very expensive compared with today. So one of the keys of the relational database was to optimize using tables and columns: you create tables with rows in which you put some data. When you have to add more data in relation to that, for example you want to have customers or persons in relation to their bank accounts, you could not put them in the same record, because users
are going to have multiple accounts, so you have to build, and you know that better than me, another table and a join between these tables, a relation between these tables. It's quite nice for this specific use case, because we have a user or customer and we have bank accounts: a very simple use case. But in real life, when you build your application and manage your data, it's often a lot more complex than that.

Let's take an example. If we look at a product catalog, in this case a very generic product catalog where you have many things, from sports to electronics to office supplies, you will see that it can be very hard to design something for your database. We will take some sports goods, it sounds good: a baseball bat. It has some information about the length, the name, the dimensions, the type of metal or wood used, so many things that you can represent. What do you do? You create a table for this specific product. When you design your application you want to be sure that the names of the columns, or the names of the properties you manipulate, are something you can understand inside your product catalog. So in this case we see the category, the model, the name, the brand, and some information about how it looks and how it has been designed, such as the type, composite or wood for example. It's great: here you have a very nice table for the catalog of bats, and you can do whatever you want: compare, price, query and so on.

But you want to add different products inside your database, for example baseball gloves. First, you still have the size, it's great, but you also have the type of glove (infield, outfield, pitcher), how it's built (one piece, two piece) and many other things, plus again the brand and how it has been built. So you can find some data that are the same, but let's just create another table for gloves, just to show that we have some differences. So here you
have the bat and the gloves. What are the differences? You see that the technical descriptions of each product are quite different. Great: so far I have two tables, and you can imagine that you add more and more tables. It's the same for the baseball itself, so you add a third table. You cannot manage things this way.

So what do you do? This is where it becomes complex to manage with your relational database. One approach that we see quite often is the sparse table: you add more columns depending on what you need inside your application. But even if you do that, you still have to understand the limits of your product, of your application, of your database, and of the way you want to manage data. It doesn't really scale: the more products you have, the more columns you add, and most of the columns will be empty in the end, because you have so many different products. That makes it very complex to manage.

Another approach, which I would say is smarter, is to use key-value tables in relation with your main table. In this case all the common information is located in one single table, a product table, where you have the product ID, category, model, name, brand, country, price and so on. Then you have another table where you simply have the property, the value, and the link to the initial product table. For example, here we have the bat in green, and the list of attributes, the list of properties for this specific bat, in another table. But you see that if you want to get a specific property for a specific product, you have to do a join; you have to create a complex query. It will also be hard to build a complete product that is a set of sub-products. For example, here I want to buy full equipment for the field, for the player, so you have to be able to create a
product that is a join or a union of many things. One product in this case is equal to a set of products, which is hard to manage; you need yet another table. And because you are building an e-commerce platform, you have more and more products with more and more different types, so you will add other tables and other sets of properties. It becomes very complex, and this is why in real life you often see this kind of model or schema inside your database.

And it works. I won't say that it doesn't work; it is just complex. It is hard to manage, it takes a long time to develop and to evolve, and it's hard to change. This is one of the key parts here: it's very hard to change. When you want to query specific attributes, you don't know if they are part of one table, or part of a property table, or of something specific on the other side of the schema. It's quite complex to build an application on top of it, so queries are very complex.

Based on all these statements, you should have another way of doing the same thing. Also, what happens every time you change something in your application, every time you build something new inside it? It's kind of a nightmare. We know how to add a product, but suppose you want to add a different way of managing the product price, or add a specific combo where you can bundle this product with that product: it's quite complex. This is one of the reasons why, working with a relational database in real life, it's often very hard to have fully agile development: adding columns, adding tables, changing the type or mixing the types of a specific attribute. So this is one of the reasons document databases have been built: to make life easier.

Whether or not you know the story of MongoDB, it was built by two guys in New York who used to work in many startups, and they had a hard time working with the
database. So they wanted to build a database that could not only scale depending on the number of users or the volume of data, but that should also be very easy to develop with, and part of that is being easy to develop in an iterative way: adding new features should be easy, modifying the schema should be easy.

So basically, if we compare at a high level what you do in the relational model with what you do with documents, you can represent a join inside one document. In this case, for example, I have a list of persons and a list of cars, and I have a join between them: one person can have multiple cars, a basic relational model. In MongoDB, using documents, you create one single document that contains a list of cars. This is a simplistic approach; we will see later on that you have many options. But basically, by doing that you greatly simplify the way you manage things, because in one single call you access all the information, without doing multiple queries on multiple machines or in multiple places on disk. One single document is located in the same memory space, on the same server, on the same disk. When you do that with tables, records and joins, you don't know exactly how or where it will be handled.

Let's go back to our example with the sports goods. Here again we manage the gloves, the bat and so on, so the same properties you will manage in one single document with your fields. One important part: you sometimes hear that NoSQL databases, MongoDB or other engines, are schema-less. They are not really schema-less. They are schema-less in the sense that the database itself does not validate a schema, but your application manipulates a schema; the data that you are storing inside the database are the schema. Typically we have a list of fields. This is your schema, your application schema, but
it's managed by your application code. So we see the list of fields and the list of values. One important part: they are not only values and fields, they also have specific types, like strings, numbers and dates. This means that everything you manipulate inside your application will be stored under a specific attribute, with a specific type and a specific value. So the same way you manipulate your relational database, your tables and records, you are building documents. Obviously, when I say the same way, it's conceptual. From an API point of view, from the Java programming language, you don't use the same API, but you will use types and structures, an object in Java that you will be able to store as a JSON document inside the database.

One interesting part is that, because we manipulate documents, you have more options as well in the way you organize attributes and values inside the document. One example: you can use arrays. You can, in one single document, put a list of values inside one attribute. In this case the baseball glove has been designed for specific positions on the field, infield, outfield, pitcher, and you see we put that in a simple list. Keep in mind that you will be able to query on all of these fields, including a value in an array. Give me all the gloves that are for outfield: you have the category, which will be glove, and you query on the position, which will be outfield. So you can do a where clause with the same concept as in a relational database.

Not only can you have a simple list of values, but, and this is key, you can embed more complex objects inside one single document. Here, for example, still with the glove, we add a new attribute: which professional players have endorsed this specific brand or these specific gloves. We have one attribute, and inside this attribute you have a sub-document, allowing you to model more complex objects in one single
document. What is interesting also to see here is that I have complex attributes and complex values, and I can change the schema on the fly. If we go back, you can start with a very simple document like that, and depending on your requirements and your application, just add a new attribute that is a list, or a new attribute that is a list of documents. When you manipulate your objects in your application, in Java, most of the time the Java objects you manipulate in the cache of your application, or behind the UI, behind your forms or your reports, will be the documents you store inside your database. It makes development a lot easier, and it removes the complexity of having many joins to build behind the scenes. Sometimes you don't see it; most of the time we don't see it, because we use Hibernate or JPA or an equivalent object-relational mapping tool. Here it's a more direct mapping, with one document as a complex object that represents your business.

As I said, one single type of document can manage many different structures. We see the bat and the glove, each with its category, each stored in one single document, but you can query across the different attributes. If you look for that, you will get the bat and the whole list of properties that are specific to this object; the same if you look for the baseball, the same for a glove. Obviously your application has to be able to deal with the different attributes, but the database natively uses these attributes. You don't have complex manipulation to do, building a kind of key-value store with property-value tables inside your schema, to be able to add, remove or change the properties of a specific product. So document flexibility really makes life a lot easier for the developer. What you have to do is just think about how you build your forms or your HTML pages; inside your
documents you just add features, you just add attributes, as many as you need, and you send that to the server. This is where it used to become complex, but now we simplify a lot, because you store a document almost as it is inside your application.

So let's talk about document design, now that we have seen why documents are interesting, or in which cases they are interesting, one example being when you have polymorphism in the data. Documents provide flexibility and performance. I talked about flexibility; you have seen it in the previous slides. We also talk a lot about the performance of your application, and one of the reasons is that you avoid joins. Joins by definition are not bad, but keep in mind that many deployments of MongoDB are built on many servers, so you have documents that are distributed across many physical machines. If you supported joins, you would not know where the related documents are. It's typically what happens with relational databases: when you start to have a distributed or partitioned database, it makes joins very complex, or you need very complex business or technical logic inside your server, inside your schema, to be sure that all the data you query with the join are on the same server, to avoid the complexity of going to multiple nodes to get the records. We avoid that by just saying: embed as much as you can inside one single document. That way, when you query, you get everything in one single pass. We also have many options or operators to modify and add information inside the document.

As I said before, the relational schema was designed first for storage, but it also provides some interesting path queries, and joins are still interesting; it's just a different way of working. I will say that the hardest part when you start to work with
MongoDB is not the technical part; the hardest part is the document design, because we don't have a magical analysis tool, or a rule saying that you want to use a specific normal form, such as third normal form, that will guarantee that you don't duplicate data, that you don't repeat data in the same columns, and this kind of thing. With documents, the only thing that matters is how you use your documents: what do you do with your data? That is really what is important when you build your application.

So think about how you manipulate data. Do I want to do dynamic queries with many fields? Do I want atomic updates? Do I want to do aggregation, even complex aggregation? In addition, what is the volume? What is the schema itself? How do you read and write the data: how much writing do you do compared to reading? What types of queries and updates do you do? What is the lifecycle: how big will your data be in a month, in six months, in a year, in ten years? It will have an impact not only on the database and the servers themselves, but also on the documents.

We have to keep in mind, talking about the read/write ratio, that today we see more and more applications dealing with a large volume of data, for example application logs, with terabytes of data ingested into the database every day. You want to be sure that you can query it, so it's not only about storing the data; it's also about preparing the work to be sure that you can query it in a smart way and make sense of your data. So it's quite an interesting exercise to understand what your data set is, how you work with your data, and which types of queries you do.

An important part, and this is key when you compare with a relational database: changing the schema is a lot easier. If you make a mistake today, you won't necessarily pay the price very heavily in a few months or a few years, as long as your data is still inside
the database. So let's take some examples of documents and patterns we can use. For this I took a very simple example: a lending application for books, with books, publishers, and patrons or customers who take the books out of the library. In reality we always have relations, so I will always focus on the relations, because we all know how to design a relational database, and the idea is to see how it translates to documents.

Let's start with the patron, the user who wants to borrow a book, and his address. If you design that in a relational way, you will have the customer or patron and the address in two different tables. You can do the same here: one document for the patron and another one for his address. It works, except that every time you want to get the information about this specific patron, you have to get one document and then another document; remember, we don't have joins. So instead of that, just embed the address inside the patron document. It's quite easy: you just serialize an object inside another object. In Java you take that specific structure and save it inside the database.

If we do one-to-many, it's almost the same approach. You have one customer with a list of addresses inside the document. So again we have one specific document that is the patron, but inside it you have the list of addresses. What is interesting is to think about what is happening, because we store the address in every customer: that means if we have a family, we duplicate this information, and every time they move we have to update it in several places. Is it bad? Is it good? Honestly, it is just how often you change this data
that is key. It's really not about how it's designed; it's more about the fact that if my application, and this is probably the case here, is reading the data 99% of the time, it's okay. I don't care, because if I need to update a specific address or add a new address, I have everything I need in the API to manipulate this array.

We will take more examples with the books themselves. Publishers, as you can guess, publish many books, but a book has one single publisher. So how do you represent that? Let's take an example with one of the MongoDB books. You have these different attributes, like the title, the author, the published date, the number of pages, the language, and the publisher, which in this case is O'Reilly. The first approach when you start to design your application is to take this and save it as it is, or at least that is the first part of the design. So you say: okay, let's take the book, the title, the author and all these attributes. But one of the issues here is that, as we said, the publisher has many books, so all the attributes of the publisher would be repeated in full in every single book. Is it good? Probably not in this case, because the name of the company may change, or the location, or anything else, and just in terms of the volume you store it's quite big.

So in this case you kind of normalize something that was initially denormalized, by creating two documents, the publisher and the book, and what you have to do is choose how you want to make the relation between them. Here we don't have a relation yet: I didn't put any attribute in the publisher to point to the book, or in the book to point to the publisher. One way is to take the publisher ID: for O'Reilly I have added one field, _id, and I also put that as publisher_id in the book itself. So now we have a kind of join, not a join, sorry, we have a reference between the two
documents. But you can do the opposite: you can say, let's put an ID on the book and put this ID inside the publisher, inside an array, because I can create a list of values in one single document. In this case the publisher holds all its books. This is where, again, you have to think about what your data is, how it is used, how big the data set is, how much information you need, and how you access it. An array of books inside a publisher makes sense when "many" is just a few of them, a handful, not thousands and thousands. What you can say is: if you know the size of the list you have to put inside an array, it's a good idea; if it will continue to grow, it's probably a bad idea. Typically, in this case, the publisher publishes new books every month, so the list will continue to grow.

From a technical point of view also, and I'm not even talking about design here, just about how MongoDB works, we have a physical limit of 16 megabytes per document, so you may hit this limit for one single document by adding books and books and books to it. Also, given the way it's organized and the way it's indexed, it's probably not the best approach. On the other hand, when referencing a publisher in a book, we know the size: there is one single publisher for a book, so it makes sense and it's useful. And where you understand exactly what the size will be, for example the tags that you can put on a book, it could be a good idea to put them inside an array.

So this is something new, something that you cannot do today in a relational database, and something you have to think about when you design your application: where do I want to put the keys, and in which order do I want to organize my access to the data? I will show you in a minute that we can go further than that.

So, many-to-many. Typically your customer can have multiple books, and a book can be lent to multiple customers, but at
the same time we know, from the application business logic or just from what we do with books, that you cannot have thousands of books checked out at the same time. So what you do in this case is take the customer and the books, and add inside the customer itself the list of books with the date of each checkout. It's great, but what do I have here? I just have the ID of the book. I have a list of checkouts sorted by date, where the first attribute is an ID and the second attribute is a date. If I want to get the list of books for this specific customer, I will only retrieve the IDs. So I want to put more attributes in the same document, to get what we call data locality. In this case I improve my schema: in the list of checkouts I have not only the ID of the book, but I also duplicate a part of the information, like the title and the author, and I keep the date.

Why is it interesting? It's interesting to get everything in one call, but more than that, it's about how you design: now you just focus on what you do with your data. Typically you know that when you look at a specific customer, you probably need to get the list of books he has borrowed lately from the library, and what is interesting inside the book: the title and the list of authors. You don't need to get all the information; you just choose the fields that you need for your application. What is interesting and important to understand is that in this case the values are immutable once the book has been checked out. A book has a specific title that you will not change, a specific author that won't change, and the checkout date is the date when the customer took the book. It makes a lot of sense to just put that in one single document, because you will access it all the time in a read manner, and you will probably never modify the existing values. What you will do is add more values to the array. So you can add and remove values if you
want, and we also have operators to manipulate these specific things.

So if we look at referencing versus embedding the data: embedding, if we really want to compare it to a relational database, is like a pre-join. We kind of aggregate all the data that we need in one single document. When we have one single document with an embedded document, itself with an embedded document, operations at the document level are very easy, because you can access all the attributes, the sub-documents, and the attributes of the sub-documents. You can have as many levels as you want, or at least as many as your brain can manage. It's really all part of one parent that you manipulate, so it makes life easy. But sometimes you lose some flexibility, especially when we are talking about updating many documents or many attributes very often. So once again you have to think about how you manipulate your data.

One thing that I wanted to show, and that is quite interesting, is that you can enrich the value you are denormalizing. We could do the same with books and authors as with books and publishers. In this case I have books, and I also have a list of authors inside another collection of documents, where each author has a specific ID, a specific name and a location. Inside the book I created a sub-document, and this sub-document contains the ID of the author but also, at the same time, a name. For whatever reason, authors could use a nickname, or the name they had before it changed when they got married, or something else. If you look at it, when the book was published the name of this author was maybe A; today the name is B, but the name of the author as published is really A. So you want to store the name as it was at the moment you denormalize, at this specific time. You can find many examples of embedding data to make life easier when you manipulate the information.

And sometimes, and this is the case here, we have inconsistent data in the database, because in one place, in the book, I have a name that is K. Chodorow, and inside my author document I have another name that is Kristina. Is it good? Is it bad? Honestly, it really depends on your application. In this case she wanted to be known as K. Chodorow when she wrote this book. At the same time you have the ID, so you can always check: if you need to do a massive update, or if you want to say "give me the real name of the author", you can always get this information.

If you think about it, you have many other use cases where it is even simpler, where you will simplify your life as a developer of a standard application. Talk about an invoice, or an order management system. When you publish the order or the invoice, you have a specific name, address, list of items, item descriptions. All of this has been sent to the customer in a PDF, in an email, or on the website, and you want to be sure that the names you use in the system are the names that the person has received, or the labels that the person has received, because most of the time, once an invoice has been sent to the customer, you cannot modify it. So in this case, by denormalizing and storing the document entirely, we just make life easier, instead of having to manage intermediate levels or a kind of checkpoint inside your database.

If you want to go further, the first thing will be to download and use the product. We also have a lot of information in the documentation, where we document specific use cases: product catalogs, application logs, social networks, content management and this kind of thing, as well as webinars and ebooks. Tomorrow afternoon, I think, I'm doing a workshop where you can come with your laptop and discover the product. We have three hours to discover the product. You won't create your own documents; we will use existing documents that use all these complex structures, to understand how you can manipulate them.

So, do you have any questions?
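
The embedding, referencing and partial-duplication patterns described in the talk can be sketched with plain Python dictionaries standing in for JSON documents. This is only a minimal illustration under stated assumptions, not code from the talk: every field name (`addresses`, `publisher_id`, `checkouts`) and every value is hypothetical, and no MongoDB driver is involved.

```python
from datetime import date

# Hypothetical documents; all field names and values are illustrative assumptions.

# One-to-many by embedding: the patron document carries its own list of addresses.
patron = {
    "_id": "patron-1",
    "name": "Jane Reader",
    "addresses": [
        {"street": "1 Main St", "city": "Krakow"},
        {"street": "2 Side St", "city": "Warsaw"},
    ],
    "checkouts": [],
}

# Referencing: the book stores the publisher's _id instead of the whole publisher,
# so the publisher's details live in exactly one place.
publisher = {"_id": "pub-1", "name": "Example Press", "location": "CA"}
book = {
    "_id": "book-1",
    "title": "An Example Book",
    "authors": ["A. Author"],
    "pages": 200,
    "publisher_id": publisher["_id"],
}


def check_out(patron_doc, book_doc, when):
    """Append a checkout that duplicates only the immutable fields the UI needs."""
    patron_doc["checkouts"].append(
        {
            "book_id": book_doc["_id"],            # reference, for the rare full lookup
            "title": book_doc["title"],            # duplicated: fixed once checked out
            "authors": list(book_doc["authors"]),  # duplicated copy, not a shared list
            "date": when,
        }
    )


def checkout_titles(patron_doc):
    """Listing a patron's books needs only the patron document itself."""
    return [c["title"] for c in patron_doc["checkouts"]]


check_out(patron, book, date(2014, 6, 10))
print(checkout_titles(patron))
```

The design choice mirrors the talk's reasoning: the unbounded side (a publisher's books) is referenced by ID, while small, read-mostly data (addresses, recent checkouts) is embedded so one read returns everything the screen needs.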
Yeah, so the question is: what do you do when you are in production and you have to change the schema, or modify the attributes or the structure? It really depends. In many applications, when people know that they will have a lot of evolution, they tag the documents: inside the documents they put a version number, a version of the application schema, just to be able to know "do I have to do something special with it?", and based on the application logic you choose what to do.

You have three ways of doing it. You can do a big, massive update using your code and say "I want to change all the documents to add or modify this type"; this is your application code that will do that. You also have object-document mapping tools, the same way you have object-relational mapping, for example Morphia or Jongo in Java, or Mongoose in Node.js, which help you to say: when I modify a type or modify other attributes, do that automatically for all the documents. Another approach is to do that lazily: you get the document out of the database, and since you have a version number, or another way of knowing, you change only this document. And in some cases you just don't change anything.

Here is an example. Do you know Craigslist? It's a classified-ads website in the US, like an eBay but a less advanced system, and it is very, very heavily used. One of the things they have to do, for legal reasons, is keep all the history of everything that is published on the website. So they put an archive inside MongoDB, while their website, which is on MySQL, continues to be improved. The live site keeps only, I think, three months; I don't know exactly if it's three months or six months. But when they modify the structure there, they don't touch the millions of existing documents in this archive, because this is only used for BI, this is only used for
analytics, so they don't really care if an attribute is not there or if an attribute has a different type; the query will just react differently. I have a practical question: how would you design something similar? For example, we have books and customers, and we need to quickly search who the readers of a book are, and the other service is, I don't know, searching the books which were read by a customer, so we need to search both ways. So can we put the book ID on the customer and the customer ID on the book, or how would you resolve that? So you can for sure put the ID on both sides, but after that it's really... I have to rephrase: so you want to search by who has read what? OK, so readers of books, and books read by a user. The way I would do it: probably, unless it's a library where you don't have that many users, I will duplicate the information in both. I will have an array, well, not an array, not inside one single document: we have a set of documents that contain the reader and the list of books he has read, and I can query that in addition to the standard link to the books inside the user. What is interesting about this question, and it's kind of hard to design on the fly, but what is very important with typically this kind of question, is: do not hesitate to denormalize. It's not an issue to duplicate the information multiple times, as long as you have the space on disk. It's not an issue to have, in the user, the IDs of the books. In the books, I would not put all the users, because you don't know how big it will be, so you will get too fast to 16 megabytes, for example. But I will put another document that is a kind of log of all activity; this one will be very nice for analytics, for this kind of query. Other questions? MongoDB specifies something like write concern. I don't know when... [unclear] OK, thank you. When, as a developer, should I care which write concern is right? OK,
so first I will explain what the write concern is. MongoDB has been designed to be highly available and scalable, so it has two concepts. One of the concepts is sharding, which allows you to distribute the database over many nodes; this will give you scalability. The other part is replication. Replication, as you can guess, and I think this term is better known by users, is just that we copy the data automatically for you. So inside one shard, or one partition of the database, you have multiple copies of the data. It's what we call a replica set, and you can have many of them. Let's say, for simplicity, we have three nodes in this replica set. When you save inside the database, you only save on one node, which is at this specific moment in time the primary node for this specific operation. By default you will only save on this node, and it gives control back to the application saying "I have saved the document", without checking whether it has been copied to the others or not. It will eventually happen, but you don't control it. So if your system is very reliable, you will say "maybe I can save only on this node", or the specific information is not that critical: suppose your application is just ingesting tweets. If, exactly when you are saving this document, this box explodes without copying, you may lose some data in this case. So what will you do if you say "I want to be sure that I never, never lose any data in my system"? You will use what we call the write concern to say "I want to write, but give me the feedback that it has been successful only when it has been written on all three nodes". It depends on your application. Suppose you build an e-commerce website, and you capture all the clicks of the user. When you put something in the cart: click, I choose this product; click, I choose this product; click, I show this product. Is it that bad if, in a very, very unfortunate case, meaning that when you save on the server this server is just disappearing from the network with a hardware issue, you just
say "yes, reselect it"? So you will say "I don't need any write concern, because in all probability it will never happen, and it's not the core of my business". It's the same when you look at things like session management on your website; that's typically the kind of stuff where this happens. But what you want to be sure of is when you have selected all the products and you do the checkout: I have that in my cart, and I want to be sure of everything that I selected and booked at checkout. There I will use a write concern, just to be sure that this specific operation has been not only saved on one single node but copied on a replica. And in the way you do replication and write concern, you can even choose in which data center, just to be sure: you have some nodes in this specific data center, some nodes in this other data center, and you will say "I want to have at least one copy here and one copy here". Does it answer your question? It's really the durability options: as a developer, you choose how you want to be sure it has been saved, on disk, on how many machines. OK, thank you. Any other questions? No? [unclear] So I'm here for the three days, so do not hesitate if you have any questions, and see you tomorrow if you come to do the workshop. Thank you.
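The trade-off described in the write concern answer can be illustrated with a toy model. The sketch below does not use the MongoDB driver: `ToyReplicaSet` and its methods are hypothetical names, and the parameter `w` simply stands for the number of nodes that must hold a document before the write is acknowledged, under the simplifying assumption that the remaining copies happen later in the background.

```python
class ToyReplicaSet:
    """Toy model of a replica set: node 0 is the primary, the others are secondaries."""

    def __init__(self, node_count=3):
        self.nodes = [[] for _ in range(node_count)]  # one list of documents per node
        self.pending = []  # (document, nodes already holding it) awaiting background copy

    def write(self, doc, w=1):
        """Store doc on w nodes (primary first) before acknowledging the write."""
        if not 1 <= w <= len(self.nodes):
            raise ValueError("w must be between 1 and the number of nodes")
        for node in self.nodes[:w]:
            node.append(doc)
        if w < len(self.nodes):
            self.pending.append((doc, w))  # remaining copies happen later
        return {"acknowledged_by": w}

    def replicate(self):
        """Background replication: copy pending documents to the remaining nodes."""
        for doc, already in self.pending:
            for node in self.nodes[already:]:
                node.append(doc)
        self.pending.clear()

    def lose_primary(self):
        """The primary disappears before background replication has run."""
        self.nodes.pop(0)
        self.pending.clear()

    def surviving_events(self):
        """Names of the events still present on at least one remaining node."""
        return sorted({doc["event"] for node in self.nodes for doc in node})


replica_set = ToyReplicaSet(node_count=3)
replica_set.write({"event": "click", "product": "p1"}, w=1)    # fast, may be lost
replica_set.write({"event": "checkout", "cart": ["p1"]}, w=3)  # slower, on all nodes
replica_set.lose_primary()

print(replica_set.surviving_events())  # ['checkout']
```

In this model the click written with `w=1` is lost together with the primary, while the checkout written with `w=3` survives, which mirrors the choice described in the answer: a weak guarantee for high-volume, low-value writes and a stronger one for the operations the business depends on.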

Info

Channel: 33rd Degree

Views: 22,653

Rating: 4.9276018 out of 5

Keywords: 33rd Degree

Id: csKBT8zkRf0

Length: 47min 44sec (2864 seconds)

Published: Mon Sep 29 2014