Understanding MongoDB

Video Statistics and Information

Captions

Now we're going to talk about MongoDB. MongoDB is pretty popular in corporate environments, and you might find a need to integrate it with your Hadoop cluster, or with Spark, or something like that. It's easy to do, so let's go find out what MongoDB is all about, what's different about it, and how to integrate it with your Hadoop cluster.

Let's talk about MongoDB next. MongoDB is a popular choice in the corporate world in particular because it is built by an actual corporation that actually supports it, as opposed to just being kind of out in the wild and open source. MongoDB's name comes from "humongous data". Humongous, get it? Which is kind of weird, because really what sets MongoDB apart is not the fact that it can handle big data but just its document-based data model, which is very flexible, and we'll talk about that more shortly. So don't let the name fool you: other NoSQL databases do just as good a job at managing big data.

Where are we in the triangle of the CAP theorem here? Well, MongoDB sits on the consistency and partition-tolerance side of that triangle. Since it does have to deal with big data, partition tolerance is something it just has to do, and MongoDB chooses to favor consistency over availability. So MongoDB has a single master, a single primary database that you have to talk to all the time to ensure consistency, but if that master goes down, it will result in a period of unavailability while a new primary database is put into place.

So the big thing that's different about MongoDB is that you can stick pretty much anything you want into it. Basically any JSON blob of data you can shove into a document in MongoDB. It doesn't have to be structured, and you don't have to have the same schema across each document; you can put whatever you want in there. So here's an example of what an actual MongoDB document might look like. Let's say that we want to store blog posts in a MongoDB database. Well, this is what it might look like, and this
is really what it looks like. MongoDB will automatically give you an _id field that's just automatically appended to your document and contains some unique identifier for you. That's done because there is nothing in MongoDB that says you have to have some unique field in your document at all. Within that document we might have a title, the content of the blog post itself, and then a comments field that contains an array of other documents. So this is an example of an embedded document, where we have a document representing a comment that itself contains the name, email, content, and rating, and I could actually have multiple of these embedded within this blog post document. So that's a little concrete example of what a document might look like in MongoDB.

Like I said, no real schema is enforced in MongoDB at all. You can have different fields in every document if you want to. Obviously that's not necessarily a good idea if you want to actually do fast lookups in your database, but you can. You don't have to have a single key value that's some unique identifier, like you would in Cassandra, but you can create indices on any fields that you want, and you can also create indices on combinations of fields. So one nice thing about MongoDB is that it's very flexible in how you can index its data to achieve fast lookups on whatever queries you might be doing. Obviously, if you want to actually shard your MongoDB database, which is how they talk about horizontally partitioning it so that you have different ranges of data on different servers, then you have to have some unique index to do that sharding on, and we'll talk about that more in a bit.

So with MongoDB you have a lot of flexibility in what you can store in it, but with great power comes great responsibility. Just because you can shove whatever you want into MongoDB doesn't mean you should. You still need to think about what queries you're going to be performing on this
database and design your database schema accordingly. So think about what indices you might need for fast lookups for the queries you're going to do. At the end of the day it's still a NoSQL database, so you cannot do joins efficiently, and you want to make sure your schema is denormalized as much as you can.

In the MongoDB world we talk about databases, collections, and documents instead of databases, tables, and rows. This kind of gets away from the notion of there being some sort of fixed schema, which is kind of implied in the words table and row. So a MongoDB database contains collections, and a collection contains documents. Instead of tables containing rows, we have collections that contain documents. Conceptually you can think of them the same way, but just keep in mind that collections can contain pretty much anything. The main restriction here is simply that you cannot move data between collections across different databases, so if you do need to reference data between different collections, they need to be within the same database.

If I can editorialize a little bit here: if you go to the MongoDB website, you'll see it's really aimed at more of a corporate environment, and it kind of rubs me the wrong way, to be honest. If you look at the About MongoDB tab, for example, it doesn't really tell you anything concrete. It says, "With MongoDB, these organizations move faster than they could with relational databases at one tenth of the cost. With MongoDB, you can do things you could never do before." Wow, that sounds great to, you know, the sort of CTO that hasn't written code in 20 years, right? But for practical people like you and me, it's not really very helpful. But for corporations this can be a good thing. You want to be able to pay for professional support and have guarantees about support if you need it. So MongoDB has that sort of service available, and at the end of the day it is
still open source, and you can still get the documentation you need as a developer if you just go looking for it, but websites like this just really bother me.

All right, let's talk about MongoDB architecture. The first thing you need to understand with MongoDB is what they call replica sets. Like we said before, MongoDB has a single-master architecture, the idea being that we want to have consistency over availability, but you can have these secondary databases that maintain copies over time from your primary database. As writes happen to your primary database, those writes get replicated through an operation log to any secondary nodes that you might have attached to it. So in this diagram here we might have a primary MongoDB server that your application talks to, and maybe we have a couple of secondary backup nodes in one data center and a couple of secondary backup nodes in some other data center. MongoDB will automatically replicate those operations to those secondaries, so that in the event that the primary goes down, one of these secondaries can take its place. The way that replication chain works is kind of arbitrary; it just tries to figure out which server it can talk to most quickly, you know, where it's getting the fastest ping times from. So you don't necessarily have this sort of structure where you have a primary talking to a secondary and another secondary backing up from another secondary. These arrows could be pointing pretty much anywhere in practice.

The good thing, though, is that if that primary does go down, a secondary can be elected to take its place within seconds. It happens pretty quickly, so you're not talking about massive amounts of downtime in the event of a primary failure. But you do need to make sure you get that primary back online pretty quickly, because if your operation log runs out of space during the time that it's been down, recovering that primary is going to get a whole lot more difficult. So you
know, you still have some operational responsibilities to actually get that back up and running quickly.

And I want to stress again that we haven't even talked about big data yet. What we're talking about here in replica sets is just having a single monolithic MongoDB server, where all of the data sits on that single server, and we're replicating that data to backup servers. So we're not talking about big data yet; we're just talking about durability and actually having backup copies of a single monolithic MongoDB database.

There are a lot of quirks with MongoDB, and it's something that it does get its share of criticism for. One thing is that you have to have a majority of servers in your set agree on who the primary is, so you can't have an even number of servers, because you can't get a majority. That implies that you need to have at least three servers if you want to have replication or some sort of durability, and that can get expensive, right? Maybe it doesn't make sense to actually have three giant servers just to keep your one MongoDB instance reliable. To get around that limitation, they have something called an arbiter node that you can set up in place of a secondary node, where its only job is to vote on who the primary should be in the event of a failure. So that's an option, but you can only have one arbiter node in your cluster, so it's a little bit weird.

The other thing is that your applications need to know about at least a few servers in your MongoDB cluster. An application needs to know about your current primary and a few secondaries at least, so it can actually ask one server who the primary is that it should be talking to. That means that if you're going to be changing the configuration of your servers, or adding more secondaries, or removing secondaries, at the end of the day you need to push that information all the way up to your applications, which can be kind of a pain. And again I want to stress
that replica sets only address durability. We haven't talked about scaling out to big data yet. If your replica set goes down for whatever reason, your database is down. There is a way to set things up so that you can read from secondaries, but generally that's not recommended, so we're just talking about durability here.

One neat thing about replica sets, though, is that you can set up something called a delayed secondary. The idea there is that you can set up a time delay on the replication between your primary and a specific secondary node, and you can do that as insurance against doing something stupid. For example, let's say I set up a one-hour delay between primary and secondary replication, and I do something really dumb like accidentally drop an entire database on my MongoDB instance. If I can catch that quickly enough, I can shut things down and restore from that delayed secondary to get back to where I was an hour ago, and restore that information relatively quickly.

Let's talk about big data; that's why we're here. For actually scaling out data across more than one server with MongoDB, we need to set up something called sharding. The way sharding works is that we actually have multiple replica sets, where each replica set is responsible for some range of values on some indexed value in my database. So in order to get sharding to work, it requires that you set up an index on some unique value on your collection, and that index is used to balance the load of information among multiple replica sets. Then on each application server, whatever you're using to talk to MongoDB, you'll run a process called mongos. mongos talks to exactly three configuration servers that you have running somewhere that know about how things are partitioned, and uses that to figure out which replica set to talk to in order to get the information that I want. So let's take a minute to look at its architecture here. We can have many application
servers; these might be web servers on some big web app, for example, where each of your web servers is running an instance of mongos. mongos has some communication with three configuration servers you're running somewhere. These can run on top of other servers you might have, since they don't have to do a whole lot of work, but you need to have three of them. From there it can figure out which replica set to talk to in order to actually read or write the information for a given, say, user ID or something that you're indexing on, and that replica set in turn can take care of durability, actually backing that data up to a bunch of secondary nodes that it can fail over to.

Now, mongos is running something called a balancer in the background, so over time, if it finds that it doesn't have an even distribution of values in whatever field you're partitioning on, it can rebalance things across your replica sets in real time. So in this example we might have replica set 1 that's set up to handle user IDs from the minimum value to user ID 1,000, maybe replica set 2 is handling user IDs 1,000 to 5,000, and replica set 3 might be handling user IDs 5,000 to whatever the maximum value is. These can change and get rebalanced over time as the need arises. So that is how MongoDB handles big data. You can see it's actually pretty complicated, but to be fair, if you compare this to something like HBase, where you're using something like ZooKeeper to maintain these sorts of configuration, it's not that different.

Sharding itself has some quirks in MongoDB. For example, auto-sharding, where it's trying to rebalance things over time, sometimes fails. There is a rather nasty failure mode called a split storm, where it simply cannot split things quickly enough and it just keeps trying to rebalance things over and over and over again, and your entire cluster goes down. That's a bad thing. Another failure mode is if your mongos processes on
the front end get restarted too often, things will never rebalance. It actually takes a look on each mongos process over time to see how data is being distributed throughout your cluster, and if you keep restarting it, it basically restarts the count on those things. So if you are restarting those processes too often, and depending on how you set up your web server that might be pretty often, things won't be balanced properly. So it's very easy to get into a bad state, and you'd better make sure someone is really keeping an eye on things.

From an administrative standpoint, you do need to have exactly three config servers, and if any one goes down, your entire database goes down. This really isn't any different from HBase, where you have master nodes that are maintained by ZooKeeper. So again, we're intentionally trading away availability for consistency. And the other thing, like I said before, is that even though MongoDB offers a very loosely defined document model, it doesn't mean that your document model should be loose. If you're going to be doing sharding and actually handling big data, you still need to think about having some single primary key that is unique to each document that you're going to be sharding on.

Now, we've talked a lot about the limitations of MongoDB, but there are some very neat things about it too. Again, the big plus of MongoDB is that it's not just a NoSQL database, but it can store pretty much anything you want. It also has a shell that has a full JavaScript interpreter, so there's a lot of power there; you can actually run JavaScript functions across your entire MongoDB database pretty easily. It also supports many indices, although you're still discouraged from doing more than two or three in a given collection, and you can only have one that's used for sharding. But you can actually set up things like full-text indices for doing efficient text searches across MongoDB, so
again, MongoDB is really a good choice for things like storing big documents of information or text. You can also have spatial indices, so you can actually do searches across latitudes and longitudes, for example, and try to figure out what database objects intersect a given position, which is kind of a neat feature.

Another thing about MongoDB that's worth talking about is that they're kind of trying to make MongoDB into a replacement for Hadoop to some extent. It actually has built-in aggregation capabilities, you can run MapReduce code on MongoDB itself, and it has its own file system built in as well, called GridFS, that's kind of like HDFS in some sense, where it's storing documents within MongoDB and chunking those documents up kind of like HDFS does. So MongoDB's value proposition is in part the fact that for many applications you might not need Hadoop at all; MongoDB might be all that you need. But if you are integrating MongoDB with Hadoop or Spark or something like that, it's easy to do, as we'll see in a moment, and the good thing is that it can actually leverage some of these features in MongoDB to do things more efficiently. For example, if we're tying MongoDB to a Spark dataset and you're telling Spark to go perform some MapReduce-y tasks on MongoDB, that work might actually get pushed down to MongoDB itself, so it might not actually have to use Hadoop at all. That can lead to more efficient data analysis than you might be able to get from other NoSQL solutions that are integrated with something like Hadoop or Spark.

And there is actually a SQL connector available for MongoDB, so you can write full-blown SQL against it if you want to. But bear in mind it's still not really a relational database. Even if you have the ability to execute SQL commands against it, you still can't do efficient joins and can't deal with normalized data very efficiently. So with
that, we've talked a lot. Let's actually go play around with MongoDB. Let's look at integrating MongoDB with Spark and get some data into it, and then we can play around with the data in MongoDB and see how it works from within the shell. So let's go have some fun.
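
The blog post document described in the captions can be sketched as a plain Python dictionary, which maps directly onto the JSON shape MongoDB stores. This is only an illustration: the field names (title, content, comments, name, email, rating) come from the talk, while every value, the placeholder _id strings, the second document, and the comment_count helper are assumptions made up for this sketch. It does not connect to a MongoDB server.

```python
import json

# Hypothetical blog post shaped like the document described in the talk.
# All values are invented; MongoDB would normally generate _id itself on insert.
blog_post = {
    "_id": "placeholder-id-1",
    "title": "A blog post about MongoDB",
    "content": "This is a blog post about MongoDB.",
    # An array of embedded documents, one per comment.
    "comments": [
        {
            "name": "Frank",
            "email": "frank@example.com",
            "content": "This is the best article ever written",
            "rating": 1,
        },
    ],
}

# No schema is enforced, so another document in the same collection
# could carry entirely different fields (again, a made-up example).
other_post = {"_id": "placeholder-id-2", "title": "Untitled draft", "tags": ["draft"]}


def comment_count(doc):
    """Count embedded comments, tolerating documents that lack the field."""
    return len(doc.get("comments", []))


if __name__ == "__main__":
    print(json.dumps(blog_post, indent=2))
    print(comment_count(blog_post), comment_count(other_post))
```

Because documents in one collection can differ, code that reads them usually has to tolerate missing fields, which is why the helper uses a default instead of indexing directly.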
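
The majority rule for replica sets mentioned in the captions can be illustrated with a few lines of arithmetic. This is a simplified sketch, not MongoDB's actual election protocol: the function names are hypothetical, and it assumes every member (including an arbiter) has exactly one vote and that a primary needs a strict majority of all voting members.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest number of votes that is a strict majority of the set."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable: int) -> bool:
    """True if the members that can still talk to each other form a majority."""
    return reachable >= votes_needed(voting_members)


if __name__ == "__main__":
    # Two members: losing one leaves 1 of 2, which is not a majority.
    print(can_elect_primary(2, 1))
    # Primary + secondary + arbiter: losing one leaves 2 of 3 voters.
    print(can_elect_primary(3, 2))
    # Four members tolerate only one failure, the same as three members.
    print(can_elect_primary(4, 2))
```

Under these assumptions, adding a vote-only arbiter to a two-server setup gives three voters, so the set can still pick a primary after one server fails without paying for a third full data-bearing server.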
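
The range-based sharding example from the captions (user IDs up to 1,000 on replica set 1, 1,000 to 5,000 on replica set 2, and 5,000 upward on replica set 3) can be sketched as a lookup over sorted split points. This is a toy model of the routing decision, not how mongos is implemented: the names SPLIT_POINTS, REPLICA_SETS, and route are hypothetical, and treating each boundary value as belonging to the higher range is an assumption, since the talk does not say which side owns a boundary.

```python
from bisect import bisect_right

# Boundaries between ranges, taken from the example in the talk.
SPLIT_POINTS = [1000, 5000]

# One replica set per range: (min, 1000), [1000, 5000), [5000, max).
REPLICA_SETS = ["replica_set_1", "replica_set_2", "replica_set_3"]


def route(user_id: int) -> str:
    """Return the replica set whose range contains user_id."""
    # bisect_right counts how many split points are <= user_id,
    # which is exactly the index of the owning range.
    return REPLICA_SETS[bisect_right(SPLIT_POINTS, user_id)]


if __name__ == "__main__":
    for uid in (42, 1000, 4999, 5000, 123456):
        print(uid, route(uid))
```

Rebalancing, in this picture, amounts to changing the split points over time so that each replica set ends up holding a similar share of the data.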

Info

Channel: Frank Kane

Views: 191,914

Rating: 4.8709679 out of 5

Keywords: mongodb

Id: UFVFIKduXpo

Length: 16min 54sec (1014 seconds)

Published: Thu Feb 23 2017