Exploring the Aggregation Framework in MongoDB

Captions
Thank you for coming to this webinar, and good morning, good evening, or good afternoon, wherever you may be in the world. My name is Jason Minich and I'm a senior consulting engineer with MongoDB. What that means is I travel around and visit all different kinds of MongoDB customers, from very small startups to large corporations, and help them follow best practices, design new applications, set up tools like Ops Manager — all kinds of things. Every week is something new and interesting, but today I'm here to talk about the aggregation framework with all of you. Before we get started I wanted to take a few seconds to run a survey so I can get an idea of the different members we have in our audience. The results are coming in — we've got about half the people — and it looks like this is a very heavily developer-focused audience, which is wonderful. Thank you for responding; let's get started with the presentation.

Before I get into things, some basic ground rules. This is definitely a 101-level, beginner's talk. I'm going to assume you know some basic things about MongoDB — you can get into the shell, you can run some queries, you understand documents and schemaless design and those kinds of concepts — but that you're basically new to the aggregation framework. If you already have some experience with the framework, some of this talk is going to be a little basic for you, although we do get into some advanced things; I just want to set that expectation. The other thing I want to point out is that I have quite a bit of content and we only have about an hour, and I do want to leave time for questions, so we may not get through all of it. That said, this deck will be available, as well as a recording, after the webinar.

In terms of the agenda: we're going to talk about what it means to do analytics with MongoDB; we're going to look at the aggregation framework — what it is, what its syntax looks like, and what some of the various operators are; then we're going to use some actual US census data to look at some real-world examples of using the aggregation framework to get more information out of raw data; we'll look at some different options on the aggregation framework; and towards the end I want to cover MongoDB 3.2, because there are quite a few new aggregation framework features that people are really excited about. There's something called $lookup, one of these new features, that lets you do basically simple joins, if you want to call them that. Hopefully we'll have time to get into that, and if we're running short I may skip some things so we can at least touch on it, because I hear a lot of interest from customers about it.

So what does it mean to do analytics in MongoDB? As you all know, MongoDB is a document-oriented database, and we work with documents in JSON notation, as these little JavaScript objects. The basic capabilities of a database are the ability to create data, query that data to read it, update it, and get rid of data that no longer matters. But in all real-life applications you actually need to do more complicated things with your data.
You're looking to do things like count records that have a similar value, filter things, look for averages, group things — all kinds of operations you need to build your applications. I see a little chat question about whether 3.2 changed from BSON to JSON — no, the on-disk format of data in MongoDB is BSON; I'm talking about the logical format that your programs use and the way you think about it. The aggregation framework is really a way for you to do these types of analytics. I also want to point out that the level of analytics I'm talking about here is working at the low, raw-data level. We're not talking about higher-level BI tools with rich visualizations and fancy user interfaces; the aggregation framework provides the low-level operators and functions to massage your analytics out of your raw data so you can then feed the results to those visualization tools.

A word on the sample data we're going to use to walk through the aggregation framework. We grabbed some data from the US Census Bureau that contains information from the 1990, 2000, and 2010 censuses, broken down by different areas of the United States — regions, then states, and so on — and we'll look at some of the documents. We're going to drive toward asking a fairly complicated question of this data set: we'd like to find which division of the United States has the fastest-growing population density, where density is basically the division's population relative to its area — take all the states in a particular division, add up their areas, and divide the number of people by that. We can then look at the differences in population density between the different time points — 1990, 2000, 2010 — and try to determine which division is growing at the fastest rate. When we get to the answer you're going to see it's a fairly complicated aggregation, but we're going to start off with some basic things and build up to how you can solve it.

Now, in the SQL world you'd probably write some SQL that starts with something like SELECT ... GROUP BY ..., with a HAVING clause and some other clauses. But of course MongoDB doesn't have anything like SQL — MongoDB is a NoSQL database — so how do you solve questions like this, and simpler ones, using MongoDB? That's where the aggregation framework comes into play. The little cartoon on this slide actually does a good job, at a sort of meta level, of describing the aggregation framework, because in the end it is just a framework: a set of functions, operators, and features within MongoDB that you can leverage to run complicated queries that group and count and so on, but it doesn't really do anything unless you program it to.

We're going to start by looking at the core concept in the design of the aggregation framework, and it's actually really simple — there's just one idea to keep in your head as you build your aggregations, and that is the pipeline. If you're familiar with the Unix and Linux world, think of the pipe operator, which redirects the output of one command to the input of another.
You're all somewhat experienced MongoDB users, so I'm sure you've logged onto one of your production or test servers many times wondering whether mongod was running, and typed a command something like ps aux | grep mongod — list the processes, then find the ones that contain the string "mongod". In this case the output of the ps command becomes the input of the grep command. This is the fundamental concept of the aggregation pipeline, and really the one thing you need to keep in your head.

Let's dig a little deeper: how do we apply that concept of a pipeline to an aggregation? Within the aggregation framework, when you run an aggregation, you send your data through a series of document transformations — different kinds of operations — and each of these transformations occurs in one stage, such as a $match, a $project, a $group, or a $sort. We'll go into the details of what these do, but you can probably guess roughly from their names. The output of a $match becomes the input to a $project, the output of the $project the input to the $group, and so on. Keeping this in mind, it's very easy to build complicated aggregations by concentrating on each stage, looking at its input and output, and then stringing the stages together to do very complicated things. We'll look at some techniques you can use on the development side as we get into more examples.

One thing to note about this rich library of operators — they do all kinds of things: group, summarize, filter — is that they all start with a dollar sign. This is a syntactic note: it's very similar to the standard MongoDB query language, where you use things like $eq and $lt. That syntax is simply built upon by the aggregation pipeline, so if you're familiar with it, the aggregation framework should feel very natural.

So let's take a look at an example of an aggregation. Here I have a collection called orders, and to run an aggregation I call the aggregate method. This particular aggregation consists of two stages: a $match and a $group. We'll get into the details a little later, but as this example shows, what you really need to focus on is the input and the output of each stage. In this case the $match takes as input the entire orders collection; it iterates through the collection looking for documents where the status field equals the string "A", so the output of the $match is precisely those documents where status is "A" — exactly as if you did db.orders.find({ status: "A" }). The $group then does something a little more complicated: it groups on the customer ID, taking like documents and summing up their amount values. We'll look at more examples of this as we go.

The syntax for an aggregation in the shell is composed of essentially five parts: the db variable, which points to the current database; the name of the collection; the aggregate method; then the one thing that's probably new and takes getting used to — the input to the aggregate method is an array, so notice the square bracket at the beginning and the square bracket at the end; and finally, what's in that array is a set of JSON objects, each of which is one pipeline stage. That is the basic syntax.
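Written out, the two-stage orders example described above looks roughly like this — the customer-ID and amount field names aren't spelled out in the captions, so cust_id and amount here are assumptions:

    db.orders.aggregate([
      { $match: { status: "A" } },                                     // stage 1: keep only orders with status "A"
      { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }      // stage 2: group by customer and sum the amounts
    ])

Note the square brackets: the pipeline really is a plain array, and each element of it is one stage.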
Now, not everybody lives completely in the shell — I understand that — so I also wanted to show you a quick example of what an aggregation might look like from the Java perspective. If you're coming from the .NET world, C# is very similar; the names of the classes in the different drivers differ slightly, but the concepts are the same. At the bottom of this slide you can see an aggregation over a collection called hospital that groups on patient ID and then does some matching, and up here you can see the same $group, $match, and $sort within the Java syntax. I wanted to show this so you can see what the translation looks like; for the rest of the talk we'll stick with the shell, since it's the lowest common denominator regardless of which driver you're using.

At a high level, there are a number of popular pipeline operators — things that are used quite often — like $match, $project, $group to summarize, $unwind, which deals with arrays, and $sort. $redact is a pretty complicated one that lets you restrict data coming out of a particular stage of the pipeline. We'll also look at $geoNear, which lets you use GeoJSON data to run geographical queries within your aggregations, and there are a few operators that let you define variables. For the beginning of this we'll concentrate mostly on the first four or five, because $group and $unwind are probably the more complicated ones to wrap your head around when you're new to the aggregation framework. One thing I do want to emphasize: as of MongoDB 3.2 — and this is almost a full list — there are somewhere between 80 and 90 operators available, and as MongoDB evolves and tackles more complicated use cases you'll probably see this set expand even more, to take on the new kinds of features and operators people need to solve their analytical problems.

OK, let's take a look at our sample census data and see how some of these aggregations work. We have a few collections in this sample data set. There's a collection called cdata that has a document for each state, and, as I mentioned, there's another collection as well. Within the cdata collection there's information for the different censuses, and that's contained in an array. I'm going to share my desktop quickly to give you an idea of what one of these documents looks like. Here, for example, is one of the cdata documents — this happens to be AL, which is Alabama — and you can see that within the data array there's an entry for each of the years, with population figures and probably household numbers or something like that.
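For reference, a cdata document along the lines of the Alabama one shown on screen might look something like the sketch below. The exact field names aren't legible in the captions, so state, name, region, area, and the data sub-fields are assumptions, reconstructed from how they're referred to later in the talk:

    db.cdata.findOne()      // always start by looking at one of your documents
    {
      "_id" : ObjectId("..."),
      "state" : "AL",
      "name" : "Alabama",
      "region" : "South",
      "area" : 52420,                                 // assumed units: square miles
      "data" : [                                      // one entry per census
        { "year" : 1990, "totalPop" : 4040587 },
        { "year" : 2000, "totalPop" : 4447100 },
        { "year" : 2010, "totalPop" : 4779736 }
      ]
    }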
One of the things you really want to do when you start writing an aggregation is this: you're going to be given some sort of problem, written in English or whatever your language may be, and you need to translate that problem into the operators you'll apply to your data. To even start doing that, you really have to go look at your documents and see what's there — it isn't something that's completely obvious. Every collection is different, and you may even have collections where the documents don't all look the same, so you have to get to know your data. And notice here that we have 52 documents in this collection, so we must have a couple of things that aren't really states — I believe the District of Columbia, and maybe Puerto Rico is in there. Yes, we can see we do have the District of Columbia and a few other things. I just wanted to point out that this kind of inspection is an important thing to do. OK, so we should be back to the slides — and this is what I was just talking about: take a look at your data, look at some documents, see what's there.

There's also a collection called regions in this data set, which basically organizes each state into one of the different regions. Here's an example document from the regions collection: a state, a region — Northeast in this case — and a division. So how can we find out how many states are in each region? In this case we need the $group operator, because we want to take all the documents in the regions collection, group the ones that have the same region — that would be the region key — and then count how many have that particular region.

Let's look at the answer. Up at the top here is what I'd call the "hello world" of the aggregation framework: a $group stage. I want to talk through the syntax, because it's very particular, and if you don't understand it you're going to have a lot of trouble using $group. The input to this stage will be the entire collection. Syntactically, $group has one thing you really need to remember: the value of $group — this is one document whose key, $group, is the name of the operator — is itself a document, and you can think of it as specifying the shape of the documents you want to come out of this stage of the pipeline. By "shape" I mean a template for the structure of those documents: you define a set of keys and what their values should be, and there's some special syntax for how you specify that. In particular, two things. Number one, you'll see an _id key here: that is required in the template document you give to $group, and it points to the key in the incoming documents that you want to group on. The _id for $group always needs to be there, and it can point to a single key or actually be a document with multiple keys to group on. Number two — and this applies to all operators in the aggregation pipeline — you use a dollar sign to refer to keys in the incoming documents. So "$region" says: for each document coming into this stage from the regions collection, take the value of the region key and group on that.

The second key in our output documents we'll call count, and you can see down below the output of running this aggregation: each output document has an _id and a count. For the value of count we use the accumulator operator $sum with a value of 1, which means that as the input collection is fed into the $group stage, every time a new document lands in a group it adds one to that group's count. You can see the output there.
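Put together, that "hello world" group looks roughly like this (the regions collection and its region field are as described above; the sample output line is just illustrative):

    db.regions.aggregate([
      { $group: {
          _id: "$region",        // group on the value of each incoming document's region key
          count: { $sum: 1 }     // add 1 for every document that falls into the group
      } }
    ])
    // e.g. { "_id" : "Northeast", "count" : 9 }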
OK, I see we have some questions, so I'm going to flip over and check those out real quick. There's a question about whether the order of the operators in the aggregation pipeline matters — yes, that definitely matters. I may have seen another question about whether you always have to match before you group. That depends: you might do matching later in a complicated aggregation — you may match, then group, then match again. But the general guideline, if there is one (because everything is very specific to your own data), is yes: if you can filter your data and make the set of documents you feed into the next stage of the pipeline smaller, that's always better, because there's less work down the line. So in general you do want to filter before you do other things, but that doesn't necessarily hold for every possible scenario.

There's a question about the right tools to tune your aggregation pipeline, and we'll look at that a little as we work through this. In general, if you're familiar with using explain on queries, there's an option you can pass to the aggregate method that outputs essentially the same kind of information as explain, and it will show you whether the different stages are using indexes and so on. Tuning your aggregations is basically the same as tuning your queries: you want the right indexes in place, but you also want to think about the ordering of your stages.

I'll take one more question and then move on. This is a good one — Justin asks: when you get to the $group operator, how do you get a unique _id, as you would with a document that has its own _id field that MongoDB puts in? The _id used in $group, while it's called the same thing, is different from the _id in a regular collection. The _id in a regular collection is essentially a primary key; it needs to be unique across the collection. Using it in the $group context, we do not have that restriction — and in fact we couldn't, because $group is trying to group up similar things: you'd expect the input documents to $group to include multiple documents with, for example, the same region, since what we're interested in is grouping them. The use of _id in $group versus _id as the unique key within a collection is really just a syntactic echo, to help you remember that it's the important thing you're going to group on, just as _id is an important thing within your documents. That's really the only connection.
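As a quick sketch of the tuning option mentioned in that answer — explain goes in the optional options document that aggregate accepts after the pipeline array:

    db.regions.aggregate(
      [ { $group: { _id: "$region", count: { $sum: 1 } } } ],
      { explain: true }          // return the plan (index usage per stage) instead of the results
    )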
OK, I'm going to pause on the questions, get back to the content, and come back to check the questions again after a few slides. So, the basic things with $group: you always have to remember the _id, and you have to get used to the syntax of using the dollar sign to refer to keys or fields in the incoming documents. Those are the two main things to remember.

One other thing I want to point out, if you look down at the bottom here: right now we're looking at a simple aggregation with only one stage, but in real life you're going to have very complicated aggregations, and they get big — by big I mean they have lots of stages. This is unfortunately just like the SQL world, where not everything is SELECT * FROM person; in the real world you have very complicated SQL that runs to five hundred or a thousand lines, and within MongoDB's aggregation framework, if you try to do something complicated, you may have complicated aggregations too. No, we don't have a magic tool for that yet, but I'm sure we're working on it. One thing you can do, at least in the shell, is store the different stages of your aggregation in variables. Here I've taken the same thing as above and just said var group = { ... } — my whole $group document — and then passed that variable into my aggregate call. In a simple case like this, with just one stage, it doesn't buy you much, but if you have multiple stages and you're trying to match up your braces, this trick of putting things into variables can help a lot.

So, $group: we've spoken about _id, and there's also a bunch of other accumulator operators you can use in your groupings. You can look for maximums, minimums, and averages. You can use $addToSet to add things to a key in the output document whose value is an array — in particular, $addToSet will only add a value if it isn't in there already, a set here meaning an array where each item occurs only once. $push should be familiar: it just pushes a value onto the array, so there you could have multiple instances of the same value. There are $first and $last, which are useful if you know something about the ordering of your data. And by default all of this processing happens in memory, although there are ways around that.

Actually, back to the question about _id being unique: we can take the fact that it doesn't need to be unique in a $group a little further, and try to use $group to find the total area of the United States. Here's an example, and this may look really odd to you: I'm going to go over the cdata collection and do a $group where my _id is null. That means it matches every possible document and outputs only one document from this aggregation stage. Then, from the input documents, I'm going to build a key called totalArea that sums up the area of each state in the collection, and another key that averages up those values.
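That collapse-everything group would look roughly like this (again, area, totalArea, and avgArea are assumed field names):

    db.cdata.aggregate([
      { $group: {
          _id: null,                        // no grouping key: every document falls into a single group
          totalArea: { $sum: "$area" },     // add up every state's area
          avgArea:   { $avg: "$area" }      // and compute the average area
      } }
    ])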
You can see the output there: in this case I'm collapsing an entire collection of 52 documents down into one document that adds up the area and also calculates the average.

Here's another one, extending that to the region value. I'll $group on region, calculate a totalArea the same as before — the sum of the areas in that region — plus the average area, how many states are in that region, and an array of all the states in that region, and you can see the output down below. In particular, there are two documents with an _id of null: the District of Columbia and Puerto Rico didn't have a region specified in their documents, but Northeast, Midwest, South, and so forth did. So you can use $group to collapse things down to a single document, or to group documents by other attributes within your data.

We're running a little behind on time, so I'm going to skip ahead a bit and talk about $unwind. As I said, this is probably the second operator that's a little tricky to wrap your head around. $unwind is used to flatten arrays. Let's look at the example on this slide: if I have a document like { a: 2, b: [ 1, 2, 3 ] } — so b is an array — and I apply { $unwind: "$b" }, I get back three different documents, one for each value occurring in the array b.

Going back to our data, what I want to do is figure out the total population across all the documents in cdata, by year. Remember that each cdata document has an array called data with three entries, one per year. By calling $unwind I flatten those out, so the output of the $unwind stage is three documents for every one document in the cdata collection — about a hundred and fifty-six documents. Then I group those by year: I'm using my _id to group, saying "$data.year" — going into the sub-document within the data key to find the year. The same dot syntax you use to reach values in embedded documents in MongoDB's query language carries completely over to the aggregation framework. Similarly, I use $sum to add up the total populations, and finally I sort with { totalPop: 1 }. The $sort operator works just like the sort method from the query language, and sorting by 1 means ascending order. You can see the output of something like that here.

Here's another, closer look at unwinding on the census key, breaking down the key elements: given an input document like this, I get three output documents for it. New Jersey happened to only have census entries for 1990 and 2000, so for it I get just two output documents.
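The population-by-year pipeline just described would look roughly like this (field names as in the sample document sketched earlier):

    db.cdata.aggregate([
      { $unwind: "$data" },                            // flatten: one document per state per census year
      { $group: {
          _id: "$data.year",                           // dot syntax reaches into the unwound sub-document
          totalPop: { $sum: "$data.totalPop" }         // add up the population for that year
      } },
      { $sort: { totalPop: 1 } }                       // 1 = ascending
    ])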
There's a question that came in from Andy about being able to branch or split the pipeline at later stages. No, there's not really a way to do that directly. There is something coming later in the slide deck, though: you can dump the output of an aggregation to another collection, so you could have one aggregation that writes to a collection and then two different aggregations that operate over that collection, to get a kind of split. There's also a question about whether there's an operator that's the opposite of $unwind. There isn't a single inverse operator, but you can get the same effect: group on something and use the array accumulators to build those values back up into an array in the output documents. So yes, that's there.

I'm going to move on through the slides. This next one is just a small extension of using $unwind, doing a $match first to look only at the states in the South; otherwise it's basically the same as the previous example. $match is probably the simplest operator — it uses the same syntax as the query language when you run finds — so that one should be pretty straightforward.

Now, this one is a bit more complicated, as you can see — and I apologize, the spacing on the slide got a little off, so it's a bit trickier to read — but let's talk it through. We want to start looking at the population change, the delta, by state from 1990 to 2010, and this is one way to do it. Something I forgot to mention earlier: when you're given these kinds of problems — "find the population delta between these years" — there are usually multiple ways to solve them with the tools the aggregation framework gives you, and some ways are more efficient than others, but there's usually more than one way to put these together. Keep that in mind if you're wondering whether this is the only way to do it.

In this case we $unwind, then sort by data.year. Remember the data array had an entry for each year, so sorting by data.year puts all the 1990 entries first, then 2000, then 2010. Then we group by name and build two keys, pop1990 and pop2010, using the $first and $last operators. These work because, as the stage groups over all the documents with the same name, the first document that hits the group supplies the population value that goes into pop1990, and the last one that hits the group goes into pop2010. Then $project is a way to transform the documents. The value of $project is similar to $group in that you give it a template of the documents you want to come out. Here I'm saying _id: 0, which means I don't want to project the _id key that comes out of my $group stage; instead I want a key called name whose value is the _id from the incoming documents, and a key called delta, where I use the $subtract operator to subtract pop1990 from pop2010. Then pop1990: 1 — the alignment is off on the slide here — and pop2010: 1 say that I want those keys in my output as well. This is the kind of output you get from something like that. $project is actually really powerful for doing data transformations.
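Reconstructed, the delta pipeline described above looks roughly like this (name, data.year, and data.totalPop are the assumed field names from earlier):

    db.cdata.aggregate([
      { $unwind: "$data" },
      { $sort: { "data.year": 1 } },                      // oldest census first
      { $group: {
          _id: "$name",
          pop1990: { $first: "$data.totalPop" },          // first document per state = 1990 entry
          pop2010: { $last:  "$data.totalPop" }           // last document per state  = 2010 entry
      } },
      { $project: {
          _id: 0,
          name: "$_id",                                   // rename the group key
          delta: { $subtract: [ "$pop2010", "$pop1990" ] },
          pop1990: 1,
          pop2010: 1
      } }
    ])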
You might have a consumer who is really interested in your data but needs the documents to look different — they can't handle the names of your keys, say — and $project is the way to transform those documents for them.

There are three other operators that work exactly the same as what you're used to when writing regular find queries: $sort, $limit, and $skip. One note: by default everything happens in memory, and if you have really large sorts you could use up all of that memory, so one of the options we'll get to later, allowDiskUse: true, lets you sort larger data sets. $first and $last I've talked about, and I've basically covered $project — I'm skipping ahead because we're going to run out of time — but again, the syntax for $project is a 0 or a 1 (or true or false): _id: 0 says don't give me the _id, field: 1 says give me this field, and otherwise you can use the same dollar syntax, like "$_id", to refer to values in the incoming documents. That slide is just walking through the example I've already walked through.

Now suppose you wanted to answer a question using some geographical data: we want to compare the population sizes across our three census years for the area within 500 kilometers of Memphis. You use the $geoNear operator. This assumes your data follows the GeoJSON format — if you're familiar with geospatial indexes in MongoDB — which I'm not going to get into during this talk, but here's an example. I'm saying the max distance is 500,000 meters, which is 500 kilometers; "near" this point, which is the coordinates of Memphis; and spherical: true, to use the spherical calculation. Then our old friends $unwind, $group, and $sort massage the data. Running something like this gives you output like that, and you can see the states that were counted are right there bordering Tennessee, in the Memphis area. So that's $geoNear.
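A rough sketch of that Memphis query — this assumes the collection has a GeoJSON point field with a 2dsphere index on it, and the coordinates below are just approximate values for Memphis, not taken from the slide:

    db.cdata.aggregate([
      { $geoNear: {
          near: { type: "Point", coordinates: [ -90.05, 35.15 ] },   // [ longitude, latitude ]
          distanceField: "dist",                                      // $geoNear needs somewhere to put the computed distance
          maxDistance: 500000,                                        // meters: 500 km
          spherical: true
      } },
      { $unwind: "$data" },
      { $group: { _id: "$data.year", totalPop: { $sum: "$data.totalPop" } } },
      { $sort: { _id: 1 } }
    ])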
We've only got about ten minutes left, so I'm going to skip ahead a little. Oh — I mentioned this earlier: if you want to save the output of an aggregation to another collection, you can use $out and just specify the name of the collection. That's a pretty nice feature. One note: every time you run an aggregation with a $out, it overwrites the target collection, so you need to be careful and take that into account when you design your strategies around it.

So, back to that original question: we're looking for the US division with the fastest-growing population density, and we only want to include states with more than a million people and divisions larger than a hundred thousand square miles. Here is the final answer, and you can see we've gone from a very basic group to an aggregation with eight different stages. You can also see the pattern of doing the match first: I'm going over that cdata collection, but I only want states with a big enough population, because that's one of the characteristics of the question. Then there are a couple of different groupings to massage the data — we've looked at each of those pieces — then another filter on the total area, and so forth. The output is something like this, and it looks like the answer to our question is that the South Atlantic division had the fastest-growing population density.

All right, I was going to check the questions, but I'll hold off for now. Here are a couple of the options; I believe I've mentioned most of these already. One thing to note is that the return value of calling aggregate is itself a cursor — just as find returns a cursor, aggregate returns one as well — so you can iterate over it, and you can specify different settings on the cursor, like the batch size. There's the allowDiskUse option, for sorting larger data sets by storing intermediate results on disk, and if you want to see the explain output you can say explain: true. Following good MongoDB API conventions, the options are always an optional last argument to the method: one document containing all these different settings.

All right, MongoDB 3.2 — let's jump into some things there. There's a new operator called $sample, where you specify a size, and it does a kind of random sampling over a given collection; it uses different strategies depending on your storage engine. If anybody has had a chance to download and use our new Compass product: if you fire it up and point it at one of your collections, you'll see you get a sampling of the data, and that does in fact use $sample when it's connected to a MongoDB server running version 3.2 (it has other strategies for older versions). This is very useful if, for example, you just want to browse your data, or if you're in charge of writing tests and want to grab random documents and validate things — I think this could be a really powerful feature for all you testers out there, so check it out.

$lookup is the other popular new feature in MongoDB 3.2, and I'm going to go through an example rather than dwell on the basic syntax. There are a couple of things to know about $lookup. First, in its output you get a new field that contains an array of the documents that match what you're joining on — we'll look at an example to make that clearer. Second, the collection you're joining against can't be sharded at this point. I won't go into the details, but if you shard your data and you're considering $lookup, keep in the back of your head that there are some restrictions there, and you may or may not be able to use $lookup in those cases.

The example I want to look at is some hypothetical sensor data. I have a collection called data that has a bunch of values, plus a key identifying the sensor, and another collection called keys that maps that internal ID — 0 through 2 — to the name of a sensor: East meter, Central meter, and so on. If I want to average up the values from these sensors, but instead of outputting a 0 or a 1 for the sensor I want to go into the keys collection and look up the name of the given sensor, I can use $lookup and a few other aggregation operators to do that.
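The two collections in that example would contain documents shaped something like this — the field names k and v are assumptions (the captions only say there's a key field and some values), and the third sensor name is garbled in the captions, so only two are shown:

    // data: one document per sensor reading
    { "_id" : ObjectId("..."), "k" : 0, "v" : 12.7 }
    { "_id" : ObjectId("..."), "k" : 1, "v" : 9.3 }

    // keys: maps the internal sensor id to a human-readable name
    { "_id" : 0, "name" : "East meter" }
    { "_id" : 1, "name" : "Central meter" }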
So let me talk through this example quickly. Here's the $lookup: I'm saying I want to go into the keys collection, take the local field k — the key on the input documents — and map it to the _id field in the keys collection; so I want to match this one to that one over there. Then I unwind on name, the field the join results come back in, because it comes back as an array. Hold on a second — I know we're running short on time, but this is actually pretty important, so I'm going to pull up an example really quickly.

Hopefully you can see my screen. I've got that same aggregation here in an editor, and I want to show you quickly what happens. All I'm doing in the editor is commenting out all the other stages in the aggregation, because I want to see what I get if I run just the $lookup over the raw data collection. So if I run this script with only the $lookup, let's look at the output: for each input document in data I get back a key called name — because that's what I specified for "as" — but it's an array of all the documents in the keys collection that match, because there could definitely be more than one. It just so happens in my example that I matched on _id, so there will only ever be one, but that's not the case in general when you're doing lookups into another collection. You always get back an array, which is why pretty much always — maybe not every single time, but almost always — you'll want to do a $unwind right after your $lookup. That's basically what I wanted to show. Then, if you add back in the other stages that do the grouping and calculate the average, you get output like that. Let me go back and unshare my screen — OK, we should be back here.

I think I'm going to skip this next part since we're getting to the end and I have a couple more things to say, but there's a question: Ken asks whether $lookup really relies on indexes. Absolutely it does. Since I'm short on time I'll skip over the details, but I had an example here using $lookup to do a self-join — if you're interested in graph-style problems, like finding friends of friends, here's an example of using $lookup to do a self-join — and I have a slide that shows which indexes matter. The slides will be available so you can look at the details, but here's the result of running that friends-of-friends aggregation: without indexes it takes 48 milliseconds. I've been grabbing this output from the mongod log — I ran db.setProfilingLevel(0, -1) to set the slow-operation threshold to -1 milliseconds so that everything goes to the log. So: 48 milliseconds without indexes; I add an index on friends and name and get down to 2 milliseconds — a big difference. There are a lot of other statements in the log, but without the indexes you see tons of collection scans (COLLSCAN), and with the index you see IXSCAN.
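Written out, the sensor join walked through above looks roughly like this (v is the assumed name of the value field, as in the document sketch earlier):

    db.data.aggregate([
      { $lookup: {
          from: "keys",               // collection to join against (can't be sharded in 3.2)
          localField: "k",            // field on the incoming data documents
          foreignField: "_id",        // field on the keys documents
          as: "name"                  // result: an ARRAY of matching keys documents
      } },
      { $unwind: "$name" },           // almost always follows $lookup, to flatten that array
      { $group: {
          _id: "$name.name",          // group readings by the sensor's human-readable name
          avgValue: { $avg: "$v" }    // and average the readings
      } }
    ])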
So indexes definitely matter. Quickly: there are tons of other operators in 3.2 for doing different mathematical things, as well as some more array operators, and all of this is online in our wonderful MongoDB documentation, as I'm sure most of you are aware.

So I'll wrap up and then get to the questions. Can you do analytics in MongoDB? Absolutely — and really complicated ones. The one thing to keep in mind when you're building your aggregations, though, is that even when they're very complicated, with ten or more stages, you just want to concentrate on each little stage. That little trick I did of commenting out stages can really help you debug things: remember that the output of one stage is the input to the next, so if you're not getting the results you expect, comment things out and diagnose where it's going wrong. OK, I'll take a look at some questions now, and I also want to thank everyone for attending — if you have a hard stop, feel free to drop off, and have fun using the aggregation framework. There will also be a survey available; we really value your feedback, so if you have a moment please take a second to fill it out.

There's a question from Justin: "sharded just means you have a size limit on your collection, right?" No — by sharded I mean you're actually running a sharded cluster, where you have multiple mongods and a given collection's data is spread across them; it's the way you scale MongoDB horizontally. So not a size limit per se, but an actual sharded collection. There's a question about whether you can use the aggregation framework if you use GridFS. I'd need a little more clarity on what you're trying to do, but at the end of the day GridFS stores documents in collections, so I don't see any reason you couldn't use the aggregation framework over them; I'd just need a bit more detail to give a proper answer — sorry about that.

David asks whether it's possible to group things into an array. I'm not precisely sure what you mean there, but a couple of points: the value for _id in a $group can itself be a document, so you can group on multiple keys. A good example: say you had a data set of zip codes, with a key for city and a key for state, and you wanted to count how many zip codes are in each city. You could do a $group with _id: { state: "$state", city: "$city" }, which makes a unique group for each unique city/state pair — that's how you group on multiple keys. If you want to output an array, you can use the $addToSet and $push accumulators. I hope that helps.
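As a sketch of that multi-key grouping answer — the zips collection and its field names here are hypothetical, just following the example as described:

    db.zips.aggregate([
      { $group: {
          _id: { state: "$state", city: "$city" },   // compound group key: one group per city/state pair
          zipCount: { $sum: 1 },                      // how many zip codes fall in that city
          zips: { $addToSet: "$zip" }                 // optionally collect the codes themselves into an array (set)
      } }
    ])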
Scott asked during the webinar how MongoDB Compass fits in with the aggregation framework. It uses the $sample feature to do its sampling of your data — that's the simplest way I know of to describe how it fits in. Will the data become available with the slide deck? Yes — there are different ways the slide deck gets posted, but some of this material is already up on a GitHub site, so I'll add a link to that in the slide deck and post some of the examples there so everybody can grab them.

Sunil asks about using multiple fields for the _id in a $group — yes, I just answered that: absolutely.

Jean asks how the aggregation framework compares with Hadoop MapReduce. That's a fairly large question. I will say that you may be familiar with MongoDB having its own map and reduce features, and the aggregation framework was really built to be much, much better than those. If you Google around, there are a few blogs that take the same problem, solve it with JavaScript MapReduce in MongoDB and then with the aggregation framework, and the aggregation framework performs many times better. That isn't specifically about Hadoop, which is an entirely different technology, but if you're doing aggregation- or MapReduce-style work in your applications today using MongoDB's MapReduce, you should probably rethink it and use the aggregation framework.

I'll do a couple more questions. There's a question: suppose there are millions of records — will aggregate load all the records first and eat all the RAM? You do need to be careful there. It processes things in stages, so yes: just as a big query that does a full collection scan over a large data set can trash your working set, aggregations can do the same thing. You need to make sure you have the right indexes, and if you're doing lots of sorting and similar operations you probably want to use the allowDiskUse: true option as well. You will actually get an error if a sort can't be completed because it runs out of memory, so you'll at least be able to detect that.

OK, one more question, from Justin: can you use the $lookup operator more than once in an aggregation? Yes, you definitely can, and that's kind of the beauty of the whole pipeline design — all the stages are independent of one another. The input at the beginning is the collection, and from then on the input and output flowing down the pipeline is just a set of documents; the stages don't really know anything about what happened before or what happens later.

I think that's probably good for the questions — we're about ten minutes over — so I want to thank everybody again for joining. We'll get all the samples and the slides out to everyone on our website. I appreciate your time; have a good rest of your day or evening.
Info
Channel: Miloš Matić
Views: 28,966
Rating: 4.8362575 out of 5
Keywords: mongo, mongodb, nosql, aggregation, framework, sql, analytics, reports
Id: PrBSuxXURYs
Length: 67min 45sec (4065 seconds)
Published: Tue Jan 10 2017