MongoDB + Python #2 - Schema Validation, Advanced Queries and More

In this MongoDB with Python tutorial, I'll be continuing from where I left off in video 1 and showing you some advanced MongoDB features. Specifically, I will discuss schema validation, bulk inserting, data modeling, and advanced queries, and introduce you to a great module called PyArrow. If you've not yet watched the first video in the series, please do so by clicking the link in the description or the card in the top right of your screen. Now, I will mention that MongoDB is the sponsor of this video, and their team has helped me come up with this fully guided tutorial for all of you. Lastly, if you haven't already claimed your 25 dollars in free MongoDB Atlas credits, you can do so by clicking the link in the description and using code mkt tim. Without further ado, let's get into the video and learn about some more advanced features of MongoDB.

All right, so I'm here in Visual Studio Code. I have my MongoDB extension installed (I highly recommend you install that if you don't have it), and I have some code here that we wrote in the previous video that just connects me to my MongoDB cluster. The database I'm going to use here is production. Again, if you haven't watched that previous video, please go back and watch it; it's going to give you some fundamentals and a basis for what I'm going to explain here.

For this video, I need some kind of example and some data to work with. The example I want to create is one where we have books, which will be stored in one collection, and authors, which will be stored in another collection, and then we'll have a reference between the book and the author so we know which authors wrote which book. This way we can have one author writing multiple books without replicating that data, and I'll talk about why we made that design decision as we start creating this. However, the first thing I want to do here is talk to you about something known as schema validation and how that works in MongoDB.
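As a rough picture of the layout just described, the sketch below shows one author document and two book documents that point back to it. The names, titles, and integer `_id` values are placeholders made up for illustration (a real collection would normally use ObjectId values); this is not the tutorial's actual data.

```python
# Hypothetical documents illustrating the book -> author reference design.
# Plain integers stand in for the ObjectId values MongoDB would generate.
author = {"_id": 1, "first_name": "Ada", "last_name": "Example"}

books = [
    {"_id": 10, "title": "First Book", "authors": [author["_id"]]},
    {"_id": 11, "title": "Second Book", "authors": [author["_id"]]},
]

# The author's details live in one place; each book only stores the id.
authors_by_id = {author["_id"]: author}
for book in books:
    names = [authors_by_id[a]["first_name"] for a in book["authors"]]
    print(book["title"], "written by", names)
```

Because each book stores only the author's id, a second book by the same author adds one small reference rather than another copy of the author's details.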
Let me open up the documentation. I will reference quite a bit of documentation as we go through this video; I'll leave it in the description, and you can feel free to read through it and get some more details if you'd like.

All right, so we have schema validation. Essentially, this is a way of creating some type of structure in your MongoDB database. Typically, when you insert a document into a collection, there's nothing that enforces that it has to have a specific field, and when you create a new collection, you don't have to specify what columns you're going to have, what's required, or what the type of everything is. You literally just create the collection, and it allows you to insert any type of document that you want. The documents you insert could be completely different; that's one of the great benefits of MongoDB. However, sometimes you want to use MongoDB and also have some type of enforcement on the data that's being inserted into a collection, and this is where schema validation comes in. What this allows you to do is, similarly to a SQL database, set up some predefined columns (or in this case fields) and require those whenever you insert a new document. So when we're talking about our book, for example, we may require that you have a title, an author, and a published date. All of these fields must be on the document to insert it; otherwise it won't let you insert it.

You can read through this documentation if you'd like to see exactly how this works. You can see here that when you create a collection, you have the option to pass a validator, and this validator allows you to specify the schema of the collection. So we can have things like different properties; the properties are really the fields of the document, and here we have name, year, major, gpa, address, etc. We specify all of the different types here by using bsonType, we specify our descriptions, there's all kinds of other stuff that we can do, and, yeah, we set up our schema
validation. So let's actually do this here in MongoDB. I'm going to just copy in a validator, because I don't need to write all of this out; it's not really that helpful for the video. So this is one: we have a book validator. We start with a $jsonSchema, and we say the bsonType of the schema is going to be an object. Then we have our required. For required, these are going to be the fields that we need to have on our documents, and for properties, these are all of the fields and their corresponding types. So I have my title: it's type string, and it has a description. We have our authors: this is going to be a type array, and since it's an array, we need to specify the type of the items. So I'm saying the items in this array are going to be of type objectId, so each one is going to reference a document from the author collection. We then have our description. We then have a publish date, type date. We have type, and this is a field which is an enum, which means we can have either fiction or non-fiction as the two allowable values. And then we have copies: minimum value of zero, the type is int, and then we have our description.

If you want to write a more advanced schema validator, again, reference that documentation; of course, I can't go through everything you can possibly do. If you want the BSON types, you can find those in the documentation too, but anything that you would think is a type probably is one, and you can likely guess what the types are.

Okay, so now that we have our validator, how do we actually use it? Well, the first thing we need to do is create a collection called book, and once we've created that collection, we can modify it and add this validator to it. So what I'm going to do here is say db (and actually it's not going to be db, it's going to be production, because that's the name of my database), and then I will create a new collection, and I can do
that using the create_collection function, like this. I also could just write dot and whatever the name of the collection is, but then I would need to insert something in there for it to be created. This way I can just specify what I want to call this, so I will call it book. I'm going to put this in a try/except block just to ensure that if I run this code multiple times, I don't get an error where I try to create a collection that already exists. So I'm going to say except Exception as e, print e.

Okay, so now that I've done that, I can modify this collection using the following command. I can say production dot command, and then I can pass a MongoDB command. This command is going to be collMod. I'm going to pass the collection that I want to modify, which is going to be book, and I'm going to pass a validator equal to the book validator. That's going to add the validator, and then when we insert documents, our schema will be enforced.

So let's try running this code right now and see if this works. Let's run, and let's see what we get... and we get an error. Let me have a look at what this error is, and we'll be right back.

All right, so as you saw, we were getting this error saying the user is not allowed to do the action. It's actually a good error to run into, because I can show you how to fix it. There are two steps. We need to go to MongoDB Atlas, which I already have open, and click on Database Access once we're in our project. Then we need to go to our user and change its permissions so it has admin permissions. The way we do that is we click on Edit, go to Database User Privileges, select Built-in Role, and then change this to be Atlas admin. To perform the command that we're trying to perform, you need to be an admin user, which makes sense. We could also create a new user, we can set up custom roles, all kinds of
stuff we can do in here. For now, though, that's all we need to do for our current user. Okay, so I've changed that for our user. Now that we've done that, there's only one more step, and it involves this connection string right here. We just need to make sure that when we're authenticating, we're authenticating as an admin. This doesn't happen by default, so we need to add the following query parameter at the end. We're going to put an ampersand, because we have an additional parameter here, and then we're going to say authSource is equal to admin. Again, that's going to authenticate us against the admin database and give us access to the command that we're trying to use.

So let's now save and re-run this. Let's clear the screen, and hopefully this will work for us this time. And this says collection book already exists. Okay, so that means I've already created the collection; that's fine. However, the production.command call here should be working: we should now have modified the collection, since this wasn't giving us an error, and that is all good.

Now, if we want to verify this is indeed working, let's go to MongoDB Compass. Let me refresh here. We should have a production database, which we do, and if I go inside of it, you can see that I have my book collection. Awesome. We can go to Validation, and there we can see that we have the JSON schema, or the schema validation, set, and we can change the action as well as the level for validating the schema.

Okay, so that is good for now: we've created our book collection. Now that we've done that, we want to create our author collection as well. So I'm going to put this inside of a function: I'm going to say def, and this is going to be create_book_collection, and we'll indent all of this just so we don't repeat this code multiple times. And now we can make a new function, which is going to be create_author_collection, and after we do this we'll insert some
sample data. So for now, let's copy our author validator; I'm going to put it right here. Okay, so we have our $jsonSchema and bsonType, same thing as before, so I won't go through this again. Now that we've done that, we want to create the author collection and then add this schema to it. So we're going to say try, and this is going to be (not db, but) production.create_collection, and we're going to create author. We'll say except Exception as e, print e, and then we'll run that same command that we did before, so this is going to be production.command, collMod, author, and then the author_validator, like that. Okay, so I think that's exactly the same as this; I just want to make sure everything looks good. Looks good to me. So now we can call this function, create_author_collection, and let's run this, and hopefully everything will work fine. I don't get any errors, no output; that is good. If we want to verify this is working, we can go back here to Atlas, and now we see that we have author, and if we go to Validation, we can see that we have this. I guess we could have a look at Schema as well, but there's nothing in here, so we can't really see anything right now.

Okay, so that is all good to me: we have author, we have book. Now that we've done that, we need to add some sample data. So let me stop this function call, and let's create a new function here; let's say create_data. In here we need to add a bunch of books and then a bunch of authors, or, I guess, we're going to add our authors first and then our books, because our books rely on the authors. So let's do this. I'm just going to copy in some sample data that I have for authors, and now I can show you how we perform a bulk insert. I believe we saw this in the last video, but it's important to mention again because this is a more efficient way to insert data than individually inserting every single one of these entries. So let me slow down for a second here and really go
through what we're about to do. First of all, we have this dt. That is actually imported from datetime, so I need to import that at the top here: I'm going to say from datetime import datetime as dt, just so I can get a datetime object very easily. Okay, so now we have that; that's what we're doing with our dates and times. We then have our first name, last name, and date of birth, and if we go back to our author schema, we notice those are the only three required fields, so it's fine for us to insert those. What I'm going to do is insert all of my authors; once the authors are inserted, I'm going to grab the ids of these authors, and I'm then going to use them to create my books, because the books rely on which authors are writing them. We're going to use, again, this reference-type relationship as opposed to embedded documents, and I'll talk about that more in a second.

For now, though, let's insert this data. So what we can do is say author_collection is equal to production.author. We already know how to do the bulk insert: it's going to be author_collection.insert_many, and we can just pass a list of the authors that we want to insert. What I'm going to do, though, is get all of the ids of these inserted authors, so I'm going to say inserted_ids is equal to this, and then this is going to be dot inserted_ids. Actually, let's call this author_ids. Actually, we can just call it authors for now; that's fine. Even though the list is already called authors, we'll just override it, so now we'll store a list of the inserted ids of all of these elements. Okay, so that's how you do the bulk insert. Again, it's much more efficient than individually inserting, so if you are inserting a lot of stuff, make sure you use insert_many as opposed to insert_one.

Okay, now that we have done that, we need to create the sample data for our books. Again, that relied on
the authors, so let me copy this in. Okay, so these are our books; let's make sure the indentation looks good. What I'm doing is just accessing the indices of the inserted ids from my authors for which author wrote which book. So here I have one, MongoDB Advanced Tutorial. This one, we're going to say, is written by me, so I'm grabbing authors 0, which is going to be the id of this inserted document. I'm putting this inside of an array, or a list, because I could have multiple authors. I then have the publish date (I'm just saying that's today, so datetime.today), type nonfiction, copies 5. All of these have the fields that are specified in our schema validation. Hopefully that makes a bit of sense; it's not super important, those are just what our books are.

Now we need to insert our books. To do this, we're going to say book_collection is equal to production.book. Then we need to insert all of these. I don't really care about the ids this time, so I can simply say book_collection (not update, but) dot insert_many, and I'm just going to insert my books like that. Okay, so that's it for creating our sample data, and that gave you a little taste of the bulk insert. Now what I can do is say create_data, like that (I think that's what I called the function), and when I run this, everything should be good. So let's run the code. Notice that I don't get any errors, and if I go back to MongoDB Atlas and give it a bit of a refresh, we can see that we have our different books, and we have an array of authors, which reference the object ids of the authors from the author collection.

Okay, so now that we've done this, let's take a quick pause and talk about data modeling in MongoDB, and why I made the decision to put the authors in a separate collection, because there are arguments to be made for embedding them versus having a separate collection. So I'm here again on the
MongoDB documentation. I'm just looking at data modeling examples and patterns, and I want to quickly go through how you decide whether you prefer embedded documents or references, because that can be difficult to figure out and can have a large impact on performance. Typically speaking, when you have a one-to-one relationship (say, in this example, one address for one person), it's totally fine to store that address as an embedded document inside of the person. That makes sense because typically when you're retrieving the person, you're going to be retrieving the address. If there's only one address per person, or maybe multiple addresses per person but they're specific to that person, it makes sense to store it inside.

However, if you were to have one address that was associated with, say, five or six people (or any one object, for that matter, that multiple people are going to have a relationship with), so you have a one-to-many or many-to-many type relationship, then what you usually want to do is take that document (let's just use address for this example), store it in a separate collection, and then make a reference to that address from all of the different documents that reference it. So all of the people that share an address would have a reference to that address. The reason you may prefer to do this is that it's going to take less space, because now you don't need to duplicate it on multiple people; it's in a separate collection. It also means that when you make a modification to this address, since it's stored in a separate collection, anyone referencing it will automatically see that modification, whereas if you had it embedded in four, five, or six different documents, you would need to change it in every single one of those documents to keep your data consistent. So those are two of the main things you
kind of want to keep in mind: is the document that I'm embedding going to be referenced by multiple different documents, or is it only referenced by one? If it's only referenced by one, totally fine, go ahead and embed it. If it's referenced by multiple, you can still make an argument to embed it, but you definitely want to consider the two things that I just talked about. Hopefully this is giving you a bit of a sense of why this can be a difficult decision.

Now I want to talk about something called the subset pattern. Oftentimes when you're retrieving a document, there are other documents, either embedded or referenced, that you want to have a look at. It can be very time consuming and slow down your server if you're grabbing, say, every single reference for a specific document every time you grab it, or every single embedded document every time you retrieve that document. In this example here they have a bunch of movies, and it can be very time consuming to grab every single piece of detail about all of these movies every time. So the subset pattern is saying that what you should do is store a subset of the information, the part that is important and that you're frequently retrieving, in a document, as opposed to storing all of it, and store the additional data somewhere else, where it can be retrieved when you need it. That's essentially the premise; you can read through here and see exactly how they would implement it. The idea is that if you store a slimmer version of a document, it's much faster to retrieve: you can get all of the data that you need very quickly, and if you do require the additional data, you can always ask for it, as opposed to always grabbing every single piece of data, which, again, can be slow and time consuming. If we read through this, it says here that currently the movie collection contains several fields the application does not need to show a simple overview of the movie, such as full plot,
rating information, etc. So instead of storing all of the movie data in a single collection, you can split the collection into two collections: you have movie and movie details, and then you only grab the movie details when you actually need them. So that's the subset pattern and what this is talking about here. Now, there are some trade-offs in using the subset pattern. It can require multiple calls, or multiple query operations, and it can require you to do some type of join operation, which I'll show you how to do later on; it's something you just need to consider when you're making this design decision.

Now, this was more on one-to-one relationships, so when you have one document referencing another document in some way. However, now we have one-to-many relationships. In a one-to-many relationship, you're going to have one document that references many different possible things. In this case we have the embedded document pattern, we have the subset pattern, and we also should have here the reference pattern.

All right, so here on this page we're looking at modeling one-to-many relationships with embedded documents. This is going to be similar to what I talked about before, but I just want to quickly read this to you. It says embedding connected data in a single document can reduce the number of read operations required to obtain data; in general, you should structure your schema so the application receives all of its required information in a single read operation. So that is for efficiency purposes, and this is stating why you would use this embedded document pattern, where, when you have, say, multiple addresses as I've talked about, you store them in an array of addresses. This is totally fine; it's going to make it very quick to retrieve the addresses. However, again, if you were sharing these addresses with other people, it might make sense to put them in a separate collection because of the reasons I
stated. Now, though, we also have the subset pattern. The same thing can happen here as before: when you have a ton of nested documents, that leads to a large document in general that you need to retrieve, because you have to retrieve all of the embedded or nested ones. Sometimes it can be common to just store a subset of, in this case, say, reviews, as opposed to every single review, and then grab all of the reviews when required. I won't go through this; we already talked about it before, and this is just more specific to a one-to-many relationship. Okay, let's move on to the next one. Again, you can read all of this from the links in the description.

Continuing here, I'm looking at modeling one-to-many relationships with document references. This is a great example of, again, why we want a reference. When we have a publisher and we're treating it as an embedded document, this publisher is likely going to publish multiple books, and that means I'm going to have this publisher embedded in multiple documents, and I'm going to continue to repeat and duplicate it when I really don't need to. So rather than doing it like this, we're going to create a reference. Now, they've done it the opposite way to our example: they've made a reference here called books, and books is essentially an array that has all the ids of the books that this publisher has published. We did it the other way, where we had an id on our book that pointed to the authors (or ids, because it was an array). Either way is fine, but again, this avoids the kind of duplication that we had before and makes it easier for us to, say, modify the founded date or the location, or add a field to our publisher, without having to do it in a ton of different places.

So that's really all I wanted to show you here. That's an introduction to data modeling. Of course, as your database grows larger, there are more design decisions that you have to make, but this gives you the core choices that you have.
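To make the trade-off concrete, here is a small pure-Python sketch of the shared-address case discussed earlier. The dictionaries only stand in for MongoDB documents, and every name and value in it is made up for illustration.

```python
import copy

# Embedded: each person carries a private copy of the address.
address = {"street": "1 Old Road", "city": "Springfield"}
people_embedded = [
    {"name": "Ann", "address": copy.deepcopy(address)},
    {"name": "Bob", "address": copy.deepcopy(address)},
]

# Referenced: the address is stored once and people hold its id.
addresses = {100: {"street": "1 Old Road", "city": "Springfield"}}
people_referenced = [
    {"name": "Ann", "address_id": 100},
    {"name": "Bob", "address_id": 100},
]

# Changing the street with embedding means touching every copy ...
embedded_writes = 0
for person in people_embedded:
    person["address"]["street"] = "2 New Road"
    embedded_writes += 1

# ... while the referenced layout needs a single write.
addresses[100]["street"] = "2 New Road"
referenced_writes = 1

print(embedded_writes, referenced_writes)
```

The flip side is on the read path: with references, showing a person's street needs a second lookup by `address_id`, which is the extra query (or join) cost mentioned above.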
You can use embedded documents, you can use the subset pattern, or you can use a reference, and those are what you're going to be using in MongoDB. You have to make your decision and lay out your arguments for which one you would prefer based on the type of program you have and how you're retrieving information.

With that said, let's go back now and actually start writing some advanced queries in MongoDB using some of the sample data that we created. All right, so for the majority of the rest of this video, what I'm going to do is write some advanced queries and demonstrate how we perform multiple operations and do a lot more advanced stuff that we haven't yet seen. Now, if you're really a beginner and you don't want to write these super advanced queries, you can always just grab all of the documents from a collection, for example, and then use Python to parse through them; you can look at the fields manually and write your Python code. That's not the optimal way: it's not going to be very efficient, and you're going to be retrieving a ton of data that you don't need, which obviously is not ideal. But I just want to mention that there are times where maybe it's too challenging or difficult for you to come up with the query on your own using the MongoDB query language. If that's the case, write as good a query as you can, and then use Python to parse and modify the results as you see fit.

Anyways, I will show you some advanced queries. Of course, there are a ton more; everything will be available from the links in the description and the documentation. What I want to do now is retrieve all of the books that contain the letter a. It's kind of a random thing, but it's something you may want to do, especially if you were, say, searching for something, or you got a search string that needed to return some books. I'm going to show you how we use regular expressions here, so I'm going to just
write a variable called books_containing_a, and this is going to be our production.book.find. For the query, we need to put the field that we want to query on, so this is going to be title, because I want to see if a is contained in the title. Then I'm going to use this operator called $regex (dollar sign, then regex), standing for regular expression. For the regular expression, I can just use a and then, inside of curly braces, put a 1, and this specifies that we're looking for something that contains at least one a. Okay, so now that I've done that, I can print this out. I can say printer (actually, I don't know if I have that defined, but we'll look at that in a second) dot pprint, and I can print the list of books containing a. Now let me just come up here: I need to import pprint, which I have, and I'm just going to create my pretty printer, so printer is equal to pprint.PrettyPrinter. Now I can use that to get nicely formatted output.

Okay, so let's run this and see if we get what we're looking for. When I do this, notice I get a list containing the books that have the letter a in the title. So of course we have Advanced here, and we have The Great Gatsby (of course, we have an a), and we have another a here. Great. So there we go, that was the first query.

Now, for these operators that I'm going to use, there's an entire list in the documentation. I'll go through a bunch of them, but quickly let's bring up a list of some different operators, just so that we can have a look at some that we can use. Okay, so I've just clicked into Operators here from the side, and we have some options: we have aggregation, update, and then query and projection. Right now we can look at query and projection, where we have equals, greater than, greater than or equal to, in, less than; we have logical operators, element operators, evaluation operators, geospatial, array, bitwise, projection, all kinds of different ones.
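The regular-expression filter from the first query can be sketched as follows. The filter dictionary is what would be passed to `find`; since running it needs a live cluster, the snippet instead applies the same pattern with Python's `re` module to a few made-up titles, which should behave the same way for a pattern this simple. Note that, as an unanchored search, `a{1}` matches the same titles as plain `a`, and the match is case-sensitive.

```python
import re

# Filter as described in the video; on a live connection it would be used as
# production.book.find(books_containing_a_filter).
books_containing_a_filter = {"title": {"$regex": "a{1}"}}

# Local stand-in: apply the same pattern to hypothetical titles.
pattern = re.compile(books_containing_a_filter["title"]["$regex"])
titles = ["MongoDB Advanced Tutorial", "The Great Gatsby", "Moby Dick"]
matching = [t for t in titles if pattern.search(t)]
print(matching)
```

Checking the filter shape locally like this is only a convenience; the real query still has to be run against the collection to see which stored titles match.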
Then we can even continue down here and go to update operators, where we have ones like current date, increment, min, and max. I just want to show you this so you know that there is a lot more, and anything you want to do, you can likely do by using these types of operators.

Okay, back to this, though. Let's write another query. The next query is going to be a join operation, where what I want to do is grab every single one of my authors, but I want to have a field on my authors that contains all of the books that they wrote. So how do we join these two collections together and get our data aggregated? This is going to be authors_and_books is equal to, and this will be production.authors, not dot find but dot aggregate. What aggregate allows us to do is pass a pipeline of different operations that we want to perform in sequence. So you can perform one operation that gives you a result, and then the next operation will be performed on that result. You can chain, or aggregate, multiple operations together, and that allows you to write more advanced queries. However, aggregate also allows you to use specific operators that can only be used in an aggregation, so that's why we have to use it here.

In this case, what I want to do is use lookup. I'll go to the documentation and show you what this does. All right, so I've just pulled up lookup here (it just took me a second to find this), and the definition says it performs a left outer join to an unsharded collection in the same database to filter in documents from the joined collection for processing. So obviously that's a bit of a mouthful, but essentially this allows us to join two collections together. The left outer join is exactly what we want: we'll grab every single element from the original collection, which will be author, and then any matches, if there are matches. Okay, so let's have a look
at how we use lookup. In lookup we start with from (not form, but from). This is going to be the collection that we want to look up in, or that we want to perform the join with, so this is going to be book. We'll use book because (sorry, this needs to be author) we're aggregating on author here; if we had book here, then we would have used author there. Okay, after the from, we need to have our localField, and this is going to contain the id, or whatever we want to match with the foreign field from our book collection. So the local field is the field on author that we're going to look at to perform the join, so this is going to be _id. Then the foreignField is not going to be id; this is going to be authors. Now, MongoDB will automatically handle the array for us, so we don't need to worry about that. But again, for the foreign field we go with authors, because on the book collection the authors field is what contains the ids of the authors, and _id is, well, the id of the author that we're looking for in the foreign field. Hopefully that makes a bit more sense; I can't really explain it more than that. And then lastly we have as, which is going to be the name of the field that's added to the author documents that are returned, and that contains all of the books. That's actually all we need to perform the join.

So let's just print this out. I'll copy this line here (we can comment those out for now, we no longer need them), and this will be authors_and_books. Okay, let's run the code and see the result that we get. Let me clear the screen, and notice that I get all of my authors. So I have the id of my authors, and then I have books, and this contains embedded documents of all of the books; that's what our aggregation operation did. So we have this book, another book that Tim wrote, date of birth, first name, last name, etc. Continuing, we have the next books, date of birth, first name, last name. You get the idea. There you go, that is how you perform a join
operation in mongodb using lookup actually fairly straightforward and again mongodb handles the array for us so if there were multiple authors then it would give this field on all of the authors that wrote a specific book okay let's comment that out and let's continue now and write another query now this query actually i shouldn't comment this out is going to be the same as this but we're going to perform some more steps in the aggregation so let's put this down here and rather than authors and books we are going to get authors and a count of how many books it is that they wrote so we're going to say authors underscore book underscore count so the first step in my aggregation here and there's multiple ways to do this this is just one method is i want to join the two collections together so for every author i want to get a list or an array containing all of the books that they wrote that's great that's the first operation that we perform now let's just make this look a little bit cleaner so it's a bit easier to write in okay so this needs to get indented and now we can put our next aggregation operation here okay so this is the first operation that we perform the next operation that we perform is going to be dollar sign add and then fields like that i think this is intuitive as to what this is going to do but it's going to add a new field and the new field that i want to add to all of my results here so all of the documents from this is going to be total underscore books and i'm simply going to say here that this is going to be the operation of dollar sign size and then i'm going to reference and this is going to be books like that and i need to make sure i have my dollar sign here so i'm adding fields i could add multiple fields in this case i'm just adding one the field is total books and the value of this field is going to be the size of whatever the value of books is for each author that we're adding the field for that's how that works that's how you add fields now after we add fields though
i want to perform a projection so i only get specific fields in my return values because i don't need everything so for here i'm going to do an operation this operation is going to be project so in the aggregation pipeline this is how you pick specific fields that you want returned and then i'm going to specify all the fields that i want for my author so the field i want is first name i want the last name as well so this will be a one i want to get my total books field which i just created in the last step so i'm going to say total books equal to one and i'll grab the underscore id and i'm just going to manually set this to be zero so that we don't get the id okay now that we have that we have an aggregation pipeline that does three operations so we look up we add a field and then we project and again this will be performed on kind of the last result so we have this result this gets performed on that and then this projection gets performed on all of this or whatever was returned i guess from this step then what we can do is simply print this out so i guess i can just copy this again and rather than authors and books this will be author is it authors yeah authors book count okay so let's run that and let's see what we get here and notice that we have our author total books author total books author total books and the final author and their final number of total books okay awesome so now that we have done that let's go and do some more advanced queries so we just got the authors and their number of books the next thing that we can do is grab all of the authors uh and their books but only for authors that are a certain age so in this case let's do it so that we're grabbing authors that are between the age of say 50 and 150 years old you'll notice that if we go back to our authors here we have some that were born quite a while ago so they would be quite old and we're not going to retrieve those so again what we want to do is grab the authors and all of their books but 
only authors that are within a certain age range this is actually much more difficult than it looks but of course i will show you so i'm just going to grab this query that we wrote already i'm going to paste it here uncomment it and change it a little bit but we will start with this lookup operation so actually i lied we're going to start from scratch here and i'm just going to say books with old authors is equal to production dot we're going to use book this time and then not find but of course dot aggregate and inside of here we're going to have a list of the operations now the first thing i need to do is for every one of my books i need to find their authors and i need to determine the age of the authors because what i want to do here is essentially filter and only return books that have an author that's within the age of 50 to 150 years old so this is fairly complicated because if we go and we look at our authors here we notice that we don't have an age field we have a date of birth so we actually need to perform a calculation to determine what the age is we then need to know the age of all of our authors and we need to only select books that have authors that are within that age range so the first thing we need to do is join our books with the authors that we have the author data then we need to essentially loop through the authors set the age of those authors and then we need to filter so we only have the correct books with authors that match the age range hopefully that makes a bit of sense but let's go through this so let's go with lookup as our operation because we're going to perform a join and notice this time we're looking up from the author collection because we are querying from book so we're going to say from and then author i don't know why it keeps giving me validator but that's fine and then we'll go with our local field and the local field is going to be the authors okay because we're talking about the local field on the book so we have authors 
and then the foreign field will actually be the id of the author so underscore id kind of the opposite of what we had previously and then as well this time it's not going to be books it's going to be authors because we're adding this to our books okay so now we have our lookup operation done the next thing that we need to do is set the ages of our authors so i have this authors field now and actually what i'm going to do is replace the values in this author's array because we could have multiple authors with the age of the authors so this will look a little intimidating but i will explain what we're doing we're going to start with the operator set now this actually replaces the value of an existing field so we're going to change in this case the authors field and the issue here with authors or i guess a fact about authors is that this is an array so it's not as simple as me just grabbing the date time uh that the author was born and doing the math with that i need to do this for every value in this array so this is where we're going to use the map operator and we're going to map essentially some operation to every element inside of this array so i'm going to say map and i need to pick the input for the map so for the input we're going to use and this is going to be dollar sign authors we're referencing the field authors now i'm going to have in here and what in is going to specify is all of the fields that i want to have in the array and it will fill those in for every element for my author so i know this is a little bit confusing but you need to specify inside of in what you want to have in the array for each author so i'm going to put an age a first name and a last name because those are the three values that i want so just follow along here we're going to start with age now to calculate the age we need to use another operator and this operator is going to be a date difference operator so it's going to be date diff like that but of course with our dollar sign 
because it's an operator and for date diff we need to pass a start date and for the start date we're going to do two dollar signs and we're going to say this dot date underscore of underscore birth now this is referencing the current element and that's going to be the current element that we're looping through while we're performing this map operation so you kind of have to visualize this in your head as soon as we see map that means we're iterating over every element inside of this array the input array is well this because that's what the map operation is doing and then in is saying okay well this is what we want to have in the array for each element so i want to have my age my age is equal to the date difference i need to pass my start date which is going to be whatever the date of birth is of the current author that i'm on right now and then we're going to pass the end date and the end date is just going to be dollar sign dollar sign NOW this just references the current date kind of a nice easy variable to use and we want to have a unit here for the calculation and the unit will just be year that's the level of precision that i want right i just need to know how old they are so i'm going to get year and that's calculating the difference between the start date and the end date just looking at the year if we wanted to be more precise we could go with month day et cetera but i don't care for now we'll just go with year that should give us the approximate age and then we add that as the age field okay so after age though we need to have our first underscore name and the first name here is just going to be dollar sign dollar sign this dot and then first underscore name and then we need the same thing for the last name so we'll say last underscore name like that and then of course that's going to be equal to dollar sign dollar sign this dot last underscore name and if you wanted more fields then you could access them with this or you could do the calculation like i did inside of here okay so that's it for this
operation again i know that looks fairly complicated it is but i wanted to show you some more advanced stuff so i figured i'd throw it in now that we've done that we should have the age for every single one of our authors the next thing that we want to do is filter it so that we only get books that are written by authors within a certain age so to do this we use match and essentially what match will do is it will have some query and if this query is true for a document it will keep it so it'll run all of our documents kind of through uh the condition that we're about to put here if the condition is true for the document it keeps it if it's false it doesn't keep it just like when we write a query when we're finding documents okay so i have match here and then inside of here we're going to put dollar sign and because there's two things that we want to check inside of the list here the first thing that we want to do is we want to say authors and then since this is going to contain multiple things inside of it i'm going to say dot age now by default since this is an array and i'm accessing a value from an embedded document inside of this array it will loop through and look at all of them so i'm going to say authors.age again that's referencing the age attribute of each of the authors or any of the authors that we have inside of that array and i want to verify that this is going to be if we go like this greater than or equal to and then i'll just hard code in 50 so the minimum age is 50 and then the next one that we want to have here again is authors.age but this will be less than or equal to 150.
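the three stages described so far (the lookup, the set with map and date diff, and the match) can be written out as a plain python list. this is a minimal sketch rather than the exact code from the video: the function name and its min_age and max_age parameters are hypothetical, while the collection name author and the field names authors, date_of_birth, first_name and last_name are the ones used in the walkthrough. note that dollar sign date diff needs a reasonably recent mongodb server (5.0 or later).

```python
def books_with_old_authors_pipeline(min_age=50, max_age=150):
    """Build the aggregation stages as plain dicts (hypothetical helper name)."""
    return [
        # 1. join: for each book, replace the array of author ids
        #    with the matching documents from the "author" collection
        {
            "$lookup": {
                "from": "author",
                "localField": "authors",
                "foreignField": "_id",
                "as": "authors",
            }
        },
        # 2. rewrite every element of the authors array so it only
        #    holds an approximate age plus the first and last name
        {
            "$set": {
                "authors": {
                    "$map": {
                        "input": "$authors",
                        "in": {
                            "age": {
                                "$dateDiff": {
                                    "startDate": "$$this.date_of_birth",
                                    "endDate": "$$NOW",
                                    "unit": "year",
                                }
                            },
                            "first_name": "$$this.first_name",
                            "last_name": "$$this.last_name",
                        },
                    }
                }
            }
        },
        # 3. keep only books whose authors fall inside the age range
        {
            "$match": {
                "$and": [
                    {"authors.age": {"$gte": min_age}},
                    {"authors.age": {"$lte": max_age}},
                ]
            }
        },
    ]
```

assuming production is the database handle from earlier in the series, this would be run with something like list(production.book.aggregate(books_with_old_authors_pipeline())). one design note: with dot notation on an array the two conditions can be satisfied by different authors of the same book, so for books with several authors an dollar sign elemMatch condition may be the stricter choice.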
so if you're greater than or equal to 50 less than or equal to 150 you are all good and the last thing we will do here just to throw in a nice touch is we'll perform a sort so i'll say the last operation i want to do is sort and i want to sort by what fields well i just want to sort by the age when i do one that stands for ascending so i'm going to sort in ascending order okay now that we have that let's print this out so printer dot pprint the list of books with old authors i believe that's what i called it i'll zoom out so you guys can read all of this at once if you want if you're going to pause the video and have a look at it do that if you'd like let's zoom back in and let's run this code and see what our result is okay so when i run here we do get everything that i expected we get our authors or we get our books written by authors that have uh you know the specified age so first book 1984 age of 119 for george orwell and then lastly here we have our age of 126 and this is the great gatsby and the other books that were written by younger or older authors are not returned because well they weren't in the specified range right and you can see that we get the age field for each author uh as i requested or as i wrote in the aggregation operation all right so that has been it for the advanced query section of this video hopefully that gave you an idea of what's possible in mongodb showed you some of the different operators gave you a decent example to run through here obviously you're not going to have this memorized there's a lot of stuff you need to look up from the documentation but i wanted to give you kind of a good in-depth example and some different examples so you could see how you perform some common operations here and do some more advanced stuff now what i want to end you with here is kind of a quick demo of this awesome library slash module that mongodb has made called pymongoarrow now what this allows you to do is essentially take mongodb data and read
this in as either a data frame so a pandas data frame a numpy array you can also read this in as an arrow table or an arrow object if you're familiar with what that is this is super useful for different data scientists specifically machine learning people or really just anyone using python that wants these specific formats because a lot of times you want to take data and read it in in a specific format right you want a numpy array you want a pandas data frame whatever so this module you do need to install now the command to install it is the following it's going to be pip install and then you need to install all of these dependencies some may already be installed but i just want to run through them to make sure you don't miss any so you're going to install jupyter you're going to install pymongoarrow like that inside of quotations here i'm going to have pymongo and then srv in square brackets although you should have installed that in the previous video and then we want pandas as well if we're going to be using pandas now i guess we might as well do numpy while we're at it since we'll use numpy so run this command should install everything you need i already ran it so i'm not going to go through it again if this command does not work you can try doing a pip 3.
if that doesn't work you can do python hyphen m pip install or you can do python 3 hyphen m pip install and one of those should work for you so i'm going to assume that that has worked and that you've installed that module and now we can start working with it so i'm going to say import and this is going to be pyarrow like this while we're here let's do our other imports so i'm going to say from pymongoarrow dot api i want to import schema i'm going to say from pymongoarrow dot and then monkey we're going to import patch underscore all and then i'm going to say import pymongoarrow as pma okay so those are the four imports that i need for right now again i'm not going to give you a full tutorial here i'll leave the documentation in the description so you can have a look but the first thing we're going to do is we're going to call patch underscore all okay so patch underscore all like that if i could type that correctly and then we're going to specify our schema so quickly though let me just describe what this actually does so patch all essentially makes it so whenever we deal with a collection object here in mongodb it has access to all of the api features we need for pymongoarrow to read this in as a specific object like a data frame numpy array etc so don't worry too much about it but just run patch all to make sure you don't get any errors here and then as i was saying we need to specify a schema so what we do is we specify a schema we then can load our data using this schema and that's what allows us to load it in as a pandas data frame as a numpy array whatever because it can be very difficult if you take data directly from mongodb and try to read it in as one of these formats you usually have to write some custom python code which can be time consuming and not necessarily intuitive to write so this is a really quick shortcut very easy to use so for now i'm going to say
author is equal to and this is going to be schema again we imported that up here and i need to specify as kind of a json or python dictionary here the fields that i want and their corresponding types so i'm going to say underscore id and this is going to be object id but i need to import that from bson so i'm going to say from bson import object id okay continuing here i'm going to say my first underscore name and this is going to be pyarrow dot string now you don't need to install that it should just be installed for you by default pyarrow is a module in python that defines some specific types for you a few other things as well and this allows you to grab kind of the string type right from pyarrow so pyarrow dot string you're then going to say last underscore name and this will be pyarrow dot string as well we then want to get the date of birth so i'm going to say date underscore of underscore birth like that and the type for this i'm going to say is dt so date time remember i imported that all the way up here so i have from date time import date time as dt okay so that's my schema for now if you have some more advanced types you will have to look those up from the documentation for now though let me show you how easy it is to read in our data as a data frame from our database also all my imports sorry they would have just gone to the top because when i saved my auto formatter does that for now though let's say our data frame so a pandas data frame is equal to and then this will be production dot and we'll go with author and we'll just say dot find underscore pandas underscore all like that and then we can pass a query just like we would when we're finding data normally and i'm just going to leave it empty so we get all of them i'm going to say schema is equal to author then i can do something like print df dot head and this will give me a pandas data frame that contains all of my items from the author collection based on this schema now any fields that i have not specified here automatically will just not be included we'll just leave
them out and if i've specified a field here that's not in any of my author documents it will just have a default value of none or null so let's run this though and see what we get and if we get our pandas data frame and it says it's not recognized sorry let me run this again and notice here that we get our data frame now it's a little bit messed up just because we're reading some stuff in as binary the id specifically is kind of a binary object but we get the point this is our pandas data frame we can treat this as a pandas object and i guess it was running my previous query as well so that's why we're getting that information let's just comment this out for now and continue and look at the other ways that we can read this in so rather than just a data frame we can read this in as an arrow table so i'm going to say arrow table is equal to and then this is going to again be production.author dot and then find underscore arrow underscore all and we'll say schema is equal to author and then i can print my arrow table so let's run that and when i have a look at my arrow table here you see i get the arrow table if you're familiar with pyarrow then you know what this means otherwise i guess it's not of concern to you continuing from there we have our numpy arrays so i'm going to say nd arrays like that is equal to and this is going to be production dot author dot find underscore numpy underscore all pass our query pass our schema and then we can print the nd arrays like that and we will get our numpy arrays so let's clear let us run and you notice here that we get our numpy arrays again this is kind of a binary object you would need to deal with that won't show you that right now but you can see we get all the numpy arrays that we request alright so that has been a short demo of the pymongoarrow library slash module hopefully you guys found this helpful and you can see yourself using maybe some of these features specifically if you're going to be reading in or trying to convert
mongodb data to a data frame arrow table or numpy arrays with that said i will wrap up the video here another massive thank you to mongodb for sponsoring this make sure you guys get your free credit from the link in the description using code mkt hyphen tim with that said again hope you enjoyed if you did leave a like subscribe to the channel and i will see you in another one [Music]
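as a recap of the pymongoarrow demo, here is a minimal sketch of the schema and the three find helpers. the helper function names (author_schema, enable_arrow_helpers, load_authors) are hypothetical and only there to keep the example tidy, and the imports are placed inside the functions so the file can be loaded even when pymongoarrow is not installed. the schema fields and the find_pandas_all, find_arrow_all and find_numpy_all calls are the ones mentioned in the walkthrough.

```python
from datetime import datetime


def author_schema():
    """Schema for the author collection (hypothetical helper name)."""
    # imported here so this file loads without the packages installed
    import pyarrow
    from bson import ObjectId
    from pymongoarrow.api import Schema

    return Schema(
        {
            "_id": ObjectId,
            "first_name": pyarrow.string(),
            "last_name": pyarrow.string(),
            "date_of_birth": datetime,
        }
    )


def enable_arrow_helpers():
    """Add the find_*_all methods to pymongo collection objects."""
    from pymongoarrow.monkey import patch_all

    patch_all()


def load_authors(db, schema):
    """Read every author document in the three formats shown in the demo.

    Assumes enable_arrow_helpers() has already been called and that db
    is a pymongo database handle such as the one named production.
    """
    query = {}  # an empty query matches every document
    return {
        "pandas": db.author.find_pandas_all(query, schema=schema),
        "arrow": db.author.find_arrow_all(query, schema=schema),
        "numpy": db.author.find_numpy_all(query, schema=schema),
    }
```

with a live connection this would be used roughly as enable_arrow_helpers() followed by frames = load_authors(production, author_schema()) and then print(frames["pandas"].head()). fields left out of the schema are simply not loaded, as described in the demo.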
Info
Channel: Tech With Tim
Views: 27,161
Keywords: tech with tim, mongodb and python, how to use mongodb and python, what is mongodb, mongodb atlas, mongodb compass, python, what does CRUD stand for, CRUD operations, how to setup mongodb, pymongosetup, mongodb document model, what is the mongodb document model, how to setup mongodb atlas, fix pip, how to fix pip, mongodb cli, download mongodb, what is mongodb compass, what is mongodb atlas, tutorial on mongodb and python, schema validation, bulk inserting, pymongo arrow
Id: nYNAH8K_UhI
Length: 49min 27sec (2967 seconds)
Published: Wed May 04 2022