MongoDB Quickstart with Python and PyCharm

Captions
Hi everyone, and welcome to today's webinar, organized by JetBrains. I'm Paul Everitt, PyCharm developer advocate, and I'll be your host. The topic for today's webinar is MongoDB Quickstart with Python and PyCharm. Today we're very fortunate to have a special guest: Michael Kennedy is well known in the world of Python as an author, speaker, instructor, great guy, probably best known for his Talk Python To Me podcast and its companion, Talk Python Training. And Michael, is it safe to say that the Python Bytes podcast you do with Brian has gotten pretty popular too?

Yeah, it's growing with new users every day, it's amazing. It's great to see people appreciating that other format as well.

Good format for busy people, right?

That's right.

Michael's done over 100 week-long Python training courses, bringing the good word of Python to the huddled masses worldwide. In the past months Michael has shipped two new courses: Mastering PyCharm (kudos to you for that) and also the topic of this webinar, MongoDB Quickstart with Python and PyCharm. Michael, how has the response been?

People have really, really loved these courses. The response has actually been better than expected. It's probably close to 20,000 people who have taken the course that covers similar stuff to what we're about to cover, so yeah, it's pretty amazing.

Wow, 20,000 people. Did you expect that when you made it? That one is free, right? But that's still... yeah. All right, well, let's get going.

Okay, so welcome to MongoDB and Python with PyCharm. We're going to go through and basically create a small application, and we're going to model something that, as people familiar with Python, you should all actually be really familiar with, so I think it'll resonate a lot with everyone. Paul did mention I have a MongoDB Quickstart course; that is not the same content that I'm doing here, so if you happen to have taken that course, you would get something different this time, so
that's good; it's additive, right? Some of the concepts of course are the same, but the demo we're going to build, I made this yesterday. I wanted something fresh for you all, so it's fresh off the presses.

So what are we going to cover? We're going to talk about the tooling I'm going to use, and we're going to talk about MongoDB really briefly: what it is, why you care about it, why it is a decent choice for a database. Then we'll spend most of our time building something fun, and then look back on some of the concepts. If we have time, I'll take questions; we already discussed that a little bit, so feel free to jump in right away. Another question that maybe pairs with "is this being recorded?" is "can I get the code to take with me?" Yes, here's a URL: github.com/mikeckennedy/jetbrains-webcast-build-with-mongodb. I'll be sure to give that to Paul to share with everyone as well. I'm going to be working in this fresh GitHub repo, and everything you see me do, I'll commit and it'll be public; you can grab it pretty much right after we're done with this session.

All right, so what are we going to use? We're going to use MongoDB, that's pretty obvious; you can't have MongoDB without that. We're actually going to use this library called MongoEngine, which is an ODM, an object-document mapper, which is like an ORM for a document database. There's a bunch of these for MongoDB; this one, I believe, has the right feature set and the right API to really make it super useful. We're going to be using this thing called Robo 3T, also known as Robomongo, a nice visual way to interact with our database, and of course PyCharm.

So I just want to set the stage really quickly for everyone out there who maybe doesn't have a lot of experience with MongoDB. I'm going to assume everyone has some basic experience with Python, at least; maybe you don't have tons of it, but some. Let's just talk really briefly. Before I show this thing that's on the screen, I just wanted to set the stage
and say that it's important that the databases you work with are popular. It's not like high school, it's not a popularity contest, but by virtue of these databases being popular, they are put through the wringer under some extreme circumstances. They've really been used in workloads that are probably massively heavier than what you'll ever ask of them, so if it can survive that, it can survive what you're going to give it. It also means there are a lot of libraries to work with it, like additional ODMs. If you look at the various document databases that are around, you'll find that one of them is unlike the others. The blue one, MongoDB, is clearly more used, more popular, and growing compared to Cassandra, Couch, Raven, and so on. Also, if you look at the Stack Overflow Developer Survey 2017, it was the most wanted database by almost a factor of 2 over Postgres, which gets a lot of love, and then it tails off pretty quickly from there. So this is a good place to be in terms of data.

Let's talk about how document databases work. MongoDB is a document database, but so are Cosmos DB, RavenDB, and some of the other ones, like CouchDB; they all have this general style of working. The idea is you create embedded documents. Sometimes it's actually JSON; in MongoDB it's a binary representation, but it's effectively the same as if it were JSON. You can't read binary, so I didn't put that on the screen. What you get is you can model stuff like you would in a traditional database. So you've got your columns, like, these are your columns: ID, title, course ID, and so on. But then you have something funky down here: what is this nested thing? You've got this thing, lectures, but lectures is actually an array of sub-objects, which have ID, title, video URL, and so on. So this here is actually an embedded document, and when you see things like this you can think of this almost as like
a precomputed join. This is a chapter from one of my courses in the database. If, almost every time we have a chapter, we want the lectures, and when we have one lecture we want the other lectures in its chapter, then modeling like this sort of precomputes the join so that you can get to it much faster, and actually the modeling is simpler, as we'll see. So this is, like I said, a precomputed join. Another big question, the $64,000 question, is: if you do this, can you still ask interesting questions about the stuff that's embedded? Could I go to the database and say "give me lecture 10106"? Can you query deeply into these things? Luckily, the answer is yes. You can do this super fast, you can do this with an index, and we're going to see this as we go.

All right, so what are we going to build? I told you you'd be familiar with it: we are going to build PyPI for our demo today. Now, just to set the stage, PyPI is a web application written in Pyramid, at least the modern, newish version of it, and it's got probably a relational back end; I don't actually remember what they're doing. So this is not to say that they're doing this with MongoDB. What we're going to do is basically focus on the data access layer of a theoretical PyPI implemented in MongoDB and Python. Then you would of course layer that into, say, a web application; you could use Pyramid like they do, or you could use something else. Okay, so this is what we're going to build. Before I show you this relational model here, let's just pull up the website and poke around for a moment. I believe that changed my resolution; I don't appreciate that. Let me put this back. Hopefully that is going to be okay.

Okay, while you're fixing that, is it okay if I pop a couple of things to you?

Please do, yes.

All right. First, someone mentioned, and it's true, the GitHub repo currently doesn't have anything in it but the
README. Is there some stuff you're going to push later?

Yes, absolutely. I'm going to write it; it doesn't exist yet, so I've got to type it first. As soon as I type it.

While you're figuring this out, I'll answer the next one. There's a question: why does PyCharm not have support for MongoDB? It's a good question, a very frequently asked question. PyCharm Professional includes DataGrip, our IDE for SQL, and that's why we don't include support for MongoDB: DataGrip only supports SQL-oriented databases available under JDBC. There's a very popular ticket over in the DataGrip issue tracker; you can go and vote for it. Now, one last question, and this one can go to Michael: how does MongoDB handle financial data, such as sales data, compared to a traditional database? You talked about that in your two-hour course, but here you can give a little plug.

Sure. I think it handles it entirely fine. There's only one area that I know of that is really critically different from what you might think of as a traditional database, and that has to do with transactions. MongoDB doesn't have transactions. However, that is not as bad as it sounds, because updates to individual documents are transactional. So, take that example of the course with the chapter and its lectures: if I want to change a bunch of lectures, I could do that transactionally, but I couldn't change lectures across courses that are unrelated to each other in a transaction. So if you were doing banking stuff, where I have to, say, debit this account and credit that account, and I have to do that in a transaction, that is probably not best done there. But certainly e-commerce sites are easily built on it. As an example, Under Armour had their e-commerce site built on MongoDB, and they were doing two billion dollars a year through that site, so that seems like it's fine. My e-commerce site runs on MongoDB, for example; it doesn't yet do two million dollars, but we'll see about that. Anyway, that, like,
but that's kind of what I was saying before about the popularity thing, right? Like, "Oh well, I've got a pretty busy e-commerce site." Well, Under Armour did two billion; do you do two billion? "No, we only do..." Okay, well, see? That's sort of the why you care about these things being popular. So anyway, I think the one thing you've got to think about is the cross-record transactions, but it's not as bad, because the records here are much richer, and often you'll find you really want to update different parts of the same record.

All right, that's our question. That's responsive. Great.

So let's look at PyPI. PyPI is, when you pip install a thing in Python, this is where it comes from, and this place here is where you find stuff. So let's go find MongoEngine. Actually, let's look at requests; I remember there's not that much information filled out there. If we look at requests over here, you'll see that it has a name, it has an install thing, and it has a latest version. You come down here and it has a description and other stuff. You'll see a bunch of details about it; sometimes, not on this one but on others, it'll have the health, code coverage, continuous integration, whether that's working, and so on. Here you'll see we have a release history; these are all the different versions. If you click on each version you get the same details, but for that version. If you go back here to the main page, you can see it's got the download files, it's got the license, who maintains it (there are three maintainers of it), and its license is here. Notice, this is really interesting: programming languages. Some of them have topics; let me see if I can pull up the MongoEngine one and get some slightly different information for you. Yeah, so here you can see topics are database, software development, and so on, and up at the top it has "the build is passing", "the code coverage is this", and "the health is that", and this is a per
release history thing. Okay, there are users; the users can be maintainers, users can contribute packages, and so on. So what we're going to do is model that in MongoDB. I'm going to build out a bunch of that, thinking about it and getting started from scratch, live, and then I'm going to drop in a little bit of finished code. That's not because we couldn't build the app in time, but because I'm going to work with this huge database with millions of entries that would take like ten minutes to generate from whatever we build. So we're going to have to do a quick little swap partway through to actually see it in real use. Not like, "Oh, I have two records in the database and look, it works fast." Of course it does, it has two records. But with millions of records, okay, that's great. So we're going to build this thing out.

Now let's look at the model here. I don't want the screen to be messed with anymore by PowerPoint, so I'm just going to leave it like this for now. This is what I think a reasonable relational model looks like. I didn't actually look at what they're doing, I didn't go and take their real database; this is just me looking from the outside, but I think it's what you need to generate the pages. Notice right here we have our packages, and packages is super intense, with lots of relationships to it. It has the topics, one-to-many: there are multiple topics applied to a package, like database and queries and ODM and whatever. There are the different OSes the various packages will run on. There are different programming languages, like Python, Python 3, Python 2, and so on; we saw that listed on both requests and others. This is the health: whether the continuous integration is running, the code coverage, and so on. There's maintainers, which is actually just a normalization table between packages and users, to say these sets of users are maintainers of these sets of packages. And then we have the dependencies
that the package depends upon: if I'm going to install this, it also has to have those other pieces. We have our release history, and it turns out that even though packages looks like it's super intense, release history is the place where most of the details go. And then what you don't see on that screen, but exists, is a download record. Every time somebody pip installs something or downloads something off of PyPI, it gets recorded: the version of Python, the app they were running (was it pip, was it something else), all that stuff is recorded, so they have a bunch of analytics. So this is the relational model, and this is a simple view; a real application would be even more intense. A real app would have many more tables; this is just trying to render that one page.

Now, how is this going to look when we model this in MongoDB? One of the things that I think is often overlooked about document databases: people hear NoSQL, they hear document databases, and they think performance is what matters. It's nice that it may be faster, but I think not everyone has a performance problem in the extreme, while everybody has a complexity problem, like "my program is hard to add new features to" or "it's getting complex". You'll see that working with document databases actually simplifies your data model dramatically, and that means it's easier to add features, it's easier to evolve, there are very rarely migrations or anything like that, it's easier to bring new people on, and so on. So take the model that you have here, and look at this: this is the document database model. We have packages, they have a release history, you can download a release, and users are related to packages indirectly. I'll just flip back real quick for you: relational style of modeling, document database style of
modeling. Which one do you think is easier? Which one is easier to understand as a new user, as a new developer? Which one do you think is easier to evolve over time? Okay, so this is nice in that we're not going to have a lot of models, but it does add the complexity in a different place, because what packages and release history are is richer. All that information is still there; it's just embedded in the document. If we look here, you can see our packages is actually super wimpy: we have an ID, we have a name, we have maintainers. But because the description may change, the dependencies may change, and the operating systems it runs on may change as you release new versions, all that detail actually goes in the release history. So we have the columnar stuff here, ID, package, version, and so on, but we also have topics; notice we can store these as strings. Operating systems as well, programming languages as well. The health: we can embed a health object into this, so we have a nested object that has properties like CI, coverage, and so on. We could have dependencies, which could be an array of rich sub-objects. So that's all really interesting. My legend didn't come out quite right, but the rest of the stuff is pretty much flat things. Okay, so it's time to go build this thing. Let me get this PowerPoint out of the way, because the really good stuff is actually writing code; that's why I called my folder "code".

Okay, I will go ahead and tee up a couple of questions, if that's all right.

Yeah.

Two good questions. First one: Thomas says he recently listened to the Talk Python episode with the Quart author, got really excited, and asked how well MongoDB plays with asyncio and Python 3.5+.

Oh, you're stabbing me with a knife. As far as I know, I don't think it does. To be fair, neither does SQLAlchemy. It's really unfortunate that the ORM
folks are not embracing some form of async. The only one that I know of is, well, I guess there's the async extension for Peewee ORM, if you've ever heard of that, and there's also Tornado SQLAlchemy, but it doesn't use async and await; it uses the coroutine yield style of programming, also not the best. So yeah, there's just not a great option for it. A database, well, it's a perfect place for async. But you'll also see that the database is super fast for what we're going to build; the response time is so fast that it may be okay. Still, it would be nice if it supported it. I don't have a great answer for that, sorry.

All right, next question, from Yuri: what kind of data would you not recommend be stored in MongoDB?

What kind of data would I not recommend?

I think it's a good question. From your video: anything bigger than 16 megabytes.

Well, that has more to do with how you model the data, not necessarily whether it goes in there. If you want to store a binary blob, there's a mechanism for storing that outside of that restriction. But what would you not put in? I think it's not so much that there's data itself that doesn't fit well in there; it's your usage pattern of the data. Do you have to do heavily transactional stuff that's really dispersed? Like, "I want to transactionally add two users and update a package and modify the download history": that basically cannot be done.

Does that mean you would consider a different database for relational data and use MongoDB for only part of your data? Again, that's something you have in the Quickstart; you have this diagram that talks about a multi-database approach.

Sure. I think there are two areas to focus on. One is: can I have MongoDB just be my database, or do I need something different? In terms of operational stuff, I think you can totally have MongoDB be your only database. A lot of people will try to, you know,
sort of step gently into the cold water, and they'll say, "We're going to do a little bit in one database and keep the rest in a safe one," and I don't know that that makes a lot of sense. Where I do see people saying "look, we need to have this other database" is around reporting. Suppose you have people who know the SQL language, and they write reports, and they use something like the Crystal Reports plug-in to create these charts, and it has to talk to a relational thing. You could have an operational document database and then a cron job that replicates that over into a precomputed, better-for-reporting warehouse database that is relational. That would make a lot of sense, I think.

Right, okay, that's it. Let's get to the code.

Yeah, so let's go to the code. We're going to open this in PyCharm, and on macOS you can just drag and drop it; that's really nice. Now, there's very little code here. I go to source, and I'm going to set this as a sources root. I want to add a virtual environment, so I can go over here to the properties and go to the project, and this is cool, this thing they added here: "add a new local environment", and it'll automatically suggest creating this new venv right there at the top. So that's great, we'll do that. It's thinking about it, and I'll go ahead and hit setuptools with an update; give it a second. It said there's an error, but it's not actually there, don't believe that. So we come down here, we can ask which python or which pip, and it's the one on my desktop, which is great. We're going to have to install a few things. We're going to install MongoEngine, which is going to drag along the base package called PyMongo that we're going to need, and a few others as well, like six. I'll go ahead and get that installed straight away. Now, I have this program here that we're going to run, and what it's going to do is set up the DB connection, and it'll let
us go around and ask a question, like "I'd like to get information about a package" (effectively display that page I showed you), or "show me an analytics report about downloads", or just exit. Now, this is not written, as you can see here. So our goal is to go write the code that defines the models (I think that's actually the most interesting part), and then we'll write a little bit of query syntax. Not here, because that would be wrong; we're going to put it into its own layer in our application and write it there. So let's begin. With the screen sharing, the right-click goes a little funky. We're going to create a directory, I'll just call this data, and into data... where do we want to start? I guess we'll define the packages first, so say packages. And because this is a Git repository (you see down here, Git: master), PyCharm says, "We're going to add this to your Git repo?" Yes, thanks. It doesn't commit it; it just stages it, basically. So down here, what we need to do to define one of these models that we were talking about is define a class, and we'll call this Package. It's going to be a singular thing, and it's going to derive from mongoengine. Now PyCharm says you probably need an import statement. Of course, if I type mongoengine... I think it's already installed; why does it want to install it? Anyway, it'll put it up here for us. What we want to do is derive from Document. Everything that's going to be stored as a top-level item in MongoDB derives from Document. There's also EmbeddedDocument, which we can use for the health record or, sorry, the dependency that we talked about previously, things like that. We're going to do a Document, though. And then, if you've done any of the ORMs, this is really similar; it's like Django, it's like SQLAlchemy. We're going to say we have a name of a package, and this is going to be a mongoengine.StringField. Notice, if
you type capital S, F (the capital letters of the name), it'll pull that out. That's really sweet. And we can even say required is True, which is pretty sweet. Okay, so we also want to have, I think, the maintainers. Now this is where it gets a little bit different from a relational database, because this is actually going to be a list. It's going to be a one-to-many relationship, but it's just going to live inside this document. Basically what that means is, every time we get a package, we are saying we would always like to take at least a little bit of info about the maintainers with us. But these map over to the users table, so we're going to say this is going to be a ListField, and what's going to be in there is a mongoengine.ObjectIdField; these are the IDs of the users. So we now have a list containing the IDs of the users who are the maintainers. I think that might more or less do it for us.

Now there is a quick little thing here that we want to do, and I'm going to borrow this from my example because I don't want to get it wrong. Let me just grab a little bit of metadata that we've got to put at the end here. It's a meta piece; I can never remember if it's underscore meta or just meta. We need a little bit of information to tell how this class maps into the database. Even though it's called capital-P Package, what you would think of as a table name, or the collection name, as they call it in MongoDB, is going to be packages, lowercase, plural. And then of course we want to have indexes. We're going to have a name here, and we would like to be able to say "find me the package by the name", and we want to do that in like a millisecond across many thousands or millions of these, not super slow table-scan style. And interestingly, we can also put an index on all of those, even the items within this list, just by saying that. Now there are two other things that we could put
here. Whenever I have a database, I find I almost always want to know when a record was created; that's something you want to know all the time. So I'm going to come over here and say it's a mongoengine.DateTimeField, like this, and we'll set the default to datetime (we've got to import that, so use this code intention to put that at the top; thank you, PyCharm), datetime.datetime.now. It's super important that you do not put the parentheses there, because that would set the default time of every record to when the program started. You want the function, not the value. So that will let us record that. And then we'll come back to this total downloads thing; I'm going to comment that out for a minute. We can also say, if you don't order it, what do you want the default ordering of these to be? We could say give them to me alphabetically, so we could write name; we could say I want them by created, with the newest one first by default, something like this. So you can put this ordering thing on here. All right, so this is pretty simple.

Now let's go and actually add the more interesting pieces. Just looking at the time, I'm thinking I'm going to grab those from the other place and just throw them in, because this is interesting, but I also want to cover all the aspects. So here are a few other pieces; we can look at them, it's basically the same thing. Here's a download history: this person at this IP address has downloaded this version of that package. Notice it's got created (I told you about that), and then it's got this sort of foreign key relationship to the package ID and the release in that package. Now, you technically don't need the package ID, because the release is going to know what package it's associated with, but you'll find that just a little bit of extra data in these document databases sometimes makes the queries so much faster
and easier. So anyway, we're going to store both of those things, and then just information about who downloaded it: they were at this address, they used pip, the version of pip was 3.7.0, the Python version is like 3.6, things like that. And similarly we're going to add some indexes so we can make that fast. Let's look at the release health. This was an embedded thing; it lives in the release history, and it has the CI, the code coverage, and the health index. Okay, the release history is the one that gets pretty interesting. The release history has when it was created, of course, and the package ID, the version, the description, the topics (notice again this is an embedded list), the programming languages (this is an embedded list as well), and the dependencies. We could actually have this be an embedded list of more interesting stuff, but we're going to make it strings. And then here we have an embedded document of health, so this maps over to that model that we talked about there. And of course it's just going to be called releases, and it has some indexes.

All right, so that pretty much sums up the data that we're working with, but in order to actually work with it, we're going to need to connect to it. It might be on a different server, it might be on a non-default port; hopefully it has a username and password, hopefully it has encryption, SSL, all that kind of stuff. So we're going to have this thing called setup here, and we're going to define a function called global_init; you can call it whatever you want. You're going to call this once, hence the "global", and it's going to basically set up MongoDB: it's going to tell it where to find the database and how to connect to it, so when you do a query this is just going to be ambiently available. So let's go over here; we'll say mongoengine, import that at the top, and say register_connection, and the first thing I want to set, whoops, is the alias.
This is pretty interesting. If we go back to any of these others, like this first one here, notice this alias right here: we say db_alias is core. So we're going to need to register what core means here in a second. But we have these downloads, and maybe we don't want to store the download history in that same database. I have this example for Talk Python: the actual Talk Python data is like 20 megabytes, but the download data is gigs. I don't care that much about frequently backing up the gigs, but I do care about backing up the 20 megabytes all the time. So what I actually do is say this is an analytics database for this app, whereas the main data would live in the real one. Just to keep it simple I'm going to say core, but the idea is you could have these different categories of database, which actually means different connections, potentially different servers. And then I think I just say name equals db_name, and that's it for the simple connect-to-localhost version. It's much, much larger for "here's the way I'm connecting, here's the encryption, here's the password", et cetera; in production this is a little more interesting. But here it is, so we're pretty much ready to go.

Let's go up here and play with this app. At the top it says we've got to register the connection, so we'll say setup, and we'll let PyCharm import that, if I spelled it right. No? I'll write it at the top: from data import setup. What am I missing here? I'll just go with this for now. So let's say this is going to be pypi; let me do something really quick and say underscore two. One question is: how do I define the database? What tooling do I use to generate it, what create scripts do I run to generate it, all that kind of stuff. And what you'll see is, actually, there's none of that. So let's just do this really quick here. I'm going to say, let's create a package. I'm
going to want to import that. Now, we're not really going to do this in the real app, but do it for now. I'll say the name is just plain. How do I put this in the database? It's super hard: you call p.save(). Boom, done, in the database. The way it works with MongoDB is there is no schema in the database; it just says, I'll take what you give me and put it in there. But what's really nice about this ODM is that Package is a class that has lots of structure, and so it helps keep that data integrity there. You can have required values and min/max values, uniqueness, all sorts of stuff. Some of that's in the database, some of that's in mongoengine, and they go together nicely. So if I just run this, over here I right-click and say run. Here it runs, and it already did its thing, so let's exit out really quick and go look at our other tool. We'll connect to our local database server (much smaller than necessary), and notice there's a pypi_2 and there's a packages folder. If you look at it now, two things were created automatically, the ID and the created date, and then this is the name we set, and then maintainers has a default empty list. If we added something to it, it would have put them in there, and so on. So this is really nice: we don't have to do anything to create the database. And I say that because over here I did create this one with all these things, and notice it even has the indexes that we put, and so on. So let's look at probably the most interesting one. Here's a standard release instance of some package, whatever is linked to by that. All this is fake data, by the way, and I'll give you the script I used to generate it. But here's the version number, here's some random lorem ipsum type stuff, here's the topics (notice how those are embedded in there), here's the programming languages that are supported, here's the dependencies, here's the health: it's failing at CI, it's 81% code coverage and 88% healthy. I don't know what that means.
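To make the point about defaults and class-level rules concrete, here is a small stand-in written with standard-library dataclasses. It is not mongoengine and does not talk to a database; `PackageSketch` and its fields are hypothetical names chosen to mirror what the demo showed (only the name is set, while an id, a created timestamp, and an empty maintainers list appear on their own).

```python
# Plain-Python stand-in, not mongoengine: it only illustrates how declared
# defaults fill in fields the caller never set. All names are hypothetical.
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PackageSketch:
    name: str  # the only value the caller supplies
    id: str = field(default_factory=lambda: uuid.uuid4().hex)  # generated id
    created: datetime = field(default_factory=datetime.now)    # set at creation
    maintainers: List[str] = field(default_factory=list)       # empty by default

    def __post_init__(self):
        # A "required value" rule enforced by the class, not by the database.
        if not self.name:
            raise ValueError("name is required")


p = PackageSketch(name="plain")
document = asdict(p)  # the dictionary that would be handed to the database
print(sorted(document))
```

The design idea is the one from the talk: the database accepts whatever it is given, so the structure and the validation live in the class that produces the document.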
There's a failing CI, but there it is. As well, you'll see that we even added these indexes against, say, programming languages. So let's see all the ones... I'll show you how to do this in Python in a moment, but we could come over here and write a query and say: I would like to find programming languages that have this in it. I only want to find, let's say, IronPython; I think that's an option. So you run this and, boom, there they are. Notice the response time: one millisecond. Let's do a count and see how many of those there are: there's 79,502 that support IronPython. That is really sweet. So pretty cool, and if you only want some of them back, then it runs really, really quickly. OK. Should we go through some questions? Yeah, let me have them. All right, sure. First one, from Nadeem: would you suggest anything in place of transactions, because MongoDB doesn't have transactions? I would suggest what are called atomic updates, or optimistic concurrency, one or the other. What I'm talking about is, if I wanted to both add a programming language and update the topics and change the health, you can do this with what are called in-place operators. So you can push the change: you could say something like push JSON onto programming languages and push this dependency onto that. This is really good if you need to increment a counter, like page visits or something; you can do that atomically using these operators. That's what I would say. All right, the next one, and I don't need to spend too much time on it: currently we're using Cassandra; what are the differences between Cassandra and MongoDB? Cassandra is not feasible for frequent writes; is it OK for frequent writes? MongoDB is perfect for frequent writes. I would say there's a couple of differences, and to be totally upfront, I've not done very much with Cassandra. I think it's more of a column-oriented database and less of a document database, but
it's also sort of an eventually consistent type of thing, I believe. MongoDB is strongly consistent; there's none of this eventual consistency weirdness of a lot of these distributed NoSQL databases. You write something, you ask is it there, it's there. It's strong consistency in that sense, which is nice. But yes, there are ways to write asynchronously to it to make it slightly faster, but it's typically not necessary. You can set up clusters and sharding and all that kind of stuff, which is way beyond this presentation. Next question, and if you don't have much experience with it I can take it: I already use JSON in PostgreSQL for my non-relational data; can you compare it to that? I think I'm going to let you take that, Paul, but I would also say there is a difference between "I can have a JSON column" versus "the database is entirely built around the concept of modeling in documents", right? I mean, if what you're doing is using Postgres and you have one column and it's just JSON, that's probably not leveraging it correctly. I'd also be interested to hear from you, Paul, on this: can you, say, put an index on an element in the JSON column? The JSONB support that came out a few years ago is to accomplish the thing that you were just talking about. I can have an index expression: I can index the value of an expression rather than just a column, and that expression can use the JSON operators to go fetch a child or perform a comparison or extract some text or whatever. So that's pretty clever, and it really is powerful, especially if you want to combine it with the things that you already like about a relational database, the columns and the transactions. In my opinion, at least going into PostgreSQL 11, the query support looks weird and is not as rich as it should be. At PGConf last year they said in either PG 10 or PG 11 they will land the SQL standard for, like,
JSON path or whatever it's called. So at the end of the day you can tell you're not dealing with a document database, but you do have far richer unstructured JSON support, at scale, than people would associate with Postgres. Yeah, that's kind of what I thought. Thanks. All right, next question. With the indexes in release history that you had there, there was one with health.ci, which is presumably an index on the CI field of release health. Can this be done with further dots for more embedded fields? Absolutely, as deep as your document goes, it can go. You just say that dot that dot that, and that can actually be multiple items. So this could have a list of objects which have properties, and you would just say programming_languages dot property dot dot dot, and it'll even do a multi-element index. Yes, you can. We've got a lot of questions; it's the benefit of being famous, you have a lot of people asking questions. Someone made a note that you were missing the dunder init, or was asking if that was the reason you had that import problem. I think it was because you were importing the module instead of a symbol in the module, and you would have had to do import something dot something, or from something import something. What you really wanted was to go get global_init in there, and it would have all worked correctly. The Robo thing: is it a Compass replacement? Well, Compass I think is the tooling that MongoDB itself has. I haven't done very much with Compass. This is a super nice tool; I can't speak to the other ones. What I can tell you that's really nice about this is you write in the shell language up here, with autocomplete, but your results come back in a graphical form. That is really the power of the CLI, but you actually have a GUI to analyze what comes back. It's really nice. Right, and a small point back on the PG and JSONB column: that thing that you currently have highlighted would look extremely ugly in the
current JSONB syntax for spelunking through JSON data. OK, and you can actually make it simple; it could look like this. Oh yeah. I will say DataGrip does have support for JSONB, just not for Mongo. There's an extension that will give you a UI somewhat like this for PyCharm; I forget the name. Yeah, it's interesting. It's just not Robo 3T; it's OK. It's a plugin for IntelliJ that adds a MongoDB kind of browser and query tools. All right. How many more questions have we got, Paul? Sorry, they are coming in faster than we can get to them. Can we write and read null or None values? Yes. mongoengine will actually not persist null values to the database, but they come back as None when you read them. The reason is that you actually store that per record, so if this was None, you would get the same results by having it effectively missing. It does make some of the queries slightly more tricky if you don't use the ORM, but yeah, you can definitely do it, no problem. OK, we will speed-round through the rest of these. Is there an auto-generation of the ORM mapping, which you were just talking about? I'm not sure what you start with to generate it. I think they're asking for mongoengine, an ODM instead of an ORM. Yeah, I mean, there's mongoengine, right? I don't know if there's... like, if I had the database and I want the models, I don't really know what that is; maybe the reverse. But I suppose you could write some Python code. It's just a dictionary, it's just JSON, and you could spit out classes based on that. But not that I know of. OK: when would you use PyMongo directly to access your database instead of going through a mapper? Right, so the mapper does go through PyMongo. PyMongo lets you write this instead of working with classes, as we'll see in a moment. However, there are two things that you might want to be aware of where you might want to temporarily, or sometimes, skip mongoengine. One is the deserialization of many, many
objects is slow in Py... sorry, in mongoengine. So suppose you do a query and you get back a hundred thousand packages. The actual query time may be super fast and the response of the database super fast, but from the time it hits that layer until you have objects of that type in memory, in a list in Python, could be, I don't know, half a second or something unacceptable. Now, should you be pulling a hundred thousand records back? Probably not, but sometimes you need to, and that can be a problem, so you could use it then. And there are some times when you just can't make the query system work in its style, and then you could drop down as well. Right, mappers are magic; sometimes you've got to remove the magic. A follow-up on transactions: Nadeem points out that things are atomic only within a single document; if multiple documents are being targeted, then it's not atomic, right? Yep, absolutely. It can be as complicated as you want, but you can only update the thing you see on the screen, and nothing else, atomically. Is it possible to do aggregations? Yes, there's an entirely different API, this sort of data analytics aggregation framework. It's a replacement for MapReduce. It's very powerful; I almost never use it. And one more, and then there's still four more in the queue: can it store binary JSON or non-JSON objects? The thing that it stores is binary JSON. It never stores text as text; it just shows that to you as text because you don't read binary. It will also store binary objects, including very large ones, using something called GridFS, which lets you store arbitrarily large file objects, split potentially across your sharding, even, to make space for it and stuff. But I don't do that; I just think, well, there's S3 or DigitalOcean Spaces or something like that, or the hard drive. But it's possible, yes. Sorry for all the questions. Resume, and if we have time at the end we'll get back to some of them. Yeah, it sounds
good. So this data, like I said, is quite large, and it takes like 10 to 15 minutes to generate it, so I don't want to do that live. That's why I dropped in the pieces that match it directly. So what I want to do now is go work with this, assuming we had written all of those classes that you just saw. All right, so we've written these and we want to go work with them. Look at this import; I don't know, anyway, it seems like it's happy now. So what we want to do is go over here and write our stuff, and then we point this back to our real database, not the little one. And I don't want to do my own stuff there. I want to create a different set... now, this is simple, so there's only going to be one, but a set of services that let me encapsulate the data access for working with our data. So I'm going to call these services, not as in web services, but as in provided to the rest of our app. I'll create a thing called package_service, and in here we're going to have some functions. For example, let's do something really simple. Let's go down here to this print header thing... actually, I did that in a class. Let's just do it like this; putting it in the class is fine. So I'll put a class here and we'll have a class method... actually, we'll just let it do this for us. So what we're going to do is import package_service like so, and then we want functions for how many packages are there, how many releases are there, how many users, how many downloads. So let's just write these really simple ones first. Notice that's got a, not red, a yellow background, because PyCharm knows that package_service doesn't have this. So I can hit Alt+Enter and it'll write this for me. OK, so let's go and write this function. Here's what I return: we need to go to Package. This is how you query the database. This is our object that we wrote, and we can import it at
the top, and we're going to go to this thing called objects. Normally you would put something like "I want the name to be equal to whatever", right? But we don't care about the name; we just want to do a dot count on it. OK, and we're going to do a release one. So we're just going to crank these out; they're going to look super similar, but it's going to be ReleaseHistory, I think. Yes. And we want another one that is for user, so we're just going to bust these out really quick. This is going to be User, and this has a nice side effect of importing everything that we need. And we have one more, which is going to be download count; that's going to be Download. There's a little squiggly right here because PEP 8 says there should be a new line at the end, so I let it fix that, and that'll put it at the end. So let's run this really quick. Let me uncomment that. Yes. Let's run this, and you can see our little PyPI data explorer is up. We have 50,000 packages, 250,000 releases mapped to those, 20,000 users, and half a million downloads. And what we want to do now is exit, because we need to write some more stuff. So this is really quick and easy, but now let's go actually write the interesting thing. If I say query packages, it's going to ask me for a name, say, hey, which one do you want to search for? So let's go over here to this package service, and I'll leave that note there for a minute. I'll say package equals package_service.find_package_by_name, let's say, and we'll pass in the name. Again, we can let it write this here. So we're going to come down here and do a query just like before: we'll say package equals, like this, Package.objects, and now it gets interesting. We'll say name equals name, like that, and then we'll say dot first, because we're looking for one, right? We could put a uniqueness constraint on this to make sure there's only one, but I guess I didn't do that. And then we just return that package. OK, so that's working pretty well, and
we might want to do a little test, because they could type in anything and it might not exist. We'll say if not package; if it didn't exist it will just come back as None. So we'll print, let's go crazy and do some f-strings, "Sorry, no package with name", and we'll just inline the name here like that, and then we'll bail. Next thing we want to do is get the release. We need the latest; I'll call it latest, actually, to make it more clear: get the latest release. Because remember, the package just has things like the name and the maintainers; it's really the releases, each version, that are most interesting. So this is a slightly more interesting query. We'll say package_service.latest_release, and we'll just give it the package here. All right, so go and let PyCharm write that. And we might want to get a little bit of help here, so if I type package dot... oh my gosh, well done, PyCharm. Well, let's do this though: let's go over here and actually give it a type hint and say Package as well, and say this returns an Optional ReleaseHistory, like that. Type hints, right? OK, so this comes from typing, Optional; it means it could be None or it could be a thing. So down here we'll say release equals ReleaseHistory.objects, and the way we want it, we want the package_id equals package.id. OK, and then this is going to give us a collection, a query set, so we'll say first... sorry, take that back, that's not quite right. Let's do this and this, but we have one more thing we need to do first: here we need to say dot order_by. The way you do this part, you put a string here, so we'll say negative created. Now remember when I put this here... oh, it's ordered by version number... so this is perfect. So we're going to say give me the most recent one. You could sort by version number, but I randomly generated it in the database, so it's not going to have the effect you're hoping for. All right, so we got this, all good, right? We're going to go find all the ones that match, order them by created in reverse,
you know, show me the newest first, and then just get that. And notice PyCharm is saying, we expected you to actually return maybe a ReleaseHistory, maybe it's None. But once I say return release, then it's cool; it's like, all right, gotcha, you're doing what you said. OK, so let's go back here and do the same thing: if not latest. Now, why might there not be a latest when there's a package? Not every package has a release. Like 49,000 out of 50,000 of them have releases in the database, but there's a few that have no releases because, well, because of randomness. So we'll print, in an f-string, "Sorry, the package", curly curly, "has no releases", and this will be package.name. OK, and of course don't forget the return. Now, what we're doing is we're actually timing this, and then I'm going to show some details. The reason I'm timing is I want you to see how long it takes to do these queries and go talk to the database. And in fact I think I need one more thing: I need the maintainers, and this is going to be probably the most interesting query we do in this whole thing. maintainers equals package_service dot find, get, whatever, maintainers, and it'll take the package. All right, so how are we going to do this? Remember, this is like a foreign key relationship, and in a regular database this would be a join. So how does it look here? This is going to return a List, a List of User, again from typing. So what we're going to do: we have our package, and this is going to be pretty cool. We'll say users equals User.objects, like this. Now what do we put to say I would like to find... let's go look at Package real quick. I would like to find all the users who have an ID that is contained in this subset here. OK, and also let me put this total downloads in here; we need to do this equals and dot, and I think this is a long field, not a list, a long field. OK, so we're going to go over here, I'm going to get all the users, and we want to say something like their ID is kind of equal to
package.maintainers, right? But this is a list of IDs, so what we really want is like an "in" query: I want to find all the users whose ID is in those maintainers. So you say dunder in, well, double underscore in. All right, and that basically says go find me the users where the ID is contained within this list of IDs. Now, I want all of them, so I'll just return. Now, this is a query set, and my philosophy is the database should be done executing here in this package service layer; it shouldn't bleed out as a lazily evaluated operation. So I'm going to turn this into a list. There's a slight performance hit doing this, because maybe you're just going to iterate it, but it's actually nice and clear: if there are going to be any errors accessing the database, it's going to be in these functions. So that'll find us the maintainers. Over here, there's a bunch of stuff I want to print out, so let me just grab that from my other example that I pre-prepared, because I want to answer your questions and not type out formatting text. OK, so let's see here: get our maintainers, this release I had called latest, I think, and this we're going to call package. There's my maintainers. I'm going to put this down here. There we go. All right, I think my little t1 is at the wrong place. So we're going to time how long it takes to do those database queries, and I'm not going to time how long it takes to print them out. Remember, there's none of this lazy evaluation stuff; it's all done here. So let's cross our fingers and run this and hope it all hangs together. Let's query the packages. Now, I just randomly named them package 1 through 50,000, so how about package 100? It worked, amazing. So check this out: we went down here and we said the package is this, and we got its health status, right? On a relational database this could easily be another table, like a health table or a package thing you'd do a join against,
but it's embedded, so it's just right there. We got our current version, here's this description, and that's random. The maintainer, there's only one; we'll pick a different one. A bunch of supported languages and two dependencies. The very first time, that took forever: it took 11 milliseconds to do those 3 queries. Let's try again. Query; this time I want package 7777. Now it takes five, six milliseconds, right? It's a little bit more warmed up. OK, so we see this one's also failing; I should have tended towards a more positive outlook. Anyway, this one has a bunch of maintainers, so that's that "in" query that we did, which is pretty awesome, and then these are the embedded elements that came back. All right, so let's do one final thing that I think is interesting, and then I'll open it up to questions. The last thing to do is to go over here and grab this t1. I want to write a query that says give me the most popular elements. Now, there's two things that are interesting, and it'll just take a moment to cover, and then we'll wrap it up. So let's say we want top packages; that will be package_service dot... take your pick... package_service.popular_packages, and let's say we're going to set the limit to be equal to 10, right, because that's how many we're asking for. Now let's go write this; well, excuse me, let's let PyCharm write this. Beautiful, thank you, PyCharm. It takes a limit, and that limit is going to be an int, and this is going to return a List of package_service... no, Package. So down here we're going to write a query, and it's going to look a little like this one. I'm going to say packages equals Package.objects, and we don't care about which one; we just want to sort them, say order_by total downloads. Whoops, no, I want the biggest ones first. I could make this plural; that would be fine. Then I want to limit the results to the number that I passed in, right? So I don't want all of them; I just want that many. Now let's make this
more readable, like this. Then again, this is a query object that when evaluated will hit the database and do its thing, so let's make it hit the database now. All right, so this is going to return those items here, and then let me also grab just the text for printing that out, again just for time's sake. Yeah, like this. OK, so that's going to let us print them out. We use a cool little f-string to do that in milliseconds. Let's run this again. Oh, I notice there's a bunch of these running now. Let's tell this to do a single instance, like restart it if I haven't stopped it. So let's go down here and make these go away, because it has that infinite loop. OK, so let's query package one more time: package 7777. Great, four milliseconds. Now let's just ask about the download analytics. Cool, there they are. So these are the most popular packages; I know you recognize these names, I use this one all the time. So we got those back in 2.9 milliseconds, and we've got all that information. Very, very sweet. Paul, I think that pretty much concludes our little demo. We can do the two things it says in the menu and we can exit. So maybe open up to questions, and then I'll do a quick review. I have to keep it pretty quick because of the timing... actually, let me review the concepts and then we'll do the questions, OK? All right, yeah, just for time's sake. So let's look back at the concepts that we've covered. We didn't talk a ton about designing the data. This is actually the hardest part, and it's a bit of a black art. It's not like third normal form, do that, and that's how you model your data; it's really kind of open-ended. So there's five or six questions you want to ask yourself. Is the embedded data wanted 80 percent of the time? In our example, take the dependencies: when you have a package, most of the time do you care about asking questions about its dependencies? Then embed it. If not, then don't.
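The embed-or-reference trade-off can be sketched with plain in-memory data. This is only an illustration under made-up names and values, not mongoengine code: the embedded dependencies arrive with the package itself, while the referenced maintainers need a second lookup that keeps the users whose id is in the package's id list, which is the same idea as the double-underscore "in" query from the demo.

```python
# Plain-Python sketch with in-memory data; all names and values are made up.
# A "users collection" keyed by id, standing in for a separate collection.
users = {
    1: {"id": 1, "email": "a@example.com"},
    2: {"id": 2, "email": "b@example.com"},
    3: {"id": 3, "email": "c@example.com"},
}

package = {
    "name": "package_100",
    # Embedded: dependencies travel with the package, no extra lookup needed.
    "dependencies": ["requests", "six"],
    # Referenced: only ids are stored; the user records live elsewhere.
    "maintainers": [1, 3],
}


def find_maintainers(package, users_collection):
    """Second lookup: keep the users whose id is in the package's id list."""
    wanted = set(package["maintainers"])
    # Materialize the result as a list so no lazy work leaks to the caller.
    return [user for user in users_collection.values() if user["id"] in wanted]


print(package["dependencies"])
print([user["email"] for user in find_maintainers(package, users)])
```

If the maintainers were needed nearly every time a package is loaded, the 80 percent question above would point toward embedding them and dropping the second lookup.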
Similarly, how often do you want the embedded data without the containing document? There's not a great example here, but in my course I had the chapters, and the lectures are embedded. Do I just want the lecture without the chapter and the other lectures? You can't easily get that when you embed it. You could work with it, but you can't pull it back without making the database do work. So this tells you how broken apart and related your data is versus nested. Paul mentioned that the documents can only be sixteen megabytes, so if you're going to put analytics like page views in a page record and it's going to grow infinitely, that's a problem. And they make it a limit to save you, because if you're going to do one query and it's going to read sixteen megabytes off the disk, over the network, and deserialize it, that's a crazy amount of work. So that's why they have this sort of limit on the size, and why the bound should be small. And then how varied are your queries? How many different angles are you asking for this data from? The more angles, the more separate the data should be; the more focused it is, the more clear it is, like, well, these things are always needed, so let's just embed them versus separate them. And then there's this concept of an application database, for like a lot of microservices, or do you have one massive shared database? So these are kind of the rules of thumb that I use to decide how to model your documents. There's a great book called something like MongoDB Applied Design Patterns by Rick Copeland. The title is maybe not quite right, but the author is, and he has some really great examples of very non-intuitive ways to do that. So, registering connections: you saw that, super easy, just register_connection, and then you refer to it. Creating documents: like a lot of the ORMs, you define the columns here, except it gets interesting because you can have embedded documents. Now, this would do an animation,
that's cool, but I can't get it at my resolution for the recording, I think. So you can have these embedded document lists and put them in here, and that's really cool. This is one of the powers of modeling. In this case, this is like a little fake Airbnb thing for snakes, so you can book a cage and just store them this way. OK, when you do a query, you can either do objects and put the stuff in there, or you can do dot filter and then append many filters to one objects call. So it depends: if you have an if case to add on another item, you'd use filter, and then just call first, right? Very nice. You can do queries deep down into the objects. We didn't see this, but the bookings field contains a list of things, each of which contains a guest snake ID, and we want to ask the question: show me all the records where that ID, across that list, is contained within this other set that I have here. So you just use the double underscores to go from a higher level to the next level down to the next level down, and then apply the in operator on it. So it's really rich, what you can do here. All right, so that's pretty much it. If you guys want to check out more of this stuff and you don't want it rushed through, I've got a two-and-a-half-hour free course, and I've got a seven-hour, super detailed other course with deployments and performance and all kinds of stuff. Feel free to check those out. Other than that, Paul, I think I'll just open it up to questions. All right, cool. Be careful what you ask for; we got a lot of questions. I only have probably 10 minutes, maybe 12 minutes, to answer questions, and then I've got to run, so we've got to time-box it. OK, then let's go through them fast. Hand-wavy consultant talk about scaling: is it better to have large files with lots of data, or small files with data spread amongst them? You talked a little bit about it a second ago on analytics. Yeah, it depends on what you're trying to do. If you
want all the data, it's probably better to have fewer large documents, because that means you're not doing second queries. For example, here in my little service thing, I had to do one query, which one, this one, one query to get the package and then one query to get the maintainers. If you always want the maintainers (in this case it doesn't make sense), maybe that gets embedded. It's a real tough trade-off to make, but it can handle lots and lots. Your course says the 80% rule, right? Yeah, I would say so. What I would also really emphasize is that it's critical that you put these indexes here. Indexes are more important in document databases than they are in regular databases. Why? Interesting. All right, here's an easy one: did you set up a local server? Yeah, that's what this is. All right, cool. Can we run mongod with a config file? You just say mongod --config and then it has a config file, and that's what we've been working with here. All right, on the null data one: we can't read null data through the shell; is that something related to the Python mapping? Is that true? The problem is that if you use PyMongo... the mongoengine mapping doesn't push it through as null, in JavaScript, in JSON; it just doesn't put it there. And so it depends, right? There's two ways: if programming language was null in JavaScript, None in Python, this query would still work and it would still find only ones like this. But if I wanted to say this, it gets tricky, because there you have to say it doesn't exist. Null and not existing, these are not the same, so you've got to write a "does not exist" query. You totally can do it, but it's slightly annoying in that regard. But it saves you space, you know, and speed, so it's a trade-off. OK, if your data is very simple and can go in one relational table, would you prefer MySQL over Mongo, or Mongo over MySQL here? Yep, no, no. The reason is migrations. If I want to add something to this,
yet another thing equals mongoengine.IntField, say, right? This totally works. This fails if this is SQLAlchemy pointing at a relational table: I've got to go do a migration and a migration script. And I think it's just that the evolution is simpler in the document database. That's why I would choose it; there's not a huge other reason. OK, does mongoengine provide a connection pool, or even something with thread-local? Yes, it has a connection pool, connection pooling, and it talks to replica sets, and if there's a failover it'll auto-failover to the other replicas, all that kind of stuff. Yeah. All right, Yuri already does this but wants another opinion: how do you store and process regular expressions? Are you storing the regular expressions themselves, or are you trying to use expressions as part of a query? I think regular expressions are just stored as regular text, I would guess. But there is a way to do like a caret type of thing; I don't remember the syntax exactly. But you can actually do regular expression queries in the shell and things like that. But indexes only work under certain circumstances, like if the regular expression is anchored to the beginning of the string, or something like this. What are the most common performance-killer mistakes that we should avoid? Common performance killer number one: no indexes, right? You can always ask, is there an index, if you say explain, tell me what you're doing, on the end of any query. If you see an index, good; if you see a table scan, uh-oh. So number one is a missing index, and the reason is, with these nested objects, those would normally be foreign key relationships against primary keys. When they're nested, they're no longer automatically given at least one key, so there's just fewer indexes by default; you've got to put them back, basically. The other one is, I'm going to say, bad document design, or a document design mismatch. It's really hard to say what that means; it could be over-embedding, it could
be under-embedding, but it's just not aligning your document design with your most important common queries.

Yeah, if your common query takes 800 requests to fulfill its function, you didn't embed it correctly.

Yeah, exactly, exactly.

And while I was answering questions, I missed this one: what was the shortcut to have PyCharm auto-write the shell of a function?

I didn't see which function. If you say, like this, package service dot other thing, okay. Any of the code intentions, meaning any time you see a light bulb, regardless of whether it's functions, imports, whatever, it's Alt+Enter.

No, I didn't. You just choose from the list?

Yep, that's what I did. Alt+Enter is your magic janitor.

How do you feel about Couchbase compared to MongoDB? Don't spend too much time on this one.

You know, I honestly haven't spent a lot of time on it, but it really comes back to this picture. There are many more ways to access MongoDB: many more libraries, many more ORMs, or ODMs. It's been put through a beating a lot. It's just more used, I'm going to say more mature. There's a team of at least a hundred people that have worked full-time on MongoDB since, like, 2010. It's an insane amount.

It's really an interesting story around open source business success. But there was an article by... it was RethinkDB that folded last year?

Yes, there was, sadly.

Yeah, and sadly it was a better architecture, a better design, and it lost. And the founder wrote this long article about why Mongo beat it.

Yeah, it's not always technical. The business of software is not just software, sadly. But along those lines, MongoDB, I believe, is bringing in new features around real time: I subscribe to changes on a query and it'll push those to you, kind of to fill the gap that RethinkDB left.

All right, last two. Can you re-mention the book and author for data modeling?

Ah, sure. It's Rick Copeland. I'll even do this: I'll give you two resources, not one but two. On Talk Python 109 I interviewed him and we talked
all about this stuff. And then his book is MongoDB Applied Design Patterns. It has the O'Reilly and the Amazon link here, and talkpython.fm/109 is the link you need.

All right, last one: how do you explain aggregations in the shell? Really quickly, right? Please feel free to just show aggregation.

Let me look. I just want to show people what it looks like; it looks really, really different. It's a competing idea to MapReduce, but it's richer. This is the kind of stuff that you would write: you would say, go to your collection and aggregate, and match (this is sort of the where clause) where the status is this, then create a one-to-many mapping by ID of customer, and then do a total where you sum up all their items. The problem is you don't have a lot of operators like sum and group by and stuff like that in document databases, so this aggregation framework lets you do that and still execute them within, say, a database cluster, and then give you back the answers.

All right, believe it or not, we're at the end of the questions.

Good, because we're also at the end of time.

All right, let's do the wrap-up.

All right, sounds good.

All right, we're going to switch it over to my screen. Thanks, Michael, for taking the time to talk with us about MongoDB and Python applications. As a reminder to everybody: lots and lots and lots of material. Tell us real quick about the all-products bundle pack for people who need help on everything.

Sure. So I have a bunch of courses on Python, web apps, PyCharm, MongoDB, and things, over at training.talkpython.fm, and I set up what is kind of a subscription, where you can buy all the courses that we've got there and all the ones that are coming, and there will probably be 15 to 20 courses coming this year, just all under one banner. But I don't like subscriptions. I don't like making people pay for stuff they've already bought,
kind of holding their reference materials ransom. So what it really is: you own everything you pay for, and if you want the new ones next year you'd have to kind of renew. But it's kind of a buy-it-once, own-it-forever thing. So that's what I've got going on there.

Right, it's kind of like our subscription model: you have a perpetual right to use the version you bought, the version from back when you started.

Yeah, I hadn't thought about it, but it's exactly like that, yes.

All right. If any of you have any questions later, please don't hesitate to reach out to us by email or social media. Tweet at us; we will probably see it. Go and fill out a comment on the blog post for the recording of this; we will answer. If you'd like to get more information on PyCharm, please go to our website at jetbrains.com/pycharm. We'd love your feedback on this webinar. Seriously, you're going to get an email, and it's going to have something about filling in a form. Fill in the form. We read it, we learn from it, we pass information on to Michael. But otherwise, feel free to contact us on Twitter.

The recording will be made available on our YouTube channel and announced on Twitter and the blog. It will be soon, by some definition of soon: either tomorrow or the next day, probably this week. If you haven't already, please check out our PyCharm blog. We're really proud of it; we put a lot of work into it. On our blog you can find up-to-date PyCharm news about releases and events, in addition to educational resources. So, for example, the recording of this webinar will be published first on the blog. We'll also provide some additional links and information from the presentation on the blog when Michael pushes the stuff to the repo.

So that's it for today. Thank you very much for joining us, and I hope you all have a nice evening. Thanks, everyone. Bye.
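
Editor's sketch of the aggregation idea described in the Q&A above: match on a status, group by customer ID, and sum up each customer's amounts. The pipeline is written as a list of stage documents, which is the shape PyMongo's Collection.aggregate() accepts. The field names (status, cust_id, amount), the sample orders, and the run_pipeline helper are assumptions made for illustration and are not taken from the webinar. Since no database server is assumed here, run_pipeline is only a tiny pure-Python stand-in that understands an equality $match and a $group with $sum; it is not how MongoDB executes pipelines.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are assumptions.
ORDERS = [
    {"cust_id": "A123", "amount": 500, "status": "A"},
    {"cust_id": "A123", "amount": 250, "status": "A"},
    {"cust_id": "B212", "amount": 200, "status": "A"},
    {"cust_id": "A123", "amount": 300, "status": "D"},
]

# Stages in MongoDB's aggregation syntax:
# $match acts like a WHERE clause, $group like GROUP BY with SUM.
PIPELINE = [
    {"$match": {"status": "A"}},
    {"$group": {"_id": "$cust_id", "total": {"$sum": "$amount"}}},
]


def run_pipeline(docs, pipeline):
    """Toy stand-in for a server: equality $match and $group/$sum only."""
    for stage in pipeline:
        if "$match" in stage:
            cond = stage["$match"]
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in cond.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")  # "$cust_id" -> "cust_id"
            totals = defaultdict(lambda: defaultdict(int))
            for d in docs:
                for out_name, op in spec.items():
                    if out_name == "_id":
                        continue
                    src_field = op["$sum"].lstrip("$")
                    totals[d[key_field]][out_name] += d[src_field]
            docs = [{"_id": key, **sums} for key, sums in totals.items()]
        else:
            raise ValueError(f"unsupported stage: {stage}")
    return docs


results = run_pipeline(ORDERS, PIPELINE)
for row in results:
    print(row)
```

Against a real database, the same PIPELINE list would be passed to the collection instead, for example db.orders.aggregate(PIPELINE) with PyMongo, where db.orders is a hypothetical collection name; the server then runs the stages and returns one document per customer.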
Info
Channel: JetBrainsTV
Views: 27,758
Keywords: PyCharm, webinar, python
Id: rlvGCTE4MI0
Length: 80min 6sec (4806 seconds)
Published: Mon Feb 05 2018