Master Using SQL with Python: Lesson 9 - Using PostgreSQL for Data Analysis

Captions
Welcome back to my channel. Today we're going to be talking about Python and Postgres and how to use them together effectively, a really exciting topic. As usual I brand myself; today it's my Spark mug, which relates to a book I wrote on Apache Spark that I'll talk about in a minute, and of course I'm also wearing my pipeline shirt in honor of this topic. I don't have a Postgres shirt yet; got to get one, or maybe a Postgres mug, that would be cool. The emphasis here, what I want to focus on, is that these two products, both free and open source, both with very powerful development communities, are better together.

If you're not aware, I left Microsoft a little over a year ago and spent most of the last year writing a book called Master Azure Databricks. It's all about how to use Apache Spark, by far the most popular open source big data and analytics platform. The creators of Apache Spark founded a company called Databricks, which is really a wrapper around Spark intended for the cloud, and it runs on both AWS and Azure. I got very familiar with it when I worked at Microsoft, and I left to write the book because there was nothing out there covering Databricks, or Spark for that matter, in one place where you could learn both together. It's a comprehensive resource that takes you from beginner to expert methodically, very visually, step by step, hence the title. It frames everything as a project: a company starts a project, you're assigned to it, and the book walks you through until, at the end, you've got a machine learning pipeline with all your data analytics running. It covers PySpark and Python coding, plus sparklyr and SparkR, and I was even able to reach out to some of the rock stars around sparklyr and SparkR, people who wrote these things, some at Databricks, to ask whether I was missing anything, and I got some really good input.

It's available on Amazon. You might ask, can I have it as a PDF? No, you cannot. However, there's a free Kindle reader app available for any device, so that shouldn't be an obstacle; it could be your phone, could be anything. And something you may not be aware of: Kindle devices are dirt cheap; obviously Amazon wants you to read their books and spend money. A seven inch Kindle, I just checked online, is something like 49 dollars. I was so excited about how cheap these are that I just started buying them; I have a collection now. I have three of those, a couple of eight inch ones, and two ten inch Kindles, one for downstairs and one for upstairs. I even have a Kindle Paperwhite, because when I'm reading fiction in particular I don't like the glare and the color, but for technical books I prefer one like this, because sometimes there's color and contrast I want to see, and the screen is generally larger. Anyway, go get it; it's worth it.

Where are we going in this video? First I want to call your attention to the links, which I'll hit up in a minute. One is for getting Postgres itself, because obviously you're not going to get far without it; or feel free to use a cloud service, since all three major clouds offer Postgres as a platform as a service. What does that mean? It means you click a few buttons and suddenly have a database sitting in a cloud environment, ready to connect to and use, that simple. You don't have to worry about administration or anything, because that's what PaaS does, platform as a service. Microsoft has one, Amazon has one, and Google Cloud has one; those are, in my mind, the three big clouds.
Okay, so you can follow the link. I also like to sometimes have a database to start with that I don't have to build, one I can just play around with and query, and if you follow the PostgreSQL Tutorial link, it has a demo database you can install and even a tutorial, which is awesome. We're going to talk about what Postgres is, why use it, and then a riveting demo. Let me step through here, and I'll quickly click on this just to show you that you can go to the download page and pick your flavor. I use it on Windows. When I talk about Postgres I think of a consulting contract I took: I didn't really know much about Linux (I had used Unix) and I knew nothing about Postgres, and I was able to get up to speed within a day and say, sure, I'll take that contract. That's how approachable it is; it's really easy to pick up if you've worked with other database systems like Oracle and SQL Server. Let me get back to my slides for a minute.

Okay, what is Postgres? By the way, you can say PostgreSQL or Postgres. Originally it was just called Postgres, ending at the s, but SQL support was added many years ago and they decided to emphasize that in the name, PostgreSQL; however you pronounce it, most people just say Postgres. So what is Postgres? I'm going to take the easiest framing, because if you know me, you know I'm lazy; I'll work night and day to find a better way to be lazy, that's my motto. It is a free and truly open source enterprise database management system. By "truly open" I mean that, unlike MySQL, which many would say got taken over by Oracle when Oracle bought Sun and lost its independence, there's no vendor to worry about. With MySQL you now have the concern: is Oracle trying to manipulate MySQL users, could they stop development? The creators of MySQL went on to create MariaDB, and now you've got this conflict and two different development paths for separate products. Postgres does not have any of that baggage. It just works, it's been around a long time, and as far as I can tell no single company owns Postgres, so there's no fear that somebody's going to buy it and own it, which I think is why it's being adopted so freely.

It also performs really well. It's performant, meaning it can scale: it supports large databases, and essentially anything you can imagine Oracle or SQL Server supporting, you can consider Postgres able to support. It has features that really surprised me when I started working with Postgres a few years back: the ability to connect to other databases and read flat files, the ability to create all kinds of indexes, including covering indexes with INCLUDE columns and things you don't typically see, even true partitioning, which is a relatively recent addition, plus features you don't find in other databases, like table inheritance. Go figure, but really good stuff. It has resiliency and high availability, and I mean things like load balancing across multiple servers, the ability to scale out (look at what Amazon does), and failover: if this database goes down, flip to another one. That matters for licensing too, because in many scenarios, if you have a failover server, a commercial vendor is going to say that's another database and charge you more. With Postgres, it's all free.

And it runs on many platforms, which I find intriguing, because I think of Oracle as a Linux and Unix database (when I used it, that's what it was), and I still think of SQL Server as a Windows platform database. I know SQL Server runs on Linux and in containers now, though not exactly natively in my mind. Let's face it: if you're on Linux or Unix you're probably on Oracle, and if you're running Windows you're probably on SQL Server,
if you're using a commercial product, that is. Not so with Postgres. I needed it and tried it out: I downloaded the Windows binaries, installed, and was up and running in minutes, so I was really excited just about that. Now, Amazon is obviously a large, successful high-tech company, and their cloud has by far the largest market share, way ahead of anyone. For them to adopt Postgres as, if you will, their favored data service, the one they wanted to integrate heavily, is no small compliment to Postgres and what it does. Redshift, if you've heard of it, is a scale-out platform akin to Microsoft's Azure SQL Data Warehouse, which has now become Synapse (I'm a little cautious there, because Synapse is becoming more of a collection of things), that is, a scale-out relational SQL database. And all three cloud platforms I mentioned offer Postgres as a platform-as-a-service database. I want to emphasize that this means you won't pay licensing costs. If you were to use SQL Server on AWS, there's going to be some additional cost just for licensing, and if you install it on a VM, you'll pay full licensing cost. These are really great reasons why Postgres is really good.

Now, what about Python with Postgres? As I learned Python, I was surprised to see that Postgres seems to be its favorite database, the default database if you will, aside from SQLite, the client database that's built in (I have videos on that). If you're using an enterprise database with Python, Postgres is the standard; it seems to be everywhere, you'll always find examples, everyone's talking about Postgres with Python, and again, it's free. That makes it a great place for data analytics, and if you're doing something like Django, building full-blown websites and web apps, it's also a really good choice; it is in fact the default, ready to go, so you can be doing Django websites with Postgres almost instantly.

Now, there are several ways to use Postgres with Python, because of this integration and all these features. The first, which I believe I demoed in another video, is Open Database Connectivity, ODBC; that's the way you would access something like SQL Server from Python, and it's also an option with Postgres. There are also a number of Postgres-specific driver packages you can use, and we're going to see one of those, a very popular one, and how to use it. We're also going to take a look at SQLAlchemy. Normally I would not choose SQLAlchemy for data analysis. SQLAlchemy is a really great object-relational mapper, ideally suited to developing web or mobile apps where you want an OLTP connection from your app to the database. When you've written your app in an object-oriented language, as classes and objects, it's nice to have a mapper that lets you wrap code, functions, and objects around the interface to the database; that's what SQLAlchemy is ideally suited to do. It's a bit of overkill for data analysis, but I'm going to use it in a very simple way to get some simplicity in my code with pandas.

And that brings me to pandas. Pandas is my favorite way to deal with data whenever possible, because it minimizes a lot of code and minimizes jumping out of the paradigm I'm working in; I don't have to think, now I'm writing SQL here, now I'm jumping over there. I can do a lot more with less work, and that's key to me: I want to get to solving problems, not writing loads and loads of code that isn't necessarily buying me much. Pandas is my first go-to whenever I'm dealing with data, and I only back into other approaches when I must.
All right, so integration with pandas simplifies things, and that's going to be the main thrust of this presentation. If you've stayed with me this long, the benefit you're going to get is that I'm showing you the easiest way to use Postgres for your data analysis. There are a lot of blogs and videos out there, and honestly, to me they just do too much work to get to what I want, which is: get me the data, throw it into tables, and let me start querying. Many of them loop around writing SQL INSERT statements, inserting from here and putting it there, and as far as I can see, in most cases that's not necessary. I'm going to show you some situations like that.

There's also something else I want to talk about: whenever you're using a relational database management system, it's really good to have a graphical interface into the database. You're writing code in Python, in a notebook or Spyder or some IDE, and you can't see what's really in the database, but you want to. You can use a command prompt, but I'm not a fan of typing at prompts; I want to see what's there. I've been spoiled by SQL Server Management Studio, so I want something like that; even SQLite has a user interface for looking at these things. What about Postgres? Here I am in pgAdmin, and we've got the database I mentioned, the dvdrental sample. If you install dvdrental, you can see all these different tables. Tables typically live under what's called a schema, and under the schema we can see our tables. That's a little strange coming from SQL Server, where the generic default schema is dbo; in Postgres the default is public. So we have our tables, and I can even right-click and say I want to generate a script, show me the CREATE script for this table, or View Data, show me the first 100 rows, and you can see I get the rows, et cetera; I can see what's down there. It's just a nice visual window into the environment. If Python says a table doesn't exist, I can go in and look. It's like the weather app on my phone saying it's raining when I look outside and it's a sunny day: the app is not the best source of truth. It's better to look through something like pgAdmin than to trust only what Python is telling you.

Okay, so we've seen how to get Postgres and we've seen pgAdmin, so let's jump in and get to our riveting demo on how to actually use this. Wherever you've downloaded and extracted the zip file for this video, go into that folder, and here's a little trick: type cmd in the folder's address bar and press Enter, and it opens a command prompt for you right there. It's weird to think "DOS prompt" these days, but anyway, a Windows prompt. From there I'm going to type jupyter lab, because JupyterLab is the newer notebook environment; it gives you a few more features, sort of a 2.0 of Jupyter, and it's really cool. Once I'm in, I open the folder, and you should see the Python Postgres notebook; open that up, and you can close the launcher.

So let's start walking through what we have here. The first thing to note is my disclaimer: if you use any of this code, make sure it fits your needs and meets your requirements. I don't want you to use something and say, I didn't know it did this; always read and evaluate any code you're going to use. Next, my recommendation (I have a video on it) is Anaconda. Anaconda is a data science, and to some extent data engineering, distribution for Python, and I highly recommend you get it.
It gives you JupyterLab, a number of IDEs, and just about every library you could ever want for data analysis, including the libraries we're going to use here. If you don't have it, you can use pip, the package manager for Python, to install what we need: a driver package called psycopg2 (I'm not really sure of the best way to pronounce that library name, but that's what you'll install). Or you can do a conda install if you have Anaconda, since conda is the package manager that comes with it. I have it installed already, so I'm not going to walk through installation again.

What I want to do now is import psycopg2 and then connect. I'm going to show you something a little unusual: I'm going to connect without specifying a database, because I'm going to use Python code to create my database, just to demonstrate that we can actually do that. So I run this cell, then the cell below it, and I want to call attention to this pg_cursor. Let me review the pattern first, because it's very consistent when you're using databases from Python, even SQLite, and most interfaces really: first you establish a connection to the database, typically assigned to a variable called con or something like that; that's the connection object. Then you usually also create a segue, a way to send statements to the database, and that's a cursor. Don't confuse this with a SQL Server cursor, which you think of as a way to step through rows of a table or query; here the cursor is more like the channel. Think of the connection as getting you to the database, while the statements actually travel through the cursor; in many cases, we pass statements through the cursor. I hope that's helpful; it's a little confusing, but we create the cursor.

All right, now that we have the cursor, I'm going to do something strange down here: I'm going to actually drop and recreate the database, to demonstrate that I can do everything within Python. However, I was running into errors, so you benefit from my failures (I put a link here you can reference): Postgres wouldn't let me make the changes I wanted because it needed a different isolation level, so what I need to do is turn on something called autocommit, which brings in yet another import. With that set, I run pg_cursor.execute with DROP DATABASE IF EXISTS, dropping the database in case it's out there, and then recreate it. This is connecting to my Postgres instance, so let me go over it a bit. The host is localhost, meaning my local default instance; if you've installed Postgres once, with one server, that's the default name and you don't have to do anything more. The user postgres is the default admin for the install, and typically there's a password to go with it; I used the highly secure password "brian". It is not good practice, in notebooks or anywhere else, to put passwords in code like this; this is a demo, do not do this at home. Postgres, unfortunately, is by default not a Windows-integrated environment, as I mentioned, so it doesn't support Windows authentication without some additional gyrations (it can be done, though). This is another really good reason why, if you're worried at all about your data, you're better off using a PaaS service in one of the clouds; they're pretty cheap, and then you can use security features like Azure Key Vault, or something similar in AWS, so your password is never displayed and you retrieve it securely.
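Putting those pieces together, here's a minimal sketch of that connect-and-create-database step. The host, user, password, and database name are the demo's placeholders, and psycopg2's autocommit must be on because CREATE DATABASE cannot run inside a transaction. Since this only works against a live server, the live calls sit behind a flag so the sketch can be read and adapted safely.

```python
# Sketch of the demo's connect-and-create-database step.
# Connection details are the demo's placeholders; adjust for your server,
# and load the password from the environment rather than hard-coding it.
RUN_AGAINST_SERVER = False  # flip to True only with a live local Postgres

DB_NAME = "aw"
DROP_SQL = f"DROP DATABASE IF EXISTS {DB_NAME}"
CREATE_SQL = f"CREATE DATABASE {DB_NAME}"

if RUN_AGAINST_SERVER:
    import psycopg2

    # No dbname here: we connect to the instance itself so we can create one.
    pg_con = psycopg2.connect(host="localhost", user="postgres",
                              password="brian")
    # CREATE/DROP DATABASE cannot run inside a transaction block,
    # so autocommit must be on before executing them.
    pg_con.autocommit = True
    pg_cursor = pg_con.cursor()
    pg_cursor.execute(DROP_SQL)
    pg_cursor.execute(CREATE_SQL)
    pg_cursor.close()
    pg_con.close()
```

Setting the connection's autocommit attribute is the modern spelling of the isolation-level change mentioned above.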
There is also pgcrypto, I think it's called, which can encrypt passwords and let you store them somewhere and bring them in; I haven't gotten into that yet, but just be aware that hard-coding them is not good practice. Okay, let's continue. I run the statement, and ideally it creates my database for me. I have learned, by the way, that once you start querying and doing things with Postgres databases, it doesn't want to let them go: it sees attachments, any kind of attachment, and even a window left open in pgAdmin looking at something in your database will stop you from dropping it. Now, I'm more or less done with my pg connection object for the time being, and I'll explain why in a minute. First I want to connect directly to the new database I've created; that's the biggest reason. Notice that before, I didn't specify a database because I was about to create it, so I connected to the instance; down here, I do specify a database, which I'm naming aw as an abbreviation of AdventureWorks, because my data is really extracts from Microsoft's demo database AdventureWorks. Let me run that.

As I've mentioned many times in my videos, I like to take the laziest way possible, because the laziest is also the most efficient: why waste time, why write more code than you have to? It's just more stuff to maintain. As a result, I've found that pandas is, generally speaking, the best way to handle data in 99 percent of situations, so I'm going to use it. I import pandas with the standard alias pd, and then, using that alias, I use read_csv to bring my data in from CSV files. I could bring data in from anywhere; I could use ODBC and go after SQL Server (I have a video demonstrating that). But here I use read_csv without an index column, and then display a couple of rows: head(2) just says give me the first two rows to look at, and you can see it's basic customer information.
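As a runnable illustration of that load step, here's read_csv on a small inline CSV. The real demo reads customer extracts from files on disk; the column names below are invented for the sketch.

```python
import io

import pandas as pd

# Stand-in for the demo's customer CSV; with a real file you would call
# pd.read_csv("customer.csv", index_col=None) instead.
csv_text = """customer_id,first_name,last_name
1,Alice,Anderson
2,Bob,Baker
3,Carol,Chen
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=None)
print(df.head(2))  # first two rows, just to eyeball the data
```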
two rows to look at and you can see you know it's basic customer information now here's the thing i this is actually copied from online you can see this link when i try to use pandas to sequel which works beautifully with sqlite i can take any data frame and just say save it into sqlite and it does it with no problem takes a data frame and turns it into a table columns types all forming no work at all does not work when i use the cycop pg2 driver instead i have to go through that sql alchemy and if you follow this link here it actually tells you that pandas sql is designed to support sql alchemy and uses that api to support whatever databases sql alchemy supports so it should work with mysql and sql server etc as well as long as you use this so i'm going to if you don't have it again you should have this if you've got anaconda but if you don't you can do pip installer conda install as shown here now i don't know if this is as visible as i like let me push that up a bit all right so now we're going to do is we're going to say from sql alchemy we're going to import create engine not a huge fan of the name because as far as i can tell this is no different than just get connection all right but they call it create engine and slightly different format but what i'm doing is i'm passing in the driver okay postgresql the cycop g2 the path now the path is interesting this is the user id that's the password and then i'm going to say at and this is going to be the name of the server and then here's the database kind of weird kind of looks like a file path but anyway that's how you get to it and you're going to return that result that pointer to this variable object called engine could be anything you want i'm calling an engine mostly because the examples you see everywhere call it engine and after all it is using a create engine function so here we are now we're getting to the exciting part at this point we have created the df right data frame and we want to take this and we 
We want to persist it, to save it to our Postgres database as the table customer. We use our engine, our database connection object; that's how pandas finds the database and connects in. And we say here: if the table exists, just replace it, blow it away and recreate it. You could instead say append, which is good when the table already exists and your real goal is just to add more rows: with if_exists set to append, it takes the data frame and adds new rows. The other thing you can do is say, this shouldn't exist, I want to fail; with fail it will just abort and say, no, we won't do this. I'm not keeping an index in this particular data frame, so I say no index, and I run it. Not much to see, but it should have written to the database, so let me go to the Postgres side of things, refresh, and hopefully it worked. And there we are, we have a new database, yay! We can look under our schema, and when we go under it we find there's a table, and we can see the columns. Again, you can see how useful it is to be able to check: did it do it, is it there? We may get issues, so I love to bounce back and forth between these tools. Let me go back to my demo. Now I'm going to do the same thing, simplified: I create a product table from the dim_product data, same kind of thing, nothing up my sleeve, nothing new, and now I should have more tables here. Refresh, and there they are, yay; you can see them here. All right, let's go back. We have our tables, and that's great, but there is one implicit assumption in the method I've used which would be easy to skip over, and it creates a mess: what happens if there's too much data for the data frame?
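Here's that to_sql step in a form you can run without a Postgres server: an in-memory SQLite connection stands in for the engine (pandas accepts either), but the call and the if_exists choices ('replace', 'append', 'fail') are exactly what you'd use with the Postgres engine.

```python
import sqlite3

import pandas as pd

# Tiny stand-in for the customer extract.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Carol"],
})

# In-memory SQLite stands in for the Postgres engine here;
# with Postgres you would pass the SQLAlchemy engine instead.
con = sqlite3.connect(":memory:")

# 'replace' drops and recreates the table; 'append' adds rows;
# 'fail' aborts if the table already exists.
df.to_sql("customer", con, if_exists="replace", index=False)

back = pd.read_sql("SELECT COUNT(*) AS n FROM customer", con)
print(back["n"][0])  # 3
```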
Imagine you've got a SQL Server table, which is not unheard of, or a Postgres table for that matter, with a billion rows, and it's going to take a terabyte of memory to fit. A little bit of a problem: data frames reside in memory, all of it, so we're going to have a problem if it's a really big table. Most tables will be fine, but when you get into really large ones, it's an issue. There's a technique to get around this, and this is one of those cases where I saw a lot of complicated solutions and wasn't happy, because a lot of them involved writing some sort of query over here and then looping and issuing INSERT statements, which is tedious, and INSERT statements in SQL can be error-prone; it's very easy to be a character off and mess something up. I don't want to do something difficult. Instead I'm going to use a really cool technique, and here are the links where I borrowed the idea; kudos to Itamar Turner-Trauring (I can't really pronounce it), whose blog explains this really well. I modified it to demonstrate it to you, but take a look; it's really good.

What I'm going to use instead is a feature of read_csv (and, for that matter, most of the read functions, including reading from a database) that you may never have used, called chunking. Chunking says: okay, get the data, but don't get all of it; get x rows at a time. Here we set chunksize to 2,000 rows. There might be a million rows, ten million rows, but it only gives me 2,000 at a time; it's very similar to a transaction block, if you will. The nice thing is that, because I'm doing a for loop, it returns each 2,000-row chunk into my aptly named variable chunk. And what is chunk? It's a data frame, just a data frame with only 2,000 rows, and I can now call chunk.to_sql to the sales table through the engine. Of course I don't want replace, or it would keep deleting each time and I'd end up with only the last chunk inserted; I say append. The nice thing about append is that if the table doesn't exist yet, that's fine, it creates it and just starts inserting rows, so you can use append even when there is no table. This will take potentially a minute, and here's a little trick I want to insert: sometimes when I'm waiting on something in Python, I add a print at the end so I know when it's finished. I know it hasn't finished yet because I just ran the cell; when the print says ok, it's done. And there it is. I probably should do something fancier, like a prompt box, but there you are, my little cheat.

Now that that's done, let me find out how many rows are in the sales table I created; I go against Postgres and get a count. It's a fairly large table, about 60,398 rows. I actually could have fit that entire table in a data frame, but it's good to know the technique. Now we can of course go over and refresh again, and under tables we'll see that we have a sales table; by the way, there are quite a few columns in it, so a lot going on there. Let me go back.

All right, we're getting there, but we're not quite done; I have more riveting things to show you. Another thing I want to show you is that for system-level metadata, a SQL standard was implemented some time back (I don't know exactly when, to be honest) that added a schema called information_schema. Since this is a SQL standard, it's supposed to be a set of tables and views that describe the objects within the database. What do I mean? In other words, it's the database catalog, and that concept has existed since DB2 and SQL/DS, the original SQL databases: the idea is that you have catalogs.
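Before moving on to the metadata, here's the chunked-load pattern above in runnable form. Again SQLite stands in for the Postgres engine so the sketch runs anywhere, and the chunk size is shrunk to 2 so several chunks go by (the demo used 2,000); the CSV contents are invented.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for a large sales CSV; in the demo this is a file on disk.
csv_text = "order_id,amount\n" + "\n".join(
    f"{i},{i * 10}" for i in range(1, 8)   # 7 rows
)

con = sqlite3.connect(":memory:")  # Postgres engine in the real demo

# chunksize makes read_csv yield DataFrames of at most 2 rows each,
# so memory use stays bounded no matter how big the file is.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # 'append' creates the table on the first chunk, then adds rows.
    chunk.to_sql("sales", con, if_exists="append", index=False)
    total += len(chunk)

print("ok")  # the demo's "I'm finished" trick
rows = pd.read_sql("SELECT COUNT(*) AS n FROM sales", con)["n"][0]
```

Only one chunk is ever in memory at a time, which is the whole point of the technique.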
You can query them: there are tables within your database that tell you about the database's tables, and that's called metadata. We want to be able to query metadata because I can't be expected to remember what the column names are, let alone the table names; this is how I go back and ask, what are these things called, how are they stored? It's a very useful thing, and I'm going to use pandas' read_sql to query it. Fortunately, Postgres follows the standard pretty well: information_schema.columns. SQL Server follows it as well; we'll see that SQLite has a catalog too, but not according to the standard. So there's a view here that lists our columns, and we filter on table_name; there's also a view for tables, and others besides, so it's really handy. I pass in the engine and display just a few rows, because I only want to give you the idea, and you can see I get back the column names in that table.

Now I'm going to create another connection, but this time a SQLAlchemy connection called pg_connection; I need it because I need to call connect on the engine. It's kind of like a cursor, and the reason I need it is that I want to add an index to the sales table, on order date. It's a really critical concept in relational databases that they support indexes, because indexes let you go directly to something. It's like the index of a book: you look up the thing you want, it says that's on page 58, and you jump to that page, versus reading through every page of the book, which is what you would have to do without it. If you're dealing with a big CSV file and you ask for a certain date, it's going to read through every row to find what you're looking for, and in a large table that can be a big problem, called a table scan.
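As for that metadata query, it looks roughly like this. The table name sales matches the demo; since it needs a live Postgres connection, the sketch keeps the Postgres query as a string and, to have something runnable, asks SQLite's (non-standard) catalog the same kind of question.

```python
import sqlite3

import pandas as pd

# The standard-catalog query the demo runs against Postgres:
pg_query = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'sales'
"""
# With a live server: pd.read_sql(pg_query, engine)

# SQLite has a catalog too, just not the standard one; same idea:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
cols = pd.read_sql("PRAGMA table_info(sales)", con)
print(list(cols["name"]))
```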
table scan, and we avoid a table scan by creating indexes, which let the database get to rows quickly. So I'm going to create an index, and I use IF NOT EXISTS just in case I rerun this code, so it doesn't throw an error; the index is on order date. I'll run that now. Note that I can't use pandas' read_sql here; I was getting errors, because read_sql expects a DataFrame back and CREATE INDEX doesn't return one. Okay, with the index done, I want to start building toward a somewhat complex query to show you the power here, and a really good idea I want to give you. Even if you took nothing else away, if all you learned was this, it was worth your time: the idea of using a date dimension. What's a date dimension? It's a very simple idea that solves a lot of problems when writing queries, specifically queries related to dates, and the need has been around since the beginning of queries. Somebody says: give me a query for the last two years; I want a summary of sales by year and by quarter; show me the days of the week that have had the highest sales; things like that. What typically happens (and I've seen this in many places) is that you take the date and parse it apart: pull out the year, summarize by that, create another column for the month, then build some sort of temporary table that says if the month is in this range it's quarter one, that range is quarter two. It's a lot of extra code and a lot of extra work. There's a very simple answer: create a lookup table, a reference table, a dimension, called the date dimension. There are lots of places online where you can get free date dimensions, or scripts that will generate one for you.
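Since pd.read_sql expects rows back, DDL like the CREATE INDEX described above goes through a plain connection instead. A sketch, using an in-memory SQLite engine as a stand-in for the Postgres one (the table and index names are made up):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for the Postgres engine

# engine.begin() opens a connection and commits on exit.
with engine.begin() as con:
    con.execute(text(
        "CREATE TABLE IF NOT EXISTS sales (order_date TEXT, sales_amount REAL)"
    ))
    # IF NOT EXISTS makes the cell safe to rerun without an error.
    con.execute(text(
        "CREATE INDEX IF NOT EXISTS ix_sales_order_date ON sales (order_date)"
    ))
```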
So let's walk through this, and I think you'll get the idea. I'll query the date dimension first to show you what columns it has. The date key is a full date (January 1st, 2000, say), and joining to it on that key gives me the month number of the year. It also has the quarter, which is not a natural part of a date, and it could carry semester, fiscal year and fiscal quarter versus calendar, and so on. So again, one table solves a lot of problems. Let's see how this works; my real goal is to demonstrate Postgres, but I want to show you a little more here. I run a query against my sales table and join to the date dimension, dim_date, on the order date key and the date key. Let me roll back for a minute, because you may have noticed something: why do I have quotes around my column names? Kind of weird. The quotes are similar to the brackets SQL Server puts around things: when a name has spaces or anything the parser won't understand, you put something around it. Postgres is case sensitive with quoted identifiers, so if you use uppercase in a column name, you're going to have to put quotes around it every time you reference it. I'm lazy, so I recommend using lowercase; you can use underscores if you want to break up words. Camel case, which I actually saw used everywhere on a contract I worked, honestly gets tedious: you always have to make sure you have the quotes, and you have to remember the exact casing, because if the column is capital Y and somebody types a lowercase y, it won't give you the answer until you get it exactly right. If you make a standard of sticking with lowercase, it's going to be a lot easier; at least that's my preference. But I was
even lazier, because I should have made that change, but I wanted to show you this since it's kind of key: the CSV file came in with these column headings at the top and I used them as-is, and that's what gave me this experience. I could use SQL to put a view over it, which is not a bad idea: bring the data in even if the names are horrible, then add a view on top. It's good practice anyway to use views instead of base tables. Back to our story: on the date key, I ask for orders where the date is between two values. I really want to highlight that I'm filtering to only a subset of dates; this is where the index is going to pay off. Imagine I have 20 years of data here. This was a key thing at a contract I worked: they had something like 15 years of sales data, they usually only needed slices of it, and they didn't have good indexes. Ideally, in a case like that, you want what's called a clustered index (more on that later), but basically you want indexes because they speed up the selection of data. Now dim_date is helping me, because I can group by calendar year and quarter; notice I'm pulling those out with no problem. I just joined to the date table and I'm good to go: I don't have to parse dates, I don't have to do that laborious work. Don't tell your boss; pretend you really worked all day on this, sweat on your brow, and you get the credit, maybe even a promotion. But really, many people still don't realize there's such a simple solution. So that's a neat thing, and what we're doing is creating a query that gives me a summary by calendar year and calendar quarter, using dim_date. Now you might say: isn't there a lot of overhead in joining on the date column? For one thing, dim_date should be indexed on its primary key, typically a sorted, clustered index, so it's fast. And think about it: how many dates are you really going to have?
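A minimal end-to-end sketch of the date-dimension idea, with a tiny made-up dim_date and sales so it runs anywhere (the real tables are far larger, and in the video they live in Postgres):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # stand-in for the Postgres engine
con.executescript("""
CREATE TABLE dim_date (date_key TEXT PRIMARY KEY,
                       calendar_year INT, calendar_quarter INT);
INSERT INTO dim_date VALUES ('2020-01-15', 2020, 1), ('2020-04-10', 2020, 2);
CREATE TABLE sales (order_date TEXT, sales_amount REAL);
INSERT INTO sales VALUES ('2020-01-15', 100.0), ('2020-01-15', 50.0),
                         ('2020-04-10', 25.0);
""")

# The date dimension supplies year and quarter directly -- no parsing.
summary = pd.read_sql("""
    SELECT d.calendar_year, d.calendar_quarter,
           ROUND(SUM(s.sales_amount), 2) AS total_sales
    FROM sales s
    JOIN dim_date d ON s.order_date = d.date_key
    WHERE s.order_date BETWEEN '2020-01-01' AND '2020-12-31'
    GROUP BY d.calendar_year, d.calendar_quarter
    ORDER BY d.calendar_quarter
""", con)
```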
For a year with a row for every day, that's 365 rows, so even 10 years is only 10 times 365. Not a problem for an enterprise database; it wouldn't even be a problem for SQLite, which is client-based. So let's run this, and here we go: a nice summary. I put the little round function in because the results kept coming up in weird exponential notation, if I remember right; round gets rid of that, or of extra decimal places. And we have a nice little summary suitable for framing. That's one thing I wanted to show you. Now, to give a little bit more, I want to point out that you should leverage the features you have, something most demos don't do (some do): Python is really good for usability here. Here's an example. You've noticed I forget column names a lot, and I want a quick and easy way to get a list of columns for any table. Rather than pasting this query everywhere and changing the table name each time, I'll create a little function. I bring pandas in just to make sure the function I'll be using is available (pd), and I define a function called get_columns that takes the table name as a parameter. I attach format, which substitutes my symbolic parameter into the query, and voila: I have a function that gives me column names back any time I want. My point is: leverage Python. I'm in a notebook, but that doesn't mean I can't create classes and functions, or even keep something in an external script and just run it and use it; I've done that in other presentations. It's handy, because now I can use it whenever I want. Okay, final stage of our riveting demo: I'm trying to illustrate another stage in how you might use these things. Postgres, as I mentioned, is an enterprise database along the lines of, say, Oracle, and that's great.
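The get_columns helper described above could be sketched like this. The query targets information_schema.columns as in the video; splicing the table name in with format is fine for interactive use but shouldn't be fed untrusted input. The function isn't invoked here, since it needs a live Postgres engine.

```python
import pandas as pd

# Catalog query with a symbolic {table} parameter, as in the video.
COLUMNS_SQL = (
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_name = '{table}' "
    "ORDER BY ordinal_position"
)

def get_columns(table_name, engine):
    """Return one row per column of table_name as a DataFrame."""
    return pd.read_sql(COLUMNS_SQL.format(table=table_name), engine)
```

Called as get_columns("sales", engine) against a Postgres engine, it returns the column names and types any time you forget them.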
One use case in data analysis is to use it as your repository: a place to get clean, good data, collaborate with many people on a team, free, and secure; you put it behind your firewall, lock it down, give people roles, and so on. Great. But there's also the use case where you say: I want to do my own thing. I want to be able to drop tables and play around. In fact, if you're me, you're always worried you'll drop something people need, and then people will not like you; DBAs have to be very careful, because they have a lot of power. So that's one good use case, but another, if you don't want the work of maintaining Postgres, is that sometimes it's better to use SQLite, and as I mentioned, I have a whole set of videos on that. It's better when your scalability needs are smaller, though still substantial; you can really go up to terabytes with SQLite. The biggest thing for me is sharing: if I want to create an extract, SQLite is a file, so I can just email it to somebody. Not good for security, but good for things like that. It's also good because I can play around and drop things, and it even has an in-memory capability: if I don't want a table on disk, I can push it all into memory and work there, for performance. So Postgres can be seen as a platform to operate from, and SQLite as a very useful tool for working with the data in real time, if you will, where you maybe don't even keep what you're doing around. All right: we import the sqlite3 module, call sqlite3.connect, and assign the returned object to sl_con. Here's the database name: aw.db. SQLite is real riveting, really complex.
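Opening the SQLite side is one line; :memory: is used below instead of the video's aw.db so the sketch leaves no file behind (a file path behaves the same way and is created if missing):

```python
import sqlite3

# Connect; a file path such as "aw.db" would be created automatically
# if it did not already exist.
sl_con = sqlite3.connect(":memory:")

# Some operations need an explicit cursor as the entry point.
sl_cursor = sl_con.cursor()

print(sl_con)  # printing just confirms we hold a live Connection object
```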
Now notice that with Postgres and other enterprise databases, you never talk about files; that's hidden under the covers. SQLite is more like Microsoft Access: it's got a file, and you open it and use what's in it. If the file doesn't exist, connect creates it; if it does exist, it opens it and gives you what's in there. Let's run this; in this case, I don't think the file exists yet. If you print the connection, it just says: yes, I'm a connection. Now I'll create a cursor, and this cursor is sort of an entry point, because certain things I want to do will require one. And again, I keep forgetting what's in sales, so there's my handy-dandy little function to look up the columns, and based on that, out of all of them, all I really wanted was sales amount, so I probably didn't need the list. But interestingly, what I want to do now is create a query where I join to customer. The idea is to extract data from my Postgres database (and admittedly, with ODBC I could be doing this from SQL Server or something) and persist it into SQLite. My goal is to show you how to share the services and blend them in a way that optimizes your productivity; that sounded pretty impressive, didn't it? Anyway, we're going to join sales to customer on their keys: customer key is the column name in both sales and customer, and that's what relates my sales to the customer who bought the product. I'm using aliases because I'm too lazy to keep typing out 'customer'; now I can just say c. The alias also identifies where each column comes from, and I like to use aliases even when column names are unique, because that can change and I don't like code breaking. I'm also demonstrating how you could get around
the Postgres problem of always having to quote, and the way to do that is simply to use the AS feature. In this case, I'm merely preparing the query so I can persist the result to SQLite, where quoting is not as much of an issue, but the point is that if I did something like this and created a view on top of my base Postgres table, the view wouldn't require the quotes anymore; I could just use the created names, and we'll see that later. And in Postgres (I had to look this up, I have to admit): I know how I'd calculate a person's actual age from a birth date column in SQL Server, but how do I calculate age in Postgres? Postgres has a date_part function, and this is something I stole off the internet (I don't remember where; I apologize for that), but I can do this subtraction and get the person's age, which is pretty cool. So I've got this nice little table, and maybe I want to slice and dice it. I'm not summarizing it, because I want it to be something I can query, and again, use your imagination: I could join in a bunch of other tables, and once I'm in SQLite I can join it to other things and really step up my analysis. Mind you, SQLite also supports indexes and such, so I can get better performance and speed and all that. I'm creating a variable to hold the query before I use it, because that makes things a little more readable. I'll demonstrate by running the query but displaying only a couple of rows; there it is. This time I run the same thing without the head call: I call read_sql and return the results to a DataFrame. Look at this carefully: I'm running the query with the engine, and the engine points to Postgres.
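The extract query might look roughly like this. The quoted column names (CustomerKey, BirthDate, and so on) are assumptions in the AdventureWorks style, and the age formula uses Postgres's AGE with DATE_PART, one common idiom; the exact expression in the video may differ. The AS aliases give lowercase output names, so the SQLite copy needs no quoting.

```python
# Postgres-flavored SQL held in a variable for readability, as in the
# video. Not executed here, since it needs a live Postgres connection.
EXTRACT_SQL = """
SELECT s."SalesAmount"                        AS sales_amount,
       c."EnglishEducation"                   AS education,
       DATE_PART('year', AGE(c."BirthDate"))  AS age
FROM sales AS s
JOIN customer AS c
  ON c."CustomerKey" = s."CustomerKey";
"""
```

Running df = pd.read_sql(EXTRACT_SQL, engine) against the Postgres engine lands the joined result in a DataFrame.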
So I'm running this query against my Postgres tables and creating a DataFrame; that's what we need to understand about this statement. Now let me back up a minute: I've got a DataFrame holding data from Postgres. What I want to do next is take that data and save it as a table in SQLite; I'm saying grab it from here, push it there. I love this approach because of how easy it is: one statement, that's all it took, and I have a DataFrame. That's the key to pandas: always get a DataFrame, because pandas lets you do almost anything with a DataFrame once you have it, which I love. So with my DataFrame, I say to_sql, creating a table called sales_data (as opposed to sales), and you can tell it's going to SQLite because I'm using sl_con, the SQLite connection; no index, and if the table already exists, replace it, so it deletes and recreates. Small point: I have not invoked SQLAlchemy here, because SQLite is so easy to use with Python and pandas; it's ubiquitous, and it's just great. That's another reason you might choose to bring things out of Postgres sometimes and use them locally: SQLite integrates beautifully with Python. It also, by the way, offloads compute to your machine, which is useful when you don't want to tax the back-end server. If you have a really large table in Postgres, a good idea is to filter out the subset you need, bring it into SQLite, and bang away at it there; the DBAs will love you, the company will love you, for not burdening the database with heavy queries. Now I'll demo a little bit of SQLite; I have a lot of links to post in the description, but take a look at my other videos, because one is all about Python and SQLite. As I mentioned, SQLite does not implement the data catalog in the information_schema style, but it has something instead.
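The Postgres-to-SQLite hop described above reduces to two statements; sketched here with a hand-made DataFrame standing in for the one read out of Postgres:

```python
import sqlite3
import pandas as pd

# Stand-in for the DataFrame that pd.read_sql pulled out of Postgres.
df = pd.DataFrame({"education": ["Bachelors", "High School"],
                   "sales_amount": [100.0, 250.0]})

sl_con = sqlite3.connect(":memory:")  # aw.db in the video

# One statement persists the whole frame: "replace" drops and recreates
# the table, index=False skips the DataFrame's row numbers.
df.to_sql("sales_data", sl_con, if_exists="replace", index=False)

back = pd.read_sql("SELECT * FROM sales_data", sl_con)
```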
SQLite has a command called PRAGMA, and you can use PRAGMA table_info to get information, in this case about sales_data. I'll run that, and it tells me: here's my new table, here are the columns and the SQLite data types. All right, this is probably a good point to repeat that I don't like to work with databases without some window into them, and as you can see here (I have a whole video about this, I think), there's SQLite Studio. I'll open it, and basically all I have to do is add a database: point it at your database file, and it shows it; I think I opened this earlier and it's still pointing at it. You can query, you can slice and dice; it has functionality similar to pgAdmin, a little lighter, but really useful. I honestly love using SQLite Studio with SQLite; it's just great to have. Now I'll query the data I've transported all the way from Postgres into my SQLite database, and there it is: selecting from sales_data, asking for just four rows, and using round again because the numbers don't display very well otherwise. Next, a little aggregation, because we love aggregation. Aggregating means summarizing in some way: we'll summarize the sales amount by summing it into total sales, grouped by education, the education level. Again, in a very small way, this could be something we might use as a model feature: say we want to predict what product a person is likely to buy; education level may affect that (salary is probably always going to be a big one too), and people with more education may favor certain features or products over others.
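PRAGMA table_info is SQLite's stand-in for the information_schema catalog; it returns ordinary rows, so pd.read_sql can capture it (sketched against a freshly created table):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_data (education TEXT, sales_amount REAL)")

# One row per column: cid, name, declared type, NOT NULL flag,
# default value, and primary-key position.
info = pd.read_sql("PRAGMA table_info('sales_data')", con)
```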
There may also be a correlation between features like salary and education, which you have to be careful about in model training, but this could be a feature, and so I'm exploring: EDA, exploratory data analysis. So I'm selecting education, and here's a quirk in SQLite: when I wanted to format the summed amount with commas, I had to use the printf function, which looks very C-like, but hey, it does the job and gives you a lot of flexibility, so I can't complain. Then I run this (again, it's SQLite, because it uses sl_con), and I set index_col equal to education, which gets rid of that annoying sequence number on the left and formats things nicely. Now we can see education and the summary: no decimals at all, just integers, so all in all it looks pretty good. Please extrapolate to the various ways you could use this; use your imagination, the sky's the limit, because there's a lot of power here. I'm actually building a series of videos, concentric circles if you will, like a tree, to show you all the different ways you can use these tools to jump-start and really accelerate your productivity and make life easier; that's the goal. Now, one of the things you want to do when you're done is close your connections. Here's my Postgres side: SQLAlchemy uses the engine variable, and we dispose of it to get rid of it; then we close the SQLite connection. We want to do that, because otherwise it leaves connections open, and particularly with something like Postgres, lots of open connections add a lot of overhead; it's a very bad idea to leave things open. Typically with databases, you want to get what you want and get out. I didn't show it here, but I have a function, demonstrated in another video, that goes against a SQL Server, grabs what it wants, throws it into a DataFrame, and closes the connection.
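The education roll-up might be sketched as follows; printf's comma flag does the thousands formatting, and index_col promotes education to the DataFrame index (the table contents here are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_data (education TEXT, sales_amount REAL);
INSERT INTO sales_data VALUES
  ('Bachelors', 1000.5), ('Bachelors', 2000.5), ('High School', 500.0);
""")

# printf('%,d', ...) is SQLite's C-style formatter: %d truncates to an
# integer and the comma flag inserts thousands separators.
summary = pd.read_sql("""
    SELECT education,
           printf('%,d', SUM(sales_amount)) AS total_sales
    FROM sales_data
    GROUP BY education
""", con, index_col="education")
```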
Then I just work with the DataFrame, which is a more efficient way to work, though if it's a billion rows, that's not going to work. So: we talked about what Postgres is, and we learned that it's an awesome open source enterprise database with a very liberal licensing model that we can take and do what we want with; it's great and very powerful. If you have any doubts that it can compete in the big leagues, I hope I've dispelled them; take a look and explore it. As I mentioned, I did some consulting on it, and I went in prejudiced against Postgres, thinking 'this is nice, but go with SQL Server,' and admittedly SQL Server does have features I like a lot, especially around BI, but I had to admit Postgres was impressive in performance and in price, so you've got to look at it. But why Postgres with Python? Why am I talking about them together? Because they work really well together: they're both free, both open source, they share the same philosophy, and there just seems to be an amazingly natural synergy between these two open source products. As I've shown, Python has this great tendency to keep iterating on things until they simplify to the point where what used to take a page becomes a single statement, and pandas with Postgres, once you get through the layers (and realize, as I had to, that SQLAlchemy is a piece of it), is just a great synergy. As I mentioned, I try to use pandas everywhere as much as possible and only go to other things when pandas can't do it. So that was our riveting demo; I hope you liked it. And as a side demo, I hope I've shown enough of downloading and getting the tools, but also of making sure you have a visual GUI to look into the database, test your queries, and see what's going on, because it's kind of like the weather.
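The cleanup described earlier (dispose the SQLAlchemy engine, close the SQLite connection) is two calls; a sketch with throwaway connections:

```python
import sqlite3
from sqlalchemy import create_engine

engine = create_engine("sqlite://")   # stands in for the Postgres engine
sl_con = sqlite3.connect(":memory:")

# ... query, extract, analyze ...

# Get what you want and get out: release the engine's connection pool
# and close the raw SQLite connection so nothing is left open.
engine.dispose()
sl_con.close()
```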
Do you really want to be looking at your phone when you can just look out the window? That's what the GUI gives you: direct access to see into the database. So I want to thank you. I've gotten a lot of great comments; please add more, tell me what you like and what you're interested in, and share with your friends to get the word out. It feels great to know people are getting a lot of value out of this; I get a great deal of satisfaction from that, and that's why I keep going. So thank you, and have a great year.
Info
Channel: Bryan Cafferky
Views: 16,383
Keywords: PostgreSQL, Python, SQL, data analysis, programming, data science, analytics, tutorial, learning
Id: FQzzQnYERBc
Length: 54min 1sec (3241 seconds)
Published: Sun Jan 03 2021