SQL Databases with Pandas and Python - A Complete Guide

Video Statistics and Information

Captions
Python is the Swiss Army knife of programming languages, especially when it comes to working with data: there's an incredible number of packages and libraries written for working with, interacting with, and visualizing data. But what good are all these packages if you can't access the data? I have a few other videos where I've gone over different file formats you might want to store data in, and there are many situations where that's exactly how you want to keep your data. The truth is, though, that if you work for any organization, they probably have data stored in a relational database. Relational databases are great because they're a centralized location where data is stored efficiently, and any user can write a SQL query against the database and pull down just the information they're interested in. So in this video I'm going to go through everything you need to know about how to interact with relational databases using Python. If you enjoy this video, give it a like and subscribe to the channel; it really helps me out. Alright, let's go.

Before we get started, I want to point out that there are a lot of different relational databases to choose from, and most of the time you'll just be using whatever your organization already uses. Most of what we go over here will be the same for each of those databases, but we'll use MySQL as our example. I'm going to use a Docker image of a MySQL database to show these examples. Docker can be really helpful here: you might not want to set up an entire database, but you do want to test out how to interact with one from Python, and there are plenty of Docker images that, if you have Docker installed, let you spin up a database as if it were already set up on your machine. I'll be using the example database that MySQL provides in a Docker image: a fake database created by MySQL
for testing things out, with a bunch of information about fake employees at a company. This is the shell script I'll run on my local machine; it starts a MySQL database and exposes it on port 3306. Most databases have a specific port number you use to connect to them, and we'll see why that matters when we build our connection string. Now that I've run the script, docker ps shows that the database is running.

Maybe we want to look at this database without even using Python; that can be a good first step. There are a bunch of database management tools out there, but I like DBeaver: a free, open-source tool that works on most platforms and can connect to most types of databases. In DBeaver I set up my connection by providing the host (localhost, since it's running on my machine), the port it's exposed on (3306), the database name, and the username and password I configured in the Docker image. If you're working at a company, the database administrator would give you all of this information. Once connected, we can open up the database's tables: the database is called employees, and DBeaver even shows a nifty ER diagram of each table and how the tables relate to each other. We can also click on the database, create a new script, and write some SQL to query the tables. Let's write a query that joins the employees table to the salaries table on employee number, with a WHERE clause filtering on hire date. Now say we want to analyze this in Python: one way to get it there is to export the result from this tool to something
like a CSV file and then read that in with Python. I'll export to CSV, and now I have a CSV file with my query output. Finally we get to Python code: I have a JupyterLab instance running, and we'll read the file in with pandas. Import pandas as pd, then read the CSV (which has a really long name), and we can see we have the same results as the query we ran earlier.

That works, but it's not the ideal way to pull data from a database, because it involves an intermediate step: running a SQL query on the database and saving the result off as a CSV. What we'd really like is to access the database directly from Python. This part depends a little on the type of database you're connecting to; since ours is MySQL, we can use the MySQL Connector package to connect directly to it. Most of these connections work in a very similar way. First you create the connection with the MySQL connector's connect function, providing the pieces of the connection string: the username (root), the password (college, set when the container was created), the host (localhost), the database name (employees), and, for this example, SSL disabled. With the connection in hand, we can create a cursor. Once the cursor is established we can start pulling down data with our SQL query, but whenever you're done, always make sure your code closes both the cursor and the connection so they're no longer reserved on the database. We haven't run our query yet, though, so let's take our cursor, run the exact same query as before, and the first thing we're going to do is
create a query variable: a string that contains our SQL. Grab the same query we used before and paste it in. Now we can take the cursor and execute the query. You can think of the cursor as being pointed at the results of the query, but it only hands back one row at a time; to retrieve the data we need to iterate over the cursor and extract each row. I'll enumerate over the cursor and break out after the first iteration, just to show that we do indeed have the first result of the query.

Okay, let's put it all together: create the connector, create the cursor, execute our query on the cursor, append each row to a results list, then close the cursor and close the connection. After running this cell we have all our results in a list, and if I look at the length of the list it has 4,853 elements; going back to the CSV, it also had 4,853 rows. It's the exact same data, just pulled a different way. Now we can take these raw results and wrap them in a pandas DataFrame, and we have a DataFrame with the same data as the CSV. The only difference is that we don't have column names, so to make this readable I'll rename the columns to what I know they should be, and there we go.

You might be thinking this is a lot of manual bookkeeping just to pull data from a database, and it is: you have to keep track of the cursor and the connection and make sure you close everything. If you wanted fine-grained control in Python for speed, this would be the way to do it, but our main goal is to get the data out and run some analysis on it, so instead we'll read it with pandas directly.
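The connect, cursor, execute, fetch, close sequence described above follows the Python DB-API, so it can be sketched with the standard-library sqlite3 driver, which needs no running server. The tables and rows here are hypothetical stand-ins for the employees database; with MySQL you would create the connection with mysql.connector.connect(...) as in the video instead:

```python
import sqlite3
import pandas as pd

# DB-API pattern from the video, shown with sqlite3 so it runs without a
# server. With MySQL the connection would be created with something like:
#   conn = mysql.connector.connect(user="root", password="college",
#                                  host="localhost", database="employees",
#                                  ssl_disabled=True)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny hypothetical stand-ins for the employees/salaries tables.
cur.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
cur.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])
cur.executemany("INSERT INTO salaries VALUES (?, ?)", [(1, 60000), (2, 70000)])

# Execute a join query, then iterate over the cursor to collect each row.
query = """
    SELECT e.emp_no, e.first_name, s.salary
    FROM employees e
    JOIN salaries s ON e.emp_no = s.emp_no
    ORDER BY e.emp_no
"""
cur.execute(query)
results = [row for row in cur]  # the cursor yields one row at a time

# Always release database resources when you're done.
cur.close()
conn.close()

# Wrap the raw rows in a DataFrame and supply the column names ourselves.
df = pd.DataFrame(results, columns=["emp_no", "first_name", "salary"])
print(df.shape)  # (2, 3)
```

The same close-when-done discipline applies to any driver; wrapping the connection in a with-block or try/finally is a safer way to guarantee it.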
Pandas, just like it has read_csv and read_parquet, has a read_sql method. read_sql is actually a convenience wrapper around read_sql_table, which reads an entire SQL table, and read_sql_query, which runs a query against the database and pulls back the results; we'll just use read_sql since it keeps things simple. It takes the SQL statement, which we already wrote, and a connection, which we'll create using SQLAlchemy. Import pandas, import sqlalchemy, and from sqlalchemy import create_engine, which lets us create the connection to this MySQL database. This is the part that's a little specific to your type of database, but connection strings are written in a fairly standard way: mine uses mysql plus pymysql, the username and password (root and college), the host (localhost), and the database name (employees). SQLAlchemy has great documentation on these database URLs and how to write them for each kind of database, so reference that for whichever database you're connecting to. With the connection string ready, we create our engine with SQLAlchemy's create_engine, and now we can use pandas directly to load the query into a DataFrame: paste in the same query, call pandas read_sql with the query and the engine, and name the result df_read_sql. Once that's run, I have a DataFrame with all the columns and the same number of rows we got with the raw connection and with the CSV we exported from DBeaver.
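Here's a minimal sketch of that read_sql workflow. The URL for the video's database would look like mysql+pymysql://root:college@localhost/employees (assuming the pymysql driver is installed); to keep this example self-contained and server-free, it uses an in-memory SQLite engine and a small hypothetical table instead:

```python
import pandas as pd
from sqlalchemy import create_engine

# The video's MySQL URL would be (assuming pymysql is installed):
#   engine = create_engine("mysql+pymysql://root:college@localhost/employees")
# An in-memory SQLite engine keeps this sketch self-contained.
engine = create_engine("sqlite://")

# Seed a small hypothetical table to query against.
seed = pd.DataFrame({"emp_no": [1, 2], "salary": [60000, 70000]})
seed.to_sql("salaries", engine, index=False)

# read_sql runs the query and returns a DataFrame, column names included.
df = pd.read_sql("SELECT emp_no, salary FROM salaries", engine)
print(df)
```

Compared with the manual cursor loop, read_sql handles fetching, column naming, and type conversion in one call, which is why it's the usual choice for analysis work.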
Most of the time, unless you're the database administrator or you're writing a more advanced program, you're just going to be reading from the database, but pandas also allows you to write results to the database. Let's do some quick analysis and write the result back. We'll group by employee number, first name, and last name, take each employee's maximum salary, and reset the index, giving us a new DataFrame called max_salary with the maximum salary of every employee in our initial data set. Now we can use to_sql to write this DataFrame to the database: the first argument is the name of the table to write to (let's call it max_salary), and we also provide the engine, just as when we ran the query. Running this actually creates the table in our database; jumping back to DBeaver, the ER diagram now shows a new max_salary table created by the to_sql call.

In most companies it's important to keep the database correct, so be very careful when writing to it. A few things to keep in mind: if we run this same cell a second time, it returns an error saying the table already exists, because by default to_sql expects to write to a brand-new table. There are ways around this if you know you're writing to an existing table: the if_exists parameter. Its default, fail, returns the error we just saw; replace drops the entire table and writes the new data; and append takes what's in the database already and adds the DataFrame you're writing onto that same table.
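That groupby-then-to_sql flow, including the error on the second write, can be sketched like this, with an SQLite engine and made-up rows standing in for the MySQL database:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine

# Hypothetical slice of the employees/salaries query result.
df = pd.DataFrame({
    "emp_no": [1, 1, 2],
    "first_name": ["Ana", "Ana", "Bo"],
    "salary": [60000, 65000, 70000],
})

# Group by employee and take the maximum salary, as in the video.
max_salary = (
    df.groupby(["emp_no", "first_name"])["salary"].max().reset_index()
)

# The first write creates the table.
max_salary.to_sql("max_salary", engine, index=False)

# A second identical write fails: to_sql defaults to if_exists="fail" and
# raises ValueError because the table already exists.
try:
    max_salary.to_sql("max_salary", engine, index=False)
except ValueError as err:
    print(err)
```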
Just to show how this works, let's set the if_exists parameter to replace. Now read the table back with pandas read_sql, calling it max_salary_2_read_sql and using a simple statement, select star from max_salary, with the engine. The result should be very similar to the max_salary DataFrame we used to create the table; the only difference is an extra index column, because to_sql writes the index by default. So let's set index equal to False, run it again so it replaces the table, and when we read it back the index column is gone. Comparing the shapes of the data we wrote and the data we read back, they're exactly the same, and they should be exactly the same data.

To show append instead of replace, let's make a new DataFrame called max_salary_plus and add one million to everyone's salary, so it's different from what we had before. Call to_sql again, writing to the max_salary table with the engine, index equal to False, and if_exists set to append. Now the data has been appended to the table rather than overwriting it: reading max_salary back with the same select star query, the result has the salaries in the lower range plus the ones with a million added, and it's twice the size of the original DataFrame.

One example of why you might want to append to a table: say you compute max_salary on a daily basis. You could create a new column called create_date, pull in today's date, and add it as that column.
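The replace-versus-append behavior just described can be demonstrated end to end; as before, an in-memory SQLite engine and hypothetical salary rows stand in for the real database:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine

# Hypothetical max-salary table.
max_salary = pd.DataFrame({"emp_no": [1, 2], "salary": [65000, 70000]})
max_salary.to_sql("max_salary", engine, index=False)

# replace drops the existing table and writes the new data; index=False
# keeps the DataFrame index from being written as an extra column.
max_salary.to_sql("max_salary", engine, index=False, if_exists="replace")

# append adds rows on top of what is already in the table.
max_salary_plus = max_salary.copy()
max_salary_plus["salary"] += 1_000_000
max_salary_plus.to_sql("max_salary", engine, index=False, if_exists="append")

combined = pd.read_sql("SELECT * FROM max_salary", engine)
print(len(combined))  # 4: twice the size of the original DataFrame
```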
Now if I try to write this to the SQL table with append, it's not going to work, because the table has a new column that wasn't there before. So first I replace the table, and from then on I can append, and the newer datetime lands in the create_date column. Reading the table one last time, we can see the creation date, and running value_counts on it shows rows created at two different times. Just for good measure, let's append once more, read from the database again, and run value_counts on the creation date: there are now three unique creation dates.

I hope you learned a lot in this video about using pandas and Python to interact with relational databases. There's a lot more to learn, but this should get you off the ground. If you paid attention this long, you probably noticed my shirt changed; that's because I finished recording about a week after I started. I made this video because I surveyed my subscribers and this was the topic they wanted, so let me know in the comments what kind of video you'd like to see next and I'll try my best to make it. See you all next time.
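The daily-snapshot pattern described in the transcript (recompute max_salary, tag it with a create_date, replace once to establish the schema, then append each day) can be sketched like this; the daily_snapshot helper, its data, and the two run dates are all hypothetical illustrations:

```python
from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the MySQL engine


def daily_snapshot(run_date):
    """Recompute a (hypothetical) max-salary table and tag it with a date."""
    snap = pd.DataFrame({"emp_no": [1, 2], "salary": [65000, 70000]})
    snap["create_date"] = run_date.isoformat()
    return snap


# Day one: the table doesn't exist yet (or has an old schema), so replace it.
daily_snapshot(date.today() - timedelta(days=1)).to_sql(
    "max_salary", engine, index=False, if_exists="replace"
)

# Day two: append a fresh snapshot onto the existing table.
daily_snapshot(date.today()).to_sql(
    "max_salary", engine, index=False, if_exists="append"
)

history = pd.read_sql("SELECT * FROM max_salary", engine)
print(history["create_date"].value_counts())  # two dates, two rows each
```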
Info
Channel: Rob Mulla
Views: 113,106
Keywords: database query with python, python sql, sql query database python, pandas read sql, sql with pandas, python pandas database, read data from database python, python database connectivity, connection to database python, python data analysis with database, sql python, sql database, rob mulla, python pandas interview questions, sql database tutorial for beginners, pandas read sql for beginners, postgres python, mysql python, pandas read mysql query, sql in python, read_sql
Id: DiQ5Hni6oRI
Length: 16min 59sec (1019 seconds)
Published: Tue Jun 13 2023