Importing a Pandas Dataframe to a Database in Python [For Your Data Science Project]

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey everyone it's nate here with another python video today we are going to talk about how to upload your data that you have in the pandas data frame to a database in the cloud so this is a continuation of a video series that i started a few months ago the video series is the one and only data science project you need where i introduced the infrastructure you could build for all of your data science projects so in the first video i showed you how to extract data from an api and save it to a panda's data frame and now in this video we're going to take that pandas data frame and then upload that to a database in the cloud so why is this important why is this a skill set that all data scientists need to learn so the number one reason is because of memory issues we don't necessarily want to save all of our data locally on our computer and then process all of that our computers are not going to be able to handle it so we want to save all of our data to a database in the cloud or somewhere else on another server and the second reason is because you can save different versions of your data like cleaned up versions of your data in the database that way you don't have to rebuild and re-clean your data from scratch every time you're running an analysis the third reason is because all companies have data stored in databases so it makes sense to learn how to pull and push data to and from a database so let's walk through how to load data from a pandas data frame to a database in the cloud i'm going to be using aws amazon web services and then also let's learn how to update the same table with newly retrieved data from your api so this is a real world scenario that we'll be handling this is something you'll get on the job just needing to update tables continuously with newly refreshed data and lastly let's do this in a scalable way let's write code with great software development fundamentals so that our code can handle millions if not billions of rows of data all right so let's get started if you like content like this please subscribe to this channel thanks and see you at the end of the transition all right so the first step is really just setting up our workstation what i want to set up first is a database in the cloud if it hasn't been set up yet right so i'm using as you can see on the screen i'm using amazon rds right here aws and i created a postgres database called database.yt it's a micro instance so they're actually free for anybody to really just spin up and play around with so i encourage everybody to get on the aws and just start up micro instances to play around with databases if i click into this database here i'll basically have everything i really need to connect to that database so really the endpoint is right here the port number is here and hopefully you saved your username and password to actually get into that database so we'll use these this database and the credentials to connect to it using jupyter notebooks or google collab so that's the next thing to set up i've spun up a google co-labs a notebook here which is exactly the same as a jupyter notebook i kind of just like using google collabs because it's very easy to spin up and it saves to my google drive so the first thing we want to do is make a connection to our database so because it's a postgres database what i want to do is pip install cycle pg2 which is a postgres wrapper that allows you to connect to a postgres database from python so i am pip installing uh cycle pg2 onto the colab server and then i'm going to be importing the library cycle pg2 and then the last thing i want to do is import pandas as pd all right so i'm just going to run this line of code here and it's run successfully here so the next thing i want to do is upload the data that we're going to be working with that we're going to be uploading to our database in the cloud so i mentioned before that our data is stored in a pandas data frame that we grabbed from an api specifically from the youtube api that's stored in a different notebook so what i did was i saved it as a csv and i'm going to upload the csv into this notebook but what i want to say is that you can actually just use the same notebook with the api connection so that you don't have to transfer csv files from one notebook to another that would actually make it much easier because i want to keep the projects kind of clean for these videos i just am working with the csv files that i will then convert into a pandas data frame so i'm just going to upload the data here it's called youtube vids uh first pull and it's on the server right now and then just to read the data i'm just using the read csv method from pandas to read the csv and convert it to a pandas data frame specifying the index column here and the first five rows should look exactly like this so if you followed my previous video on grabbing data from the youtube api this is the exact same data set so we have the video id the video title the upload date and the view and like and comment counts that we want to then upload to a database in the cloud so to my aws database so this is the data that we're going to be uploading and the next thing we want to do is connect to our database so we have our credentials we have the host name which is really uh the ip address of our database as you can see here the database name the port the username and the password of that database so now we're going to be using the psycopg library here to connect to our postgres database on aws so we are going to be using the connect method here to connect to our database and the parameters that are required to connect are host database user password and port all right so basically everything we have here so what i'm going to actually do is just pass these variables to the method itself so host is going to be hostname and the other variables and parameters are passed down just like this i'm also gonna save this connection as a variable called con for connect and i'm gonna initialize it here as well as none so this will provide a connection to our database so what i also want to do is check for errors just in case we have trouble connecting to that database so what we're going to do is implement a try and accept so it's going to look like this so try this connection if it doesn't work raise the error right and so there's an operation error if you can't connect to the database that gets triggered and what we're going to do is just print out the variable e the the error itself and then lastly if there is no error just print out connected so this is going to allow us to connect to our database so what i want to do next is wrap this into a function so that it's isolated from other parts of the notebook and we can reuse this function for anything we want to do in the future all right so our function is going to be called connect to db and i am passing through all of these variables here that connect to the database uh we're using it in this function and it's going to be passed through this uh to this connect method here and then it'll go through the try and accept and print connected if we are successful and then it's going the function itself is going to return this connection if successful all right so now let's test out this function i'm going to be calling the function connect to database here and i'm going to drop this code block to the bottom so we're going to execute the function first build it up and then we are going to use this function next and it looks like we're able to connect to our database so the next step is to create our database table so that we are able to push data from our notebook to the database in aws so what i want to do i'm going to move down this read csv down to you know our create table just so we have an idea of what the schema looks like the schema or the database columns or these pandas data frame columns are always going to be the same so we want to basically replicate what we're seeing here in the notebook back up to the database table so just looking at the column names i'm going to create a sql create table command that basically creates this exact table or the shell of this table i'm saving it in a variable called create table command and it's just a regular sql command that will create the table if it doesn't exist the table is going to be called videos and then we have here just the column names right video id video title all the way down to comment count and then i'm specifying the data type and then i'm saying uh not null as a constraint so this table and all the columns will match this pandas data frame here so what we need to do next is push this command from this python notebook up to our postgres database and so in order to do that i'm going to wrap this command into a function that i could use to create tables and as you can see here i just wrapped it into a function called a crate table i'm going to pass this cur variable here which stands for cursor i'll show you what that's about a little later and then what we're going to use is the execute command to execute this sql command on our postgres database so let me show you how that's done okay so the first thing we want to do is create a cursor variable here so we have our connection to the database and then we're going to use this cursor method here and save it to a variable called curve and that's going to essentially be passed to this crate table uh function here and so just to give you a little bit of context this cursor method here this cursor command allows you to run sql commands and fetch results from the database so it basically allows python code to execute sql commands in your database session so now that we have the cursor uh command built we're gonna use the crate table function here that i created up here and then we're gonna execute on this create table function to build our table okay so it looks like everything was executed successfully let's now take a look at the back end to see if we can see a table there uh what i forgot to mention is sometimes you need to commit your code and to commit the sql code you just do a con.commit to do that and it'll just execute the sql command onto the database so if we now just take a look at our tables i'm going to do a select star from the videos table that we just created and as you can see in this bottom output here we have all of the columns that we've built in the create table command we just don't have any of the data so now the next step is to populate this table in the database with data so in order to push the data from our pandas data frame to a database on aws there's going to be an insert and an update sql command that we're going to be using to perform that task but how we do it really depends on the nature of the data so there's two use cases that i can think of you can have a database table that once you have data in there that those rows are never updated again and so think about this as like bank transactions or say logging of user events these are just like snapshots in time that aren't going to be changed the second type of data are basically records or rows that get updated with time so this is definitely the case here i have a list of videos on my youtube channel and account of the views and comments and all of that those counts change with time so anytime i refresh that data those counts are also different and so i need to update existing rows in addition i need to insert new videos that i make that i put in my channel right so there is definitely an update command on existing videos and an insert command on new videos that i need to write in addition to that there are scalability issues as the data grows my update and my insert commands also grow i don't necessarily want to write huge insert or update commands because they can take up a lot of memory on my computer i'm going to be updating videos one by one and it's going to either be an insert command or an update command and so what i want to do is have a checker or write a checker to see if a video exists first alright so that's the first step before we write this update and insert command so let me show you how to do that so what i want to do first is write down this logic as pseudocode first to build a skeleton where i can put in all my code so that's going to look like this as i mentioned what i want to do is go row by row and check to see if a video exists in the database so what we're going to be doing is going through the pandas data frame for so the the list of videos that i have row by row and let's check if the video exists so i'm going to write a function for that if the video exists update that row update the counts if that video doesn't exist let's just append that video to the database table so effectively we're going to be writing an insert command here and an update sql command here but the first thing we want to do is write this check if video exists function first okay so to check if a video exists it's a really simple sql command that we're gonna execute we're gonna select the video id because the video id if we look at uh this pandas data frame that we have here the video id is the primary key so we're going to check to see if the video id in the database table matches the video id that we have in the pandas data frame here so that's why there's this where video id equals this parentheses s so this parameter this parameter is going to be coming from the pandas data frame it's going to be going through row by row and putting this value into the sql query and then it's going to be putting this value into the sql query and then basically just cross checking with the database table to see if that video id exists there again if it exists we perform an update if it doesn't exist we perform an insert okay so so the next thing we do is we execute this command and we're passing the query here which is this variable and this sql code here and because we have this parameter the video id we're also passing the video id itself as a tuple as you can see here this is the way uh python wants it to be done so we have to pass the video id as this tuple here and the last thing we're gonna do we're gonna wrap these commands into a function called check a video exists we're gonna pass the cursor as well as the video id that we grab from uh using this for loop here so this is kind of where we're gonna plug and play our function right and then the last thing that we're going to return is a fetch one is not none and so what that means what fetch one does is it will return a single row from a table so if a video is found with the required video id it should return the row if it doesn't it will return a none so basically what this is saying is that if you actually get something return that return whatever it is if you don't get anything it means that the video doesn't exist it's a none and then we can go to the else section of this for loop and perform an append or an insert all right so this now this check a video exists function is done we're gonna test it when we build our update and insert functions here what i'm going to do is just group this function with the other functions that we've created okay so the next thing we want to do is uh write this update row function so if the video exists in the database table we want to write a sql update command to update all of the view counts that we have so that's going to be a relatively simple sql update command so here's what that update command is going to look like so we have an update video so we're updating the videos table and then we're going to be setting new values for all of the columns right so the video title in case i rename a video the view count the like count the dislike count and the comment counts are all going to be updated where the video id is what we specify so with this sql command we have a bunch of parameters as you see here these parentheses s's we need to pass values to this query so the past values to the sql query here i'm going to go back to my pandas data frame and if i go here my pandas data frame basically has video title upload date view count like counts so all of these variables have already been named in my pandas data frame and all i have to do is access this pandas data frame so what i'm going to do is create a variable called vars to update it's going to basically go into my pandas data frame and grab the values from these variables here from these columns in my pandas data frame so the last thing to do just like the others is wrap this up into a function of its own i'm going to name this function update row just like this right here and i'm gonna pass the cursor and then all of the variables from my pandas data frame to the sql command here so we have the query like i mentioned before vars to update all of the variables from my pandas data frame and then i'm going to execute that query by passing the query the sql query here and then all of the variables to update so this function is done this function is taking several parameters from our pandas data frame so we need to update this line of code here and actually pass those parameters to the function itself all right so the question is what parameters are we going to be passing and how do we actually do that the first parameter we know we're going to be passing is this curve parameter here so i'm just going to type in c-u-r-r so i'm just going to pass that parameter in the next are the video ids video titles and all of these counts here so then the question is like how do we actually access that from this uh for loop where it's i comma row in df inner rows so if you're not used to writing for loops for pandas data frames it's done in a unique way using this inner rows that gives you both an index that's i and the row itself right so we actually want to leverage this row variable here so in order to do that i'll write the first one we are going to be grabbing the row and from the row we are going to be grabbing video title we know the first parameter here is video id we know it's contained the data is in this row right here this row variable so in order to access that we're going to be typing row and then video id to access that actual video id and if we follow the same pattern for all of the other parameters that we need to pass through to the function it'll look like this where we're passing the row and then the video title and then all of the counts as you see here so now our update function is finished and i'm going to move this function up with the other functions but what about videos that don't exist in our database table on aws we can take care of that by writing another for loop but before we do that what we can do is store all of those new videos that need to be inserted into the database table we can store all of those video ids in its own variable so what i'm going to do here is change this line of code where i'm going to have a temporary data frame but what about all of the new videos in my pandas data frame that isn't in my database table we need to write an insert command that will insert those new videos into the table itself into the database table so we can write an insert sql command to do that but what i want to do first is actually store all of those new video ids into its own pandas data frame so i'm going to create a temporary pandas data frame here called temp.df and what this is basically saying here is that if we check if the video exists if the video exists we're going to update the row if it doesn't exist append that new video and all of its information to a new pandas data frame called temp.df and then the next thing we want to do is actually initialize or build the shell of this pandas data frame so we can actually append the information to that data frame so we know it's going to basically mimic our pandas data frame structure here so let's just create a blank uh pandas data frame we're going to build it above this for loop here and it's just uh building a pandas data frame with these columns and these columns are basically the exact same columns that you see here uh where our pandas data frame lies and we build a shell of this so that we can append new videos to this variable or to this pandas data frame called temp df and of course what we want to do is wrap this whole thing into its own function so that it's isolated from the rest this function is called updatedb and basically what this function is going to do is it's going to create a pandas data frame called tempdf and then it's going to go through this for loop where it's going to check to see if a video exists in the database table on aws if it does exist it's going to update the row if it doesn't exist we're going to take that video id and all of its data and store it in this temp df this new pandas data frame we're gonna then i will then write a new function using a sql insert to insert the new video ids to the database table so what we want to do last is return the temp data frame because we're gonna need to know what these new videos are alright so before we run this update function let's just group it with the other functions here to keep the code clean all right so now in order to run this function here we're going to execute that function down here update db pass the cursor that we've initialized right here and then pass the data frame the pandas data frame that we've read into our notebook here and then because if you remember the update function returns the temp df which is really a list of all of the new videos that are not in the database table yet because it's going to return that we need to save it so we're going to save it in a variable called new vid df for new videos data frame and then that's how we're going to run this update db function all right so let's connect back into the database create the cursor we don't need to run this create table command because we already created the table and then let's run this update db command here and once once that's done let's commit all of the sql commands to the database okay so all the code has executed what are we expecting well let's take a look at the function again we go through our pandas data frame as you see in this bottom here and basically we're checking to see if the video exists first if it exists in the database table we're going to perform an update command the update command is up here it's just a regular update sql command there are no videos in the database table yet because we haven't inserted them in so all of the videos would go into this else statement here where we're actually building this temp df to store all of the new videos so we can perform an insert command later on so what i'm expecting is all of my videos making it to this variable here called new video df and if what if i uh check the output i have 71 rows that represent 71 videos in my channel and all of the information here so this is exactly what i'm expecting and the next thing to do is write the insert command so i can actually push that to my database table okay so writing this insert command is going to be identical to how we wrote the update sql command so first what we have is a regular sql insert query here insert into the table name videos and then insert all of this data here right and then the values are all placeholder values for what i get and all of the values are going to be passed to the sql query here so what are we going to be passing we're going to be passing our data from our pandas data frame from these from all of these columns here just like our update sql command and then lastly we're going to execute that sql command using the execute function here we're going to be passing the insert into video sql command so it's this sql command here and then because they have we have placeholder values for actual data we're going to be passing the data through this variable called rows to insert and that's here and of course everything needs to be wrapped into a function like we've done with the other stuff so that function is gonna look like this we have a function named insert into table and then i'm passing the database cursor and all of the video data and then we again have the sql query here the data to actually insert and then the execution so this is exactly like the update sql queries so if you have any questions about what any of this stuff means because i kind of glossed over it go back to the update section to get more information now what we're going to do next is write a for loop to insert all of this data all of the video data that we have to our database table and again for scalability and for performance what we want to do is write a for loop so we insert one video at a time rather than inserting all of the videos all at once because you might hit performance or memory problems that way all right so let's write our for loop to insert these videos into our database table so that for loop is basically gonna look like this it's very similar to the update for loop that we we wrote but what we're doing is we're going to be using inner rows again and then we're going to be calling this insert into table function that we wrote up here then we're passing all of the parameters of the video all of the data and it's going to be inserting all of that data into our database table as you see up here in this function so now the last thing is to wrap this into a function i'm going to wrap this i'm going to name this function append from df to db and pass the cursor and the pandas data frame that we have all right so let me move these functions to where all the other functions are okay so now to insert all of the new videos into our database table we want to use that function so what we're going to do is call this function append from df to db pass the cursor and then pass this and pass this new vid data frame right because this pandas data frame contains all video data that's not currently on the database table so what we want to do is pass that pandas data frame to this function here alright so let's test it out let's connect back into the database create the cursor and then i'm just gonna execute this append function that we wrote and then commit all of the sql queries okay let's see if that worked so what i'm gonna do is go back to my ide that i can use to connect into my database i'm going to run the select star from videos query here and what i have are 71 rows so 71 videos and i have all of my video data that were was basically inserted into my database table so that function worked so that's all the code we're going to write we took a pandas data frame and uploaded all of the videos all of that data into a database table and what i told you before is that the nature of this table is that all of the records get updated every time we refresh the data right every time i'm pulling from the youtube api all of the counts are going to be different as well as new videos are going to be part of that data poll right so let's test out if the update command actually works to see if we get uh new video counts and then let's take a look to see if the insert command will also catch new videos okay so we're going to track a few things in this experiment so i just ordered the the video table by the upload date so the latest uh date the latest video that i have in my table is this step-by-step approach to solving any data science sql interview question by twitch and then let's just keep track of this view count here it's 422 all right so now what we're going to do we're going to pull new data from the youtube api and then save that into a csv file thankfully i did that already i'm gonna upload the second poll to google collab so i have uh youtube vid second pull here and all i really to do is just rename this to uh the new um csv file second poll just going to run this again and as you can see here i have this new video called working with apis in python and then the step-by-step approach that twitch video that i mentioned before the view count is now 1579. so we want to test these two things to see if our code will actually update and insert as expected all right so let's test it out we are going to connect back into the database build the cursor we don't need to create the table we are going to be running this update command but i'm going to save it into a new variable maybe i'll just call it new df2 and then pass that to this insert so it inserts the new the new video that we caught so let's just run this update command here it's going to take a few moments i'm just going to run this commit command on top of that and now all of the existing videos have been updated all the accounts have been updated now we saw one new video that wasn't on the database table so let's append that let's insert that video in commit that now let's go back to our sql ide to check to see if all of those changes actually made it to the database table i'm going to run this again and look at that the we have 72 videos here so we have inserted one video working with apis in python for your data science project and then that step-by-step approach that twitch question it's updated the count is updated to 1579. so before it was 422 now it's 1579 the update command worked the insert command inserted all of the new videos and that's how you take a pandas data frame and upload the data to your database table up in the cloud all right guys that's it for me all of this code and the notebook itself is going to be uploaded to github and i'll have a link in the description below basically what we did in the past few videos was build a data pipeline we extracted data from an api in this case the youtube api saved it as a pandas data frame that video is linked in the description below if you want to know how to do that and in this video what we did was we took that pandas data frame and we uploaded that into our database table on aws so effectively you've built a data pipeline the only thing that's missing is adding in the scheduler so that you can automatically refresh your data from the api whenever you want but this is really the start of building out your data science project you're always going to be grabbing and collecting data and then cleaning it up this is really just a systematic way to grab data from whatever source you want and then push it to a database table you're going to need to know how to do this on the job so you might as well get some training get some experience before that and again the way we wrote the code is that it's scalable for millions if not billions of rows so you're never going to have any performance issues or memory issues if you have a lot of data that you're playing with so this skill set is directly transferable to a real job and it's something that you can speak to on a job interview as well alright that's it for me and this python series if you like content like this please subscribe to my channel i'll see you guys at the next video thanks [Music] you
Info
Channel: Nate at StrataScratch
Views: 5,208
Rating: undefined out of 5
Keywords: pandas dataframe, data science, data scientist, importing a pandas dataframe, database in python, importing data to a database
Id: 77IVf0zgmwI
Channel Id: undefined
Length: 33min 45sec (2025 seconds)
Published: Wed May 05 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.