Solve Data Science Tasks In Python (Saving Hours) (automating import CSV files to database)

Video Statistics and Information

Captions
Hey guys, Nate here. I thought I'd try something new and write a Python utility tool that can take a CSV file you have and upload it automatically into a database, specifically in my case a Postgres database. So why is this important? As a data scientist you work with a lot of different data. Often that data is stored in your database or your team's database, but you're also working with other teams, like marketing and sales, that may have their own data in Excel and CSV files on their laptops, or in their own databases that you don't have a connection into. So oftentimes you're actually grabbing their Excel and CSV files and uploading them from your computer into your own databases, which then lets you blend their data with yours and run any analyses you want.

What sucks is that this process takes a lot of time, especially if you have multiple CSV or Excel files to upload into a database. That's coupled with the fact that a lot of file names are not properly named for a database upload: they may have capital letters, extra spaces, and extra symbols that need to be cleaned up, and in addition the column names also need to be cleaned up. That can take a lot of time, so why not just automate the whole process? That's what I'm going to be creating today: a Python script that takes a CSV file, cleans it up, and uploads it into a PostgreSQL database. It's going to be strictly a first version, a proof of concept, so a lot of things will be written manually. The next video will be version 2 of this script, where I'll refactor all of the code and make it completely automated, so that I can put multiple files in a directory or folder, run the script without changing a single line of code, and have it upload every file in that folder into a new database table without me touching any code at all. That's how you save time and reduce errors as a data scientist.

Okay, with all of that said, let's start coding. Open up your favorite editor; I'm using Jupyter Notebook, but you could also use Google Colab, Sublime, or any other editor you're accustomed to. In terms of steps: first, import the CSV file into a pandas DataFrame, where all of the manipulation will actually be handled. For the manipulation, we want to clean the table name, removing all extra symbols, spaces, and capital letters, basically anything a database can't handle, and then do the same thing to the column headers. Next, we write the CREATE TABLE SQL statement and pass it to the database so we can create the table, and as the last step we import the data into the database. So, five steps to import a CSV into a database.

The first thing we want to do is import the CSV file into the pandas DataFrame, but before that let's import all of the libraries we're going to be using today. The libraries I want are the os library, probably numpy, and definitely pandas. The last library is psycopg2, a Postgres wrapper that will be used to make the connection to our database, since our database is a Postgres database. If I run this, it should all work, and it does.
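For reference, a minimal sketch of those imports (assuming psycopg2 is installed, e.g. via pip install psycopg2-binary):

```python
import os          # file-system helpers; handy once this gets automated in v2
import numpy as np # loaded in the video, though not strictly needed for these steps
import pandas as pd
import psycopg2    # Postgres driver used to open the database connection
```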
The next thing I want to do is take a look at my data file, my CSV file. To do that I'll use an exclamation point with ls, which runs the terminal (bash) command that lists all of the files in the working directory. In there we have the CSV file I want to import to the database. It's called Customer Contracts$; there's a space between the two words and a dollar sign, and that needs to be cleaned up, because I can't create a table with that name. The other file is the notebook we're working in right now, called csv_import, so we'll ignore that; what's important is finding the CSV file.

Now I'll read the CSV file into a pandas DataFrame, which I'll name df, with df = pd.read_csv(...), and then output the head, the first five rows, of the DataFrame. The headers are kind of messy: you don't really want spaces, capital letters, or symbols like a dollar sign or a question mark. So what we'll do next is clean up both the headers and the name of the file, which will become the name of our database table.

The next step is to clean the table name, removing all symbols, spaces, and capital letters. The rules we want for every table name are: only lowercase letters; remove all white spaces; and replace symbols like dashes and the different variations of slashes with underscores. Our file name is Customer Contracts$, so first let's strip off the .csv extension, and then write a script that automatically enforces those three rules. I'll call the result clean_tbl_name and set it equal to the file string. To enforce lowercase letters, so that all letters in the string are lowercase, we use the lower() method, and to remove the white spaces and replace the symbols with underscores we use the replace() method, chaining one call for every rule. The white-space rule is easy: the first parameter of replace() is what you're looking for, and the second is what you want to replace it with; here I'm looking for spaces and replacing them with nothing, since there's nothing between the quotes. Then we do basically the same thing for the symbols, though I'll change the rules up a little: replace any dash with an underscore, any forward slash with an underscore, any backslash with an underscore, and remove dollar signs, percent signs, parentheses, and question marks. There are two things I want to note here. First, I'm using escape symbols, backslashes at the end of each line, just to continue onto the next line, so that I don't have one really long line of code.
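A sketch of that chained cleanup; this shows the final version of the chain, including the dollar-sign fix and the space-to-underscore tweak made just below, with the file name assumed from the video:

```python
# assumed file name from the video, .csv extension already stripped
file = "Customer Contracts$"

# rule 1: lowercase everything; rule 2: swap spaces for underscores;
# rule 3: swap dashes/slashes for underscores and drop the other symbols
clean_tbl_name = file.lower() \
    .replace(" ", "_") \
    .replace("-", "_") \
    .replace(r"/", "_") \
    .replace("\\", "_") \
    .replace("$", "") \
    .replace("%", "") \
    .replace(")", "") \
    .replace(r"(", "") \
    .replace("?", "")

print(clean_tbl_name)  # -> customer_contracts
```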
That's just a good technique if you didn't know how to spread one Python statement across multiple lines; it looks cleaner. The second thing to note is the use of r, which means raw string: I want Python to treat the string literally as a slash, not as an escape character or anything else. Once all of the replace() methods and the lower() method are in place, let's see if it works by outputting the new table name. We forgot to remove the dollar sign, and I see there's actually a duplicate question-mark rule, so I'll exchange the duplicate for a dollar-sign rule, and everything should be good now. Our table name, which was once Customer Contracts$ with a dollar sign, a space, and capital letters, is now just one word: customercontracts. If you like how that looks you can keep it, but I want to make one change and put an underscore between the two words, so instead of removing spaces I'll replace each space with an underscore. Now I get customer_contracts; I can tell it's two words, and it just looks a little nicer and cleaner.

Let's do the exact same exercise with the headers. I'll copy the comments so we don't have to rewrite them: clean the header names using basically the same rules as the table name. First I need all of the columns in my data file, or rather my pandas DataFrame, and df.columns gives me exactly that, an index of all the header names. To clean up the column names I need to go one by one through that index and apply the same rules, so I'll write a list comprehension that lets me do exactly that: df.columns = [x. ... for x in df.columns], where x is each column name and the body is the same chain of calls copied and pasted, because I know it works and the rules are exactly the same. I hope that makes sense; it's just a regular list comprehension. If I run it and look at the output, I now have column names that follow the rules: all lowercase letters, no spaces, no extra symbols. That's exactly what I want for the column names of my database table.
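A sketch of that list comprehension, reusing the same chain of calls on each column name x:

```python
# apply the same cleaning rules to every column header
df.columns = [x.lower().replace(" ", "_").replace("-", "_")
               .replace(r"/", "_").replace("\\", "_")
               .replace("$", "").replace("%", "")
               .replace(")", "").replace(r"(", "").replace("?", "")
              for x in df.columns]

print(df.columns)  # all lowercase, no spaces, no extra symbols
```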
We're almost there. Going back to the steps: we've imported the CSV file into a pandas DataFrame, cleaned the table name, and cleaned the column headers; now let's write the CREATE TABLE SQL statement and then import the data into the database. The SQL statement will look like CREATE TABLE customer_contracts (...), with our table name followed by every column and the data type for each column. One thing you'll notice is that the data types in the statement don't match the pandas DataFrame data types: if you run df.dtypes you'll see that our columns are object and float64, but the SQL statement we need to pass to Postgres to create the table uses varchar and float. So we need to replace the pandas data types with database data types. To do that I'll write a dictionary, called replacements, that holds the mapping: the pandas type object maps to varchar, and float64 maps to float. Those are the only two data types in our DataFrame and our SQL statement, but there are more, so for completeness I'll add the other mappings: int64, which you see in a pandas DataFrame often, to int; datetime64 to timestamp; and timedelta64[ns] to varchar. Outputting the dictionary, it looks exactly like what we wrote; this will be used to create our SQL statement.

So we have the column names in df.columns, and we have the data types converted from pandas types to SQL types; the next thing is to build that column portion of the SQL statement. One thing I forgot to add is that the statement needs commas between the column definitions, which makes the code slightly more complicated. We want code that places each column name next to its data type, with the data type converted to a SQL type via the replacements dictionary, adds a comma, and repeats for the next column and the next data type, every single time. My variable is called col_str, for column string. I start with the separator, a comma and a space, and use the join() function to join each column name with its data type. Then I write a for loop that walks through both lists at once by zipping df.columns with df.dtypes, and since df.dtypes holds pandas data types, I use the replace() function with the replacements dictionary to swap in the SQL data types. So, to recap: the comma separates the column definitions; the curly brackets with format() place the column name and the data type; the for loop goes through each element of df.columns and df.dtypes; and replace() with the replacements dictionary converts the pandas types to SQL types. Zipping them up puts each column name right next to its type, and packaged up like that, it all becomes one line of code.
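Putting that together, a sketch of the mapping dictionary and the join/zip one-liner (df.dtypes.replace(replacements) swaps each pandas dtype for its SQL counterpart):

```python
# map pandas dtypes to Postgres column types
replacements = {
    'object': 'varchar',
    'float64': 'float',
    'int64': 'int',
    'datetime64': 'timestamp',
    'timedelta64[ns]': 'varchar'
}

# build "col_name type, col_name type, ..." for the CREATE TABLE statement
col_str = ", ".join("{} {}".format(n, d)
                    for (n, d) in zip(df.columns, df.dtypes.replace(replacements)))
print(col_str)
```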
Now let's see what we get, and it looks like we have it exactly right: the column name followed by its data type, and it's a SQL data type, because it says varchar and not object; then a comma and a space, then the next column name and data type, and so on. It's exactly what we want. It's just not in the pretty multi-line format you'd write by hand, but that's fine, because when we pass the SQL statement to the database it won't care about formatting.

I think the hard part is done; we basically have code that writes this statement for us automatically. The next thing is to establish a connection to our database and pass these SQL statements through. I'm using an AWS RDS Postgres instance: on Amazon Web Services my database is called database.yt, and it's a Postgres database. I'd encourage everybody to play around with cloud databases rather than setting up your own local database. The first benefit is that they're much easier to set up: a lot of the settings and configuration are completely automatic, essentially a one-click setup, and the entire database is configured for you. The second benefit is that these cloud instances are often used in industry, so it's something you can put on your resume if you're trying to find a job as an aspiring data scientist: that you know the AWS or Google Cloud infrastructure and have played around with databases before.

Back in the Jupyter notebook, let's establish a connection from the notebook to my Postgres database on AWS, which means using the psycopg2 wrapper. To open a database connection we first need the connection string, which is basically the address of the database, the database name, the username, and the password. In my case the host is the address of the database, an RDS endpoint on Amazon Web Services; the database name is nate_db; the username is nate; and the password is blanked out. Next we establish the connection by passing the connection string to psycopg2, and once we have an open connection, we print "opened database successfully". Let's run it to see if we can open the database, and we do: I'm passing my credentials to psycopg2, and psycopg2 is establishing the connection to the database.
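A sketch of that connection step; the host, database name, user, and password below are placeholders, not real credentials:

```python
# libpq-style connection string: swap in your own RDS endpoint and credentials
conn_string = "host=YOUR_RDS_ENDPOINT dbname=nate_db user=nate password=YOUR_PASSWORD"

conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
print('opened database successfully')
```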
Now that a database connection is open, we can start sending SQL statements to that database. The first thing I want to do is drop any tables in the database that have the same name as the table we're about to create, because I don't want any conflicts or errors to pop up; then I'll create the table; and lastly I'll insert the values into the table. Let's work on the first one. To drop tables with the same name, we use our cursor, our handle on the database connection, and its execute() function, writing DROP TABLE IF EXISTS with our table name, customer_contracts. Let's see if it works, and it does; no error pops up.

The next thing is to create our table, using essentially the same pattern as the drop command, but deleting that command and writing our CREATE TABLE query instead. That query should look exactly like the target statement: CREATE TABLE customer_contracts followed by the column definitions, and that second part is exactly the column string we built, all the work we did to put each column name next to a SQL data type instead of a pandas data type. So I'll copy that part of the query into the CREATE TABLE query so the whole thing matches the syntax we originally wanted. When we execute the query, it says "current transaction is aborted, commands ignored until end of transaction block"; I think we just need to open a database connection again, drop any tables with the same name, and re-execute the CREATE TABLE, and then it looks like we have it. Sometimes the connection closes while you're working, and you just need to open it back up and re-execute your code. So now we've created our customer_contracts table in the RDS Postgres database on AWS.

The next thing is to insert values into that table, and there are multiple steps to that; it's not as simple as writing an INSERT statement. The first step is saving the pandas DataFrame to a CSV file: all of our data is currently in a DataFrame, and we'll save it to a CSV file that we then use to upload to the database table. After that, we open the CSV file, save it as an object, and upload it to the db. Starting with the first part: we take the DataFrame object df and use the to_csv() function. We'll name the CSV file customer_contracts.csv; the headers will be df.columns; we don't want an index, because copying the index column from the pandas DataFrame into the CSV isn't needed; and lastly we enforce UTF-8 encoding. If you're taking an Excel file, it can come with different encodings that mess up the import, leaving strange little symbols when you try to load the CSV into Postgres or any other database, so we explicitly say the CSV file we're creating uses UTF-8. Then we open the file: I'll create a variable, my_file = open(...), with the file name customer_contracts.csv, and if that's successful, print "file opened in memory". It's basically taking the CSV file, opening it up in your RAM, and preparing it to be inserted into the database table.
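A sketch of the drop/create step and the CSV round-trip, using the cursor from the connection sketch above; note that to_csv's keyword is header (the headers typo is the bug hit just below):

```python
# drop any same-named table, then create the table from the column string
cursor.execute("drop table if exists customer_contracts;")
cursor.execute("create table customer_contracts ({});".format(col_str))

# save the cleaned DataFrame (no index, UTF-8), then reopen the file for upload
df.to_csv('customer_contracts.csv', header=df.columns, index=False, encoding='utf-8')
my_file = open('customer_contracts.csv')
print('file opened in memory')
```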
Let's run this code block to see if everything I've written works, and nope, there's a keyword-argument error: I wrote headers, and it should actually be header. Trying again: "file opened in memory". So that's all correct. We've opened the CSV file and saved it as an object, and now we need to upload it to the database, so I'll bring the comment down here: upload the object, the CSV file, to our database. Now we need our SQL statement, and it's not as easy as just writing an INSERT, because the file is open in memory as an object right now, and we want to copy that entire object and all of its values into the database on AWS. I'll create a variable called SQL_STATEMENT to store the whole thing: COPY customer_contracts FROM STDIN WITH CSV HEADER DELIMITER AS ',', since it's a CSV, comma-separated values. What the statement says is: copy the CSV file you have open right now, which has headers and commas as delimiters, and copy all of those values into the table called customer_contracts. To execute it we use cursor.copy_expert(): the SQL will be the SQL_STATEMENT above, and the file we want to copy is my_file, the file that's open right now. Once that's copied, I print "file copied to db". Let's see if it works, and it looks like it does.

One thing I do want to do is change the permissions on the table so that it's open to public, basically open to anyone who has access to my database. This is completely optional, but there can be permission issues when multiple users try to access the table, so to grant access I execute GRANT SELECT ON TABLE customer_contracts TO PUBLIC, so that anybody with access to the database can actually access the data and the table. Then we commit, and when everything is done we close the database connection and print "table customer_contracts imported to db completed".

The last thing is to check whether the database table was created and all of the values were inserted. I'm using a database IDE called DataGrip; it's basically a platform that lets me connect directly to my database. If you're using a locally installed Postgres database, pgAdmin is another platform you can use. To check that the table exists and the values are there, let's just run a SELECT * FROM customer_contracts, and it looks like we do have it: our customer_contracts table, with the header names all reformatted so they're lowercase with no extra symbols or spaces, and the data got uploaded as well.
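A sketch of the copy and cleanup steps; copy_expert() with its sql and file parameters is psycopg2's API for COPY ... FROM STDIN, and the SQL text follows the statement described above:

```python
# stream the open CSV file into the table via COPY ... FROM STDIN
SQL_STATEMENT = """
COPY customer_contracts FROM STDIN WITH
    CSV
    HEADER
    DELIMITER AS ','
"""
cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
print('file copied to db')

# optional: let anyone with database access read the table, then commit and close
cursor.execute("grant select on table customer_contracts to public;")
conn.commit()
conn.close()
print('table customer_contracts imported to db completed')
```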
So that's essentially how you upload a CSV file to a database, in this case a Postgres database. As mentioned before, all of this is pretty manual: I'm typing out the name of the CSV file, typing out the name of the table I want to drop and create, copying and pasting the column string, and typing out the name of the CSV file that I want to save and then upload as an object. All of that manual work is okay as a first version, a first iteration, to prove out the MVP, to prove that this can actually work. The next step as a data scientist, especially if you're working in teams or on the job, is to automate as much as possible so you're not editing or refactoring any code at all. In the next video I want to refactor all of this code and make it as general and as automatic as possible, so that when you want to upload any CSV file, or multiple CSV files, you can do it by dropping the files into a directory or folder, running the script without changing a line of code, and having everything automatically upload to your database tables. The benefit of all that automation work is that it saves you a lot of time, especially if you have multiple CSV files, and it reduces or eliminates the errors you can make when you start manually typing things out, because everything is automatic at that point. So if you want to see how that goes, subscribe to this channel and you'll be alerted when I release that video in a few weeks.
Info
Channel: Nate at StrataScratch
Views: 7,771
Keywords: data science, python, python pandas, python numpy, numpy, postgres, psycopg2, analytics, automation python, python automation, python data science, csv files, import csv files
Id: wqBFgaMgFQA
Length: 33min 36sec (2016 seconds)
Published: Sun Nov 15 2020