Convert Excel Files to CSV using Python | Working with Large Excel Files in Power BI

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in a recent consulting assignment i was working with a lot of excel files they were large and they had very nasty data and i had to clean the data before i could put the data into power bi now turns out when you connect the data through excel into power query and you have to work on a large data from excel and the data transformations are also pretty complex the performance is going to be absolutely terrible how did i solve the problem i converted all of those excel files into csv files and the performance skyrocketed now i thought maybe i'll just share the idea with a couple of people on twitter and on linkedin people seem to like it too much i believe a lot of people are facing such problems in this particular video i'll share a nifty python script to convert all of those excel files automatically into csv files let's go all our people working with two simple excel files here let me take two minutes to explain you the data and then i will talk about the construct in which i will write the script the logic part of it take a look we have two simple excel files in here we have the east excel file and we have the best excel file each of these excel files could possibly contain multiple sheets and each sheet should be converted into its own csv now it may be a possibility that you don't have one excel file and multiple sheets within that you could just have a single excel file and one sheet within it but if that is a scenario you could possibly use this even if you don't have an excel file with multiple sheets the code is going to still work just okay now please take a look in the first excel file which is the east excel file i have three sheets we have jan feb and march the data structure is the same it doesn't matter whatever be the data structure i'm just going to convert whatever data do we have into a csv now note that a csv can just contain one sheet you can't really have multiple sheets within a csv take a look in the other excel file that we have we again have two different sheets we have jan sheet and the feb sheet now what is the logic that i'm going to follow to be able to solve this particular problem it's very very simple all that i'm going to do is using python i will step inside this particular folder that's part one once i step inside this particular folder then i'm going to take a look at every single excel file not any other file just excel file once i pick up this is going to be like a loop so i loop through every single excel file i pick up the file and within that file i take a look at the name of the sheet that means what sheets do you have convert jan into a csv convert feb into a csv and convert march into a csv because to avoid any name conflict i'm also going to concatenate the name of the excel file which is right here to the name of the sheet so that i probably just get the all the sheets which are absolutely unique all right and that's going to be the solution let's just jump over to python and start to solve this problem all our people i'm here in visual studio code i'm assuming that you have some sort of editor to be able to write the code into your computer i'm also assuming that you have certain packages and modules installed and you have the bare essential working knowledge of python my job is not to really teach you python here i am learning myself and i'm barely getting at a level where i can make workable solutions so it's going to be an explanatory process for the both of us please take a look the first thing that i do is after i open the visual studio code i'm going to create a new file in here and i'm just going to maybe create a new file i will save the file in the same directory where my excel files are kept so i'm just going to click right here and say control s to save it and in the same folder i'm going to call this as excel to csv and i'm just going to maybe call this as a python file so right here save it and the file has been saved now if you take a look at the folder on the right hand side uh you can see that we have the python file added right here currently the file is blank let's just start to write some script in that file all right so the way that i would want to start is that i would want to step inside the folder and i want to take a look at all the files in the folder not all the files but only excel files that i would want to work with to be able to do that i'm going to start to write something like an import and i will import the os module which is actually going to help me to take a look at the contents of the directory i'm also going to import pandas that i will use later as pd and let's just declare a path that means i have to tell python that where to look for the files and i'm just going to maybe put an equals to sign come right here copy this particular path and paste that right here i'm also going to make sure that the subfolders are demarcated by a double slash and not a single one that's how the way it works here all right now once i have put in the path i would want to take a look at what contents or files are there in that path so i'm going to use a like a function let's just declare a variable first so i'm just going to say that all the files that you get you're going to store that files in the variable called files and i'm going to use a simple function called list directory to be able to get the list of the files which are there in that path so i'm just going to feed the path right here and let me just see the print of this let me just kind of call the interim output and take a look at what is the result that i'm getting as of now so i will call the print command and i'm just going to say hey why don't you try to print the files variable and take a look at all the files which are there in this particular path i'm just going to maybe say run run without the debugging and you can see that at the bottom here it gives me that we have excel file there is a python file which we are currently working on there is a west file and these are two temporary files that we cannot see probably these files are hidden that's totally okay the next thing that i would want to do is i would want to create a little loop inside of this particular folder take a look at every single file and only pick up to start to work with that file if that happens to be an excel file if it's not an excel file i would like to drop it right so how do i actually create that little for loop i'm going to go ahead and start to write a little for here so i'm going to say for each file in files this is going to be a single s each file in files then i'm going to say something like hey why don't you actually check if the each file dot let's say end with the extension which is dot x lsx then i would want you to go ahead and print the name of the file right so let's just take a look this seems to be right i'm just going to click on run run without the debug and this only gives me the names of the two files which is east and west so far it seems to do the right thing so now let's just go and continue this and start to step inside of that file and start to grab the data of the sheet and convert that into a csv all right now what we're going to do is we're going to pick up the names of the excel files which is nothing but a text and from that we will start to read the data within that excel file what we have been able to get is just a piece of text east.xlsx is nothing but a text we need to convert that text into a real excel file which contains the three sheet data so how do i do that i'm going to go ahead and let's just comment that out maybe kill that and i'm going to write something like excel file which is a variable that i'm creating in that variable i will use a python like a function so pd.excel file and i'm going to say that hey convert that every single name which is this particular name east and west into a real excel file so these particular names are stored in each file so i'm just going to maybe reference each file in here and once i actually get to the excel file now i will have the access to the sheet names of the file let's just try it out so i'm just going to say that excel file is the file and that file actually gives me the access of the sheet names and this particular variable is going to be stored in let's say another variable called let's say sheets so i'm just going to maybe write sheets and that is going to be my variable so i'm just going to delete that extra space in here and that's okay now to be able to test this out have we really gotten the names of the sheets or not let's just actually use the print command to print the names of the sheets right so i'm just gonna maybe run this click on run without the debugging now what i should be able to get or see is the names of the sheets and if you remember we had five different sheets in two different excel files the first one had jan feb and march and the second excel file had jan and feb so we have been able to get to the sheet level now let's just take the data off the sheet convert that to a csv and store it back in this particular folder all right now that we have been able to get the names of the sheets let's just step inside every single sheet and try to grab the data of that particular sheet how do we do that i'm going to get rid of the print command in here and start to declare another for loop the reason why i'm declaring another for loop is because i would like to step inside every single item of this particular sheets that i have been able to get which is nothing but an array or a list and i'd like to pick up the first one and i'd like to get the data off that pick up the second one get the data off that so on and so forth so i'm just gonna say for each sheet in the sheets that we have been able to create i'm gonna go ahead and i would like to create another variable called data or sheet data let's just call that so sheet data is equals to excel file uh this is the name of the excel file if you remember we just got the name of the excel file from here actually the excel file from here not the name so i can now use a parse function to parse the data of every single sheet and in here i'm going to just use each sheet in here because that is where i have the name of the sheet all right now let's just test it out i am seeing that this parse function once it is taking a look at every single sheet name which is the jan sheet the fab sheet and the mod sheet it is storing all of the data in this particular variable called sheet data let me just try to print that out so i'm just going to maybe hit and enter and try to print nothing but sheet data and let's just go to run and say it on without the debug and i should now be able to take a look at the data that i was taking a look at in excel so we had three columns date value and the product and we were able to take a look at all of the data of all of the five sheets right here now the problem is that the data as of now is just thrown at the terminal here however i don't really want to throw the data on the terminal i actually want to grab the data and write that back into a csv file let's just try to do that once we have been able to get the data of every single sheet now it's the time to take the data and put that in a csv there happens to be a very simple function for that so i'm just going to get rid of the print in here and start to write something like hey why don't you take the sheet data which is the variable that we just created and please write that to a csv now this particular function is asking you hey once i create the csv would you like to give any name to that and that's going to be the input right here in the brackets now the name obviously for the csv is going to be dynamic that means i need to have the file name concatenated that with the sheet name and i know for the fact that this particular variable contains all the file names and this particular variable contains all the sheet names maybe i could just concatenate that but before i start to concatenate and produce the csc files i however would like to test it out that are the namings appearing fine or not otherwise i'll just produce a bunch of files then i'll have to delete it and i'll have to redo a bunch of work so i'm just going to comment this out for just a quick second and create another variable and i will try to test out the names of the files which are being created i'm going to call this as csv name and i'm going to say that hey this is going to be a concatenation of each file plus plus is the and percent or the concatenation symbol in python and i'm going to say hey why don't you just maybe write a little dash in between plus once again and each sheet and then plus once again and then a little csv dot csv in the end because the extension of the file needs to be dot csv and that's a little mandate that you'd have to follow all right now i'm going to test this out that are the names appearing the right way or not so i'm just going to say print why don't you print the csv name let me just take a look at that and i'm just going to run this particular thing what do i get i get the names of the files all good but the only trouble is that since the name of the excel file which is this file contains the dot xlsx word it looks a bit ugly and i don't really want to have that so i'm just going to create a clean file name and just import it here so i'm going to go back to my if function and after the if function has been tested that you're only working with excel files i now want to clean up the name of the file so i'm going to do something like this i'm going to say clean file name and i'm gonna say hey why don't you pick up each file and use the replace function and in the replace function i will say hey i have dot xlsx that i would like to replace it with nothing like a blank string and that's the clean file name that i will actually input in my name of the file which is the csv file so i'm just going to say clean file name and we're good to go and let's just try to test it out so i'm just going to say run run without the debug now this seems to be all good and this particular uh print command gets deleted and now in the sheetdata.csv i'm just going to put in the csv name right here the other input that i would want to do in this particular thing is that i don't really want to have the index if you took a look at the output that was being generated by the sheet data i was getting four columns of data not three columns of data however in my excel we had just three columns of data we had the date we had the value and we had the product what is this particular column this is nothing but the index that is created by python automatically i don't want to have the index so i'm just going to come here and i'm going to say when you're converting the data back to a csv just don't have any index so i'm just going to say index is equals to false all right that's good to go and let's just run this let's just see do we get a bunch of csvs or not so i'm just going to go to run around without debugging let's just take a look at the folder and voila what do we have we have a bunch of csvs created right here which are now created in the folder usable files that you can now put that into your power bi and skyrocket your model's performance all right that's been it let me know if you have any questions around this and i will try to reply to your questions i'm not skilled at python myself so perhaps your questions are going to help me learn if you are however skilled at python and if you have a better way of solving this particular problem please do not hesitate to drop in a comment and show us all the better way of doing this in the end i'd like to give a big shout out about my dax and power query courses which is where i'm actually skilled at doing a bit of tax and power query if you're starting out with power bi and the journey to learn dax or power query is hard and you would want to master the fundamentals first and then move on to solving more challenging more difficult problems of your own data i'd highly recommend that you take a look at my courses it's going to be super awesome thanks so much for sticking around and i will catch you guys in the next one cheers [Music]
Info
Channel: Goodly
Views: 14,576
Rating: undefined out of 5
Keywords: Power BI, Power Query, Excel, DAX, M Formulas
Id: 7ExoLCzvfCI
Channel Id: undefined
Length: 14min 58sec (898 seconds)
Published: Sat Jul 16 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.