How to Read Excel Files with Python (Pandas Tutorial)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Excel is still the most used program to share and visualize data within many organizations and while Excel is an amazing tool it also has its limitations if you want to level up your data analysis and learn a better way to work with data then this video is for you I will show you how to read Excel files with python and the bonus library in particular I will first briefly cover the basics and then go into more detail showing you some practical tips hey everyone welcome to the channel my name is Dave and I make videos to help you become better at working with data let's get started okay so before we start there are a few requirements that we need in order to read Excel files with python and the first one being obviously a python installation so make sure python is installed on your system I will leave links to all of the requirements in the description so yeah go ahead and install python if you don't have it already then secondly we need some sort of way to run python codes and my preferred way of doing that is Visual Studio code so if you don't have that already or if you're using some other way to run python that is fine but if you're new to python then I highly recommend installing Visual Studio code I also have a video about Visual Studio code which I will link somewhere up here okay and then we need two python packages the first one being pandas so if you don't have that already you can install it using pip using the command over here and the same goes for open by Excel you can also open install that so make sure you have both bundles and Open PI Excel installed now if you're completely new to Python and don't know how to install packages using pip then please check out the video that I've Linked In the description now once you've installed all the requirements you can open up visual studio code or any other preferred IDE of choice and create a new file I will first quickly show you the basics and then we'll get into some more advanced tactics if you want to follow along there's a link in the description to a GitHub repository where I will upload this project also including the data I also have some other videos where I explain how I set up my visual studio code for data science that is the video over here and I also have a video where I explain how I structure my projects that is this one could be interesting if you want to follow along for this tutorial we'll be using four different Excel files I got the data from an open source data set on kaggle so it is Data from a pump so there are a lot of sensors on that pump and our goal for today is to have a look at the data and make some plots so here we can have a look at the folder structures So within the main project directory there is a folder called Data within there we have a folder called raw and then pump sensor data and over there are the four Excel files they are split by day so this is from the 1st of April 2018 2nd of April 3rd and so on and then if we open up the files in Excel this is what it will look like so we have a timestamp and then we have sensor 0 all the way to sensor 51 and the last column is the machine status which indicates if the pump is running normally or if there's a defect okay so now let's read this file into Python and to start off we'll create an empty python file so here I have a file read Excel file dot Pi which is located over here in the source folder and then within data so we'll start with an empty file and we'll start off by importing the pandas module as PD and I'll run this over here I will run it in an interactive session okay so now we have the pandas Library imported and now we're going to use the read X Excel method which is this one over here so pandas has a read Excel method where we can specify the parameters and then we can read in Excel so I'll now show you how to do that I'm assuming you have some basic python knowledge otherwise I will refer you to my video about working with python in vs code so we'll start off by declaring a variable data and then we'll call PD referring to the pandas function and then the read Excel method opening and closing parentheses and now we have to specify the path and we do this between quotes and the easiest way to get the path to a file is to right click over here on the left click copy path and then paste it over here and then run this line and then if we show the data variable we can see that we have loaded the data so that's quite awesome so with just one line of code we were able to take all of the data within this Excel file and load it into a pandas data frame those are the very Basics if that's what you came here for then then this is all you need you can also go ahead and explore the bonus read Excel documentation on your own and find additional arguments that you can use to load and prepare the data even better other parameters that are very useful and that I regularly use are the sheet name with this you can indicate which sheet to open or to read from your Excel file if your Excel file has multiple sheets also the decimal can be very beneficial because some Excel files might use a comma as a as a decimal separator and that will mess up your data frame and another very useful one is the skip rows parameter so you can use this to give list with indexes or an INT so that is either skipping certain rows based on the index or for example the first five rows and that could be useful for when your Excel files has some headings for example or some additional information that you don't want to include in your data frame but I will now continue to show you some practical tips that I find very useful working with with Excel files on a regular basis so first we're going to clean up this path a bit and don't use the absolute buff but the relative path and since I'm in the data folder which is in the source folder I have to go up two directories and then from there on I can go to data folder Raw pump sensor data and then here I will access the the first date now let's run that and see if it still works so do it again boom still works okay so now we're using the relative path then we're going to look at two additional parameters from the read Excel method what you can see here is we're dealing with a timestamp so this is a Time series data set and what I typically like to do is I like to use the timestamp and set it as an index so that will make it easier down the road to make plots and then the x-axis will automatically be the date so you can make nice plots easier so what we can do is we can go into the read Excel method and we can include the parameter index column and what we can do with this one is we can specify which column we want to use as the index and since it's in the First Column we'll use the zero and now if we run this again check it boom we can see that time stamp is now removed from the columns and is set as an index so if we compare that to the data frame over here here we have an index from zero all the way to the final row over here and now we've set this as the index so the timestamp as the index another thing that we want to do so let me check let me just quickly undo that and check how the data is loaded so if I go into here do data info we can see okay we can see that the timestamp is pandas recognizes it as a daytime object that's also very important and sometimes this is not the case so what we could then do just to show you uh what that would look like is we can add another parameter and I will do that before and that is parse dates and here in the same way we can specify columns which should be parsed as a date so that it's a Dame daytime object we will include this as well just just to make sure now run it again and then if we show the data variable we can have a look at the timestamp and it's corrected as the index and is also parsed as a date just to force that let's make this all nice and clean call this opening a single file by the way the heading that I just did I have a snippet for that using all thread on my Mac if I do a hashtag and then a capital H this is what what I get so I like to use that in my codes to specify the sections but now let's just quickly explore this data so what we can do we can do data at 10 looks fine we can describe the data get some statistics and then what else we got we can now have another look at the info make sure that everything has the correct data type so as we can see they're all float 64s here which is to be expected except for the machine status which is a string so that is an object so that is all correct so there's a quick way to make sure the data is loaded correctly what sometimes can happen if you look into the Excel file over here is people sometimes mess with the Excel files and for example put a letter some somewhere or accidentally type somewhere and what then happens is pandas will read over all the columns and it will find a string variable for example and then it doesn't recognize it recognizes it as a float and then you cannot plot it or do any numerical computations with it before converting it so that's just a tip to to check then um what we can do we can do a quick transformation of the data by looking at this sensor 15 over here which basically has zero non-nil values which basically means if we look here at the DOT that is just an empty column and we can check that so that doesn't add any value to our data frame so what we can do is we can get rid of it and we do that by using the data dot drop so bonus dataframe dot drop method and then within that we specify the column that we want to get rid of so that would be sensor 15 and then we also have to specify the axis because we want to get rid of a column and not of a row so that is how that works let me quickly check so it's now 52 columns before it was 53 I guess 53 yeah so we got rid of one column and now in order to store that we just overwrite the data variable save it again then run it and now sensor 15 that column is dropped so if we now run the info again if we look over here you can see no more sensor 15. so that was a quick transformation and then let's take a quick look at the data and plotted plot the data what we can do is we can take for example the first sensor zero and then create a plot have a look at the data over here and just a quick heads up I am going to import some plot settings from the utility folder over here I also have a video on how to do this but basically these are these are some style settings for matplotlib that we can import from over here and then all we have to do is import the system then append the parent Pro folder to the path and then we import these settings and then when I plot look we get an instantly nice graph so this is a very nice utility method as mentioned I have another video where I explain how to do this then we can copy paste this line and for example look at the second sensor and we can already see this is quite a different type of sensor so a different range different periods so that's interesting all right what else we got we can also plot the data in a loop so I will quickly go over this so we can for example plot the first five columns import modplotlip pins by plus s PLT make sure to spell it right got that and then we can Loop over it and then we can have another look at the data so that was basically the main objective of what we wanted to achieve but we have looked at a single Excel file so just for the first day and what typically happens in a real world scenario is that people work on Excel files and they have multiple Excel files because as you probably know once an Excel file gets too big your computer can't handle it anymore it will get slow so but you will typically typically see if you have to work with with Excel files as an analyst or data scientist is that there are multiple Excel files maybe multiple logs in the case of this sensor or financial data for example sales data whatever what we would then like to do is we would like to combine all these Excel files into a single data frame so we can do our analysis using one script and going through it all at the same time without without having to create separate files so what I can also do is just make a copy of this file and then open up the second file and then do analysis like this but of course that's not what we want to do we want to use Python instead of excel because we want to be efficient we want to be faster we want to automate things so what we will then do is we look at how to combine multiple files so combining multiple Excel files and reading them at once so for that we're gonna have to import uh one additional Library which is called Globe you might have to pip install this as well so from Globe we import globe and basically how Globe works is we can search a directory for specific files file extensions and then list those and we can then use that to Loop over them and read them into pandas we first specify the path where Globe has to look and so this is the data folder so we go data and then raw and then pump sensor data so this is the folder that we want to check and then uh we want to get the files and we do that by running club and then we put in the path equal sign over here so we do both and then plus and then this is the syntax to search for Excel files so we basically say we wanna sorry I have to do it like this so we combine the path that we've specified above and then we want to search for all files that end with xlsx so that's how Globe works so if I run this buff for uh over here now with a variable with the path and then if I run the files we get an error because Globe is not defined yet because I have to run this first and then we can do it like this and then let's check what's in files and as you can see now we have a list so this is a list with all the file pads and I will now show you how we can use this to Loop over them and read them into one pandas data frame but first I would like to add one more thing and that is the command sorted or the method sorted in front of this so I think in this file over here it doesn't really matter because they are sorted already but in some cases that might not be the case and you want to make sure that your file names are sorted according to in this example the date I do notice one strange thing that there is this extra file over here and that is probably because I have to file open as you can see there is a somewhat of a temporary file that is created so make sure to close this file I won't save it and now this goes away so then let me quickly rerun that and now we have four instead of five so otherwise we would have probably gotten an error right so now let's continue and loop over the files to read them into one bonus data frame so we're going to use a list comprehension for this could be pretty Advanced if you don't know what it is you can look it up or just follow along so what we're going to do is we do opening closing brackets and then we do PD read Excel and then we specify F and also the index goal and then to be precise we're also going to do the the parse dates oh let me make sure this is a comma we do parse dates and then we also specify the first the zero column again and then we say 4 F in files so what we now have is We have basically a loop within a list comprehension and if I run this it will take F and loop over all those files so those will be the four different Excel files that we have and it will then call the PD read Excel function with these parameters that we've specified earlier so if I now run this it will do this one by one by one and it will basically output four different data frames that's not what we want we want to combine them so we take this and then we call another method from the panels Library which is concat so we do PD concat and then opening and closing parentheses and if I run this we get all the data nice into one data frame and we can see that it starts here on the 1st of April 12 o'clock and it goes all the way to the 4th of April also 12 o'clock so now we have all the data into one file one Panda's data frame and then we can store this as for example data all the data combines and then save this uh let me check so it will automatically format my code because I have black installed in vs code I also have a video on how to use black let me correct that data combined and as you can see now it's all nice into one data frame and now if we go back to the plotting section over here and we plot the data we can see that oh wait for example if you now take I'm just gonna Swap this out probably get an error because it's not defined before move it over here so as you can see now we have some more days within the data so now we get a different a different picture where instead where first sensor zero was just this straight line as you can see up here now we get some more variation same goes for sensor one we get some more clarity and then we can Loop over the first five columns or even do 10 whatever we can do it like this and we get a nice view of the data now the final thing I want to show you is how to export this data frame because what would typically happen is either you receive an Excel file file from a colleague for example and you do some Transformations and you send it back to your colleague in an Excel file so that's what I will show you now so export to xlsx what we do we take the data combined and then we run two Excel so this is a pandas method again and then we specify the path so again we go up up data and then this time we'll do interim so this could be an intermediate file and then I already have it here I say data combines and then we run this and then what it will do it will take this data frame which is appendless data frame and it will export it to an Excel file so now if you go into the interim folder over here there is this Excel file and if I refill this in the finder we can have a look at it at it over here and here you can see we have the same data frame that we just had but now in an Excel and notice how sensor 15 is not removed from here because we looped over the data again and we did not include this drop function over here if we would we could and we would have to do that within the list comprehension Loop so then we're going to read the Excel F index called parse dates and then we're gonna uh put the method at this point and I think if we do it like this this should work so this is a pretty Advanced we're doing some methods chaining over here but now if I run the data combined and then data combines and then info to check all the columns 15 should be gone so then if I close this out don't save this and then have the data combined export to excel again then we go back to excel refilling finder open it up and now it's gone so that's how we include that as well so if I go back and I save this then probably black will format this this even further but but as you can see this is getting pretty Advanced there are some a couple of methods chained together so to say but that is exporting to excel so that was the first way that you would normally work so you get an Excel do some Transformations give an Excel back to your colleague but a more common method at least for my kind of work is that you receive an Excel you do the Transformations and then you save it to CSV because in general these fees are a bit easier to handle also Less in file size so what we can now do is we do export to CSV and we're just going to copy this line over here paste it and then we're gonna change the to excel to 2 CSV and then make sure to also change the file extension run it again and then it will export it to CSV I know it already was in here but I just overwrite the file that is because I've run this code before to test it out of course but refill this in a finder and then we have to CSV file we can take a look at this if it loads how big is it oh there it goes so now we have a CSV file and for example we can show this better if we open this in Sublime for example let me check have it over here so nice and clean CSV file so that was the final thing that I want to show you so this turned out to be a lot more than just showing you how to read Excel files with python now if this video helped you out I would really appreciate it if you like this video And subscribe to the channel I'll be making more videos related to data science Python and machine learning so if that's something you're interested in you should definitely subscribe see you next time foreign
Info
Channel: Dave Ebbelaar
Views: 15,971
Rating: undefined out of 5
Keywords: read excel with python, datascience, excel, python, pandas, read excell with python, read excel iwth python, read excell iwth python, python tutorial, data analysis, how to, import excel in python, read excel, pandas tutorial, pandas dataframe, excel dataframe, vs code, excel file python, excel file pandas
Id: P6HCyxSyFpY
Channel Id: undefined
Length: 23min 23sec (1403 seconds)
Published: Sun Sep 04 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.