Automate Excel Work with Python and Pandas

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this video i'm going to show you how you can manipulate and automate excel work using python so we're going to be using pandas now the reason for that is because that although you can use other libraries to sort of manipulate the excel files as you go i much prefer to import the whole thing into pandas into a data frame that we can then do whatever we need to do with and then export it in so we're going to import it manipulate mess with it in pandas and then export it using pandas again to our desired format after we've done everything that we want so if we head over to my screen we'll see that i have some data up here there's columns and this is actually a csv file but i'll show you how to import if if as if it was an xlsx as well and this has some various columns that you might expect to see this is kind of like a maybe a order sheet where you've got a size a quantity the price some address data etc etc so what we're gonna do in this is we're going to be manipulating this file and i have another one which is very similar and i'll show you how to work with them together so this is actually a csv so to start working for this i'll make this one bigger so we can actually all see we need to import pandas now if you haven't got pandas installed just do pip install pandas it will get everything for you and then we want to import pandas as pd this is just the sort of standard way of doing it it makes it easier to work with now to get that data from this csv file bear in mind that this csv file is in the same directory as my pi file to get that into a data frame which is the way the pandas works with the data we need to give our data frame a name now throughout my examples i'm going to be using df but however if you're actually using this in your project make sure you give your data frames uh proper names that mean something so then we go equals and then we just do pd for pandas and we do read csv and we give it the csv file name data dot csv and that's it so we've now that is basically importing that whole csv file into a data frame that we can work with now there's so many more but there's so many benefits to doing it like this over trying to work with it in excel basically we can do so much more uh so many more rows and don't have self issues and we can just automate things uh and i'll show you as we go along so one of the first things that you may want to do is to actually check the data out so i'm going to go ahead and print df.head df of course being the data frame name that we've given it here if we run that we'll see that we get the sort of top rows of the data frame now the head is actually the default five rows but you can kind of get an idea of what's going on here now i can see that we have this id here and an extra index now if you don't want that extra index you just say that the index call for index column is equal to and in this case i'm going to set it to the id so now if i save and run again we can see i know it's gone slightly off the screen i'll make it one smaller there we go we can see we now we now have the id is the column as the index which is a bit more useful for us in that case we can also see what's at the bottom by doing dot tail and this will show us the last five rows of the file of the data frame there we've got those now one thing that you could quite conceivably want to know is all of the column names you can see that we don't have them all showing here when we print out and that's just because of the size of the text but if you've got large files you might want to know the column names and this will lead into our next point because we're going to actually rearrange the columns in this data frame so we can sort of manipulate it a bit more so let's do for c in df.columns let's print c and i'll run this and we can see that it's showing us all of the columns down here that are available in this file now you can see dates just been cut off the top there there we go the important thing here is df.columns so we could put this into a list and we could call it our c is equal to list and let's use the df.columns here and then let's print our list out so this gives us it's in a list form there we go now if you want to rearrange the file or have uh maybe remove some of the columns the easiest way to do that i find is just to copy this out and let's let's get rid of this actually now and let's put these in here and let's just call these the columns actually i'll call it something else um data calls and we'll paste our list in there now to rearrange it you basically just want to rearrange this list i'm just going to make this a bit more friendly to see here and what you can do then is you can just rearrange the order of this list as you need it to to actually rearrange the way the columns appear in the in the data frame so if you had a file like this and maybe you've got two or three of these files or however many and part of your job was to basically reorder them so they fitted into your data input type you could get pandas to do that for you really quickly and really easily so let's move currency to the end make sure we have our comma in there because this is a list and let's move the email address to being up the top here there we go and let's remove that and now what we want to do is i'm going to collapse this list just so it's out of the way we want to actually create a new data frame now you don't have to do this you can do in place if you want to actually just override your data frame but sometimes i like to actually create a new one separately so we've still got the old one that we can reference if we need to in this case i'm just going to overwrite it so i'm going to say df is equal to df.reindex and i'm going to say our columns that we want to give this new data frame is equal to the data columns that we put here so what it's doing is it's basically just going to create another data frame but in this case we're using the same name so it's probably going gonna it's going to override it and we want to use these columns in this order so now if i was to print df.head again and i'll do it above this as well let's close this terminal down so it's clearer we'll get the first one which has the id again this is kind of a bit squiff on my screen there we go id day etc and then we have the currency at the end where we put it and the email is now there where we put it so that's a nice and easy way just to re-order the columns of your data frame so another thing that you might want to do is you might want to say well i don't need all of this data i only need these columns and i don't i can just get rid of them now you can drop columns from a data frame if you want to that's absolutely fine but sometimes i find it easier just to do it slightly different way so we're going to just going to remove this bit of data here all of this chunk in the middle and we're going to create a new data frame with this part of the data only now to do that we're just going to remove this i'm going to say my new data frame again you could just do df again to override it if you wanted to but i'm going to do new data frame is equal to and then we just say df and then our data columns so all we're doing is we're saying let's create a new data frame using the information from this one which is up here that we've imported using only these columns so now if i print the new df excuse my typing we'll get a new data frame here with just those columns of data that we wanted note that i set up here the index column still id which is why that's always first if you wanted to move the id column around just get rid of this line and use the zero index zero one two three four five for the lines the row index so there we go that's another easy way of doing that so we've basically just chopped out a load of data that we don't need but if we look at this again we can see that the price has this dollar sign here which means that it won't be an integer which means that we can't actually do anything with that data so we want to be able to remove this dollar sign from the data frame column itself now to do that we need to reference the column so i'm going to say our new data frame price is equal to again new data frame price because we're just going to overwrite it so we want to say it's going to be equal to this now and now we want to do dot now this is going to basically do exactly what it says it's going to replace whatever we say here with whatever we do okay so it's just like um just like the normal replace in python then we give it an r and then we say what we want to replace which is the dollar sign a comma and then what we want to replace it with in this case nothing now that would work however i'm also going to turn this column into a floating point number at the same time so i'm going to say dot as type and it was a float so this whole thing is basically going to just saying this column is now going to be equal to this column because we're replacing it with the same data but we're taking away the dollar sign for nothing and we're changing it to a float so now if i print new df dot head we should be able to see that the dollar sign is gone and we have our price there and that's a floating point number so the next thing that you might want to do is you might want to add files together so we might have multiple file instances like i mentioned at the start that you might want to concatenate the whole lot together so you can do this it works best obviously when you have matching columns so if you don't quite have matching columns and you have multiple files import them all and then work the columns as i showed you before using this method here to get them into the same way and we can just basically add data frames together so we're going to concat the data frames together but the data frames that we're doing are actually going to be from multiple files so up here i'm going to do df2 again remember what i said about names is equal to pd.read.csv and i have another one here which is called this so what i'm going to do is i'm just going to remove this now because i don't need this but we are going to keep this part here but i'm just going to have it commented out for the moment so if i just say uh con df for df concatenated we want to do pd dot com cats and then open the brackets and then the square brackets and we just give it df and df2 now if i print the condf dot head so we can see what that looks like we're going to see that we just get the top lines again which is great but how do we know how many lines there are what we can do is we can just do a length on the index because the index will always have something in it this is a good way of doing it although there are other methods which may be better depending on what your data looks like so if we do the first one we say let's print the length of the first data frame dot index that should be 1000 lines there it is and then if we do it with the data frame that we have concatenated the df1 and df2 together we can see that we actually have 2000 lines now which means we've pushed them all together into one data frame which you could then export or do with whatever you wanted to so i'm going to get rid of that i'm going to move this new df under here i'm going to change it to equal the concatenated data frame so we're now going to have both of these two together there we go and put that in there and we're going to have all the data together which is our 2000 lines without the with the price corrected so let's just check that so let's print the concatenatedf dot head and you can put a number in here if you want to i'll do 10 this time so we get the top 10 lines and we should hopefully get the price here there we go we can see just at the end just behind my head there you can see they're just appearing up like we did with removing the dollar sign and changing it to a floating point number the next thing which is possibly going to be the most important is a pivot table so that's a very common excel thing that you need to do create a pivot table so you can sort of summarize your data a bit better so i'm going to show you how to do that in pandas and as you can tell by all of this it's going to be just as simple all we're going to do is we're going to create a new data frame so you know if you think like how when you pivot on excel it opens it up in a separate sheet we're kind of doing the same sort of thing except there's no sheets we're just creating a new data frame so i'm just going to call this pivot it's equal to and then we do pd.pivot table and we give it our data frame so we're going to be using the condf in this case now we need to specify the index and the values that we want to pivot onto so i'm going to say our index is equal to and let's pivot on to the size so we saw that we had those size those size values in our sheet so let's do size and now we need to give it some values so our values is going to be quantity and then we need to basically say how we want to summarize it so basically in excel you get the choice of counts uh sum and all that stuff now to do that we do we do need to import numpy that's nice and easy and if you've installed pandas you get numpy installed as well so so i'm going to do import numpy as np again this is just a standard a nice easy way of summarizing of writing it down to make it easier so all we do now is ag funk is equal to mp.sum so this line basically is just creating a new pivot a new pivot table in this data frame i'm calling pivot with the index of size and the values of quantity and the the function we're using is the sum so uh mp.sum so if i just print print our pivot table out now i can type we'll see that we get it down here so this is basically the sizes of all of the data from both of the sheets so if you were to do this manually let's say you had to do this task a couple of times a week or once a week but there was more sheets you'd have to open each one copy and paste the data over onto one big one and then pivot it in this we've just basically reading them into data frames and we've done it in well a few lines of code now you can actually change this index and value so let's make that a bit more interesting so let's say country so we can split it up by country as well so we're going to have our index it's going to be country and then size in that order and our values are going to be let's put this into our list as well quantity let's leave it at that for now so let's print that and we get the country first and then the size split if you put these the other way around you'll see that you get them the other way around essentially so you can have you'll get all the sizes and then the country split just like you would when you order it in excel when you move it up and down i want it the other way around though so i'm going to put that here country first and then size and i'm going to do quantity and then price now because we did this this will work let's run it so you can see that that is the sum of all of the price as well there um obviously the values are nonsense because i'm using fake data but we get that around that way too so if i was to actually remove this line um of the changing that we're going to get an error here because obviously you can't you can't uh you can't summarize in a sum function this without the without removing the side because it's a string otherwise and obviously you could mess around and you could put whichever data you wanted in here just like you do in excel so the next thing and the final thing that we're going to be looking at is doing a vlookup now this is obviously again a really common thing in excel so i have another data sheet here that i've got that i'm going to import i'm going to call it refs it's equal to again it's a csv file and it's going to be i think i called it references references.csv so let's keep our pivot but let's do that let's just print out our refs data frame for now and we'll see that i have an id which matches which we're going to match against and then a reference just a random code string here so what is what could be quite common that you may want to do is you might want to vlookup into your data first so then you have that available to you maybe you don't have the country and you need to put that in or maybe whatever whatever reason so what we can do to do that is we need to use df.merge now to merge the two together there are a few different types so there's a left a left merge it's very similar to sql queries there's a left merge there's an inner and there's an outer which is a right and you need to specify which one so there is a real python article here which i'll show you which summarizes the joins quite well and you can see that when you have your data frame one you want to do your left join and that joins the two like that i'll link to this article down below but this kind of makes it a bit more easy to understand what i'm doing so if we go back to our code we basically want to do our merge so i'm going to call this merged it's equal to because that's we're creating a new data frame that's going to have all the data on now the first one we want to do is our data frame one so that's going to be our condf because that's the one that we're going to put those references onto and then we do dot merge and then inside our first bracket is the refs because those that's where the data that we're taking so we're going to vlookup that data onto the first the first data frame now we want to do how is equal to left because i want to do a left join i want to join from my refs onto our conde condi f and if there are isn't if their matching values aren't there it will just show that they aren't there and then we want to do how uh sorry on is equal to id on equal to id because this is the column that we're going to be matching so now if i print our merged dot head and i'm going to say let's do 25 rows we should get some matched data so there we go so we can see that we've got our id here and then the reference that has matched from the second sheet so we've successfully vlooked up to that reference onto this sheet and again as i showed you how you can move columns around earlier you could do that too so let's just move some of this up now the last thing that you want to do is you want to export your new data that you've just worked all the way through and you want to save it to a new file now that's really easy in pandas all you need to do is let's do merge first is you can take the name and then you do dot 2 csv or x2 xl or whatever you want to do and then we'll give it a name and we'll just call this one merged dot csv now if i run this it's going to give us no output because we're not printing any but if we come to our file manager here we have our merge csv which has all of the information on it it's going to have 2 000 lines and we can see the references at the end there and if we scroll all the way down we can see down here we have no references because they did not match any i think we found well there's one there that that matched so you see where it didn't match any it um didn't add it on and that's because we did that left join if we did an inner join we would only get the ones that where it matched which can also be useful for you um it just so happens is that matches are all the top because the the references that match the ids are actually from the first data frame so there you go so that's going to do it for this one hopefully you found it useful drop me a comment down below like the video etc etc thank you very much for watching i've been john and i will see you in the next one goodbye
Info
Channel: John Watson Rooney
Views: 89,653
Rating: undefined out of 5
Keywords: python automation, python excel, python pandas library, python tutorial excel, python excel automation, automate excel with python
Id: CnCBbiLw-HU
Channel Id: undefined
Length: 21min 28sec (1288 seconds)
Published: Sun Jun 27 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.