Compare Two Excel Sheets with Different Number of Rows and Find Differences using Python

Video Statistics and Information

Video

Captions Word Cloud

Captions

Hey guys, In this video, I will show you how to use Python to compare two Excel sheets and find their differences. Like in this example, where I have a list of names and next to it the expected arrival date. Now, at some point, somebody might have updated this list, and now you want to find out which cells have been changed. This is exactly what we are going to do in this video. OK, and without further ado, let us get started. First things first, we need to install the required dependencies. To manipulate the excel files, we will use pandas and xlwings. When working with Excel files, pandas also relies on the optional dependency openpyxl. You can install all modules by opening your command prompt or terminal and running pip install followed by package name. So, pip install openpyxl, pandas and xlwings. With that in place, I will import pandas as pd and xlwings as xw. As we are also going to deal with different file paths, I will import path from pathlib. Pathlib is a core python module. So, no additional installation is required. With that out of the way, let us look at the first example. This example will be perfect for comparing two datasets with the same shape. By shape, I mean the same amount of columns and rows. Like in my two excel files. On the left side is the initial spreadsheet, and next to it is the updated version. Only the values within each row might have been updated. However, no additional rows or entries have been added. So, both files have three columns and 101 numbers of rows. I know this might not always be the case in a real-life situation. So after this example, we have a look at a more complicated spreadsheet. With that said, let us define the file path to the excel files. Both workbooks are located in the current working directory of this Jupyter Notebook in a folder called 'Same_Shape'. To create the file paths, I am using the pathlib module. Once we have the file paths defined, let us bring the Excel data into a pandas dataframe. You can do this by typing pd.read_excel followed by the file path. And just to validate that it worked, I will also print out the first rows. After we have our two dataframes, let us compare the data. To do so, we will use the pandas compare method. However, as mentioned already, this method can only compare identically-labelled Dataframe objects; this means dataframes with identical row and column labels. So, this method would work in our example, and it is actually very straightforward. In my case, I will compare the updated dataframe with the initial excel file, and I want to align the differences on the columns. After running this cell, we will get back the following dataframe. Pandas shows us now all the differences. For example, in the updated excel file, the name troi has been changed, as well as some of the arrival dates. The compare method also takes in additional arguments, which I will show you in the following cells. For example, instead of aligning the difference on columns, we can also stack them on rows. In that case, you want to set align axis to 0. After running this cell, the output will look like this. Instead of only showing the differences, you could keep the original number of rows and columns by setting keep_shape to true. Pandas will now return the following dataframe. NaN or NaT means that there is no difference. But at index position 99, the two dataframes differ. And last but not least, you could also put the data side-by-side by setting both keep_shape & keep_eqaul to true. In this dataframe, it is not very easy to spot the difference, but perhaps you need this kind of overview. You could also export the dataframe by running the 'to_excel' method, followed by the output file name. In my case, I will save the first example, where we have aligned the differences on columns. After executing this cell, we will have a new spreadsheet, which looks like this. When I created this tutorial, I thought it might also be cool to take the updated Excel file, highlight the cells with a different background colour, and even put a little comment in the cell that shows the old value. So, that is why I also came up with a different solution. Instead of pandas, we will be using the xlwings library. Let me walk you through the code. I can use xlwings to create a new Excel instance, which I will call 'app'. By setting visible to 'False', Excel will start in the background. I will open both workbooks within our Excel instance, so the initial version and the updated one. The data with the arrival dates is located in the first worksheet in both files. OK, and here comes the trick. With xlwings, we can get back the used range of our worksheet. So, this will return a range object, with the range A1 to C101. Within that range, I can now iterate over each cell. So, starting from cell A1. As the shape of both spreadsheets is the same, I can also get the old value from the initial workbook by using the current cell row & column number. For cell A1, this would translate to the first row in the first column. Now I can compare the existing value with the old one. If the values are not matching, we have identified a difference. In that case, I would like to add a comment to the cell. In the comment, I will write the name of the initial workbook and the previous cell value by using an f-string. But please note that adding comments is not yet fully implemented in xlwings, which means I am using the underlying API here. This line of code should work fine on Windows, but I have not tested it on a mac. If you are on a mac and getting an error, you might want to comment out this line. With that said, let me also change the background colour to a light red. We could now overwrite the updated version, but I will just save it as a new spreadsheet and call the file 'difference highlighted.xlsx'. After executing this cell, we have our new workbook. All the differences are highlighted in red, and additionally, we have got a comment with the previous value. OK, now let us move on to a more complicated example. This time, we want to compare two datasets with different numbers of rows. The spreadsheets are located in the folder 'Different Shape'. Let me open with the updated version to show my modifications. On top of the previous changes, I have also inserted a new blank row and also a new entry with the arrival date of Peter Parker. We can also compare the shapes of both dataframes to validate that they are indeed different. Therefore, we cannot use the pandas compare method. Instead, we will merge the dataframes to show the difference. But before merging the dataframes, let me reset the index from the updated dataframe. This bit of information will be very important later. I will show you why in just a moment. So, let us now merge the two dataframes. If you have worked with SQL before, you might be familiar with the different merging methods. We could perform a left, right, outer or inner merge. I have created the following image to illustrate the different techniques. In our case, I want to perform an outer merge so that we keep the rows of both dataframes. The trick is to use the indicator argument. This new column will indicate whether the row only exists in the right or left dataframe or if it exists in both datasets. If we only want to show the difference, we could now just filter out the entries which exist in both dataframes. Now the reaming table will show us the difference. As an example, this entry does only exist in the initial dataframe. In the updated version, Troi was written with an 'i'. Even though we now know the difference, I think this table might not be handy, as we will still need to check what has been changed manually. Therefore, I came up with a solution to highlight also our Excel rows. For this, we need to know the row number, which includes a difference. So, let me query our dataframe and only show the ones that exist in our updated version. Those are all the entries where we differ from the original excel file. This dataframe also contains the initial index. Let me convert this index column to a python list. By default, pandas will convert the values to floats. However, I would like to get the integer values. We can do this by using a list comprehension. After executing this cell, we will get back to the following list. However, this list does not yet represent the excel row numbers. The pandas' index starts from 0. And in the excel sheet, our data without the headers starts from row number 2. So, we need to add 2 to every value in our list. As before, I will use list comprehension to do that. As a result, we have a list with all the row numbers we want to highlight in Excel. And to highlight the rows, we will be using xlwings again. After initializing the Excel instance, I will open up the updated version. As before, I can get the used range from the first worksheet. This time, the used range is from A1 to C103, as I have added two rows. We can now loop over every row in the used range. If the row number is in our list, I will apply a red background colour to that row. And last but not least, I will save the workbook as a new file. After executing this cell, we can see the new workbook in our folder. And if we look at the file, we will notice that xlwings has highlighted every row that has been changed. OK, guys, and that is all for this tutorial. As always, thanks for watching and see you in the next video.

Info

Channel: Coding Is Fun

Views: 56,154

Rating: undefined out of 5

Keywords: coding is fun, compare dataframes, compare dataframes different number of rows, dataset comparison, excel sheet comparison, compare sheets, compare excel, find difference excel, pandas compare method, pandas compare function, pandas compare tutorial, xlwings tutorial, highlight difference, microsoft excel, compare two sheets in excel

Id: D7dEQ9LI-8A

Channel Id: undefined

Length: 10min 38sec (638 seconds)

Published: Sun Apr 03 2022