Hey guys,
In this video, I will show you how to use Python to compare two Excel sheets and find
their differences. Like in this example, where I have a list
of names and next to it the expected arrival date. Now, at some point, somebody might have updated
this list, and now you want to find out which cells have been changed. This is exactly what we are going to do in
this video. OK, and without further ado, let us get started. First things first, we need to install the
required dependencies. To manipulate the excel files, we will use
pandas and xlwings. When working with Excel files, pandas also
relies on the optional dependency openpyxl. You can install all modules by opening your
command prompt or terminal and running pip install followed by package name. So, pip install openpyxl, pandas and xlwings. With that in place, I will import pandas as
pd and xlwings as xw. As we are also going to deal with different
file paths, I will import path from pathlib. Pathlib is a core python module. So, no additional installation is required. With that out of the way, let us look at the
first example. This example will be perfect for comparing
two datasets with the same shape. By shape, I mean the same amount of columns
and rows. Like in my two excel files. On the left side is the initial spreadsheet,
and next to it is the updated version. Only the values within each row might have
been updated. However, no additional rows or entries have
been added. So, both files have three columns and 101
numbers of rows. I know this might not always be the case in
a real-life situation. So after this example, we have a look at a
more complicated spreadsheet. With that said, let us define the file path
to the excel files. Both workbooks are located in the current
working directory of this Jupyter Notebook in a folder called 'Same_Shape'. To create the file paths, I am using the pathlib
module. Once we have the file paths defined, let us
bring the Excel data into a pandas dataframe. You can do this by typing pd.read_excel followed
by the file path. And just to validate that it worked, I will
also print out the first rows. After we have our two dataframes, let us compare
the data. To do so, we will use the pandas compare method. However, as mentioned already, this method
can only compare identically-labelled Dataframe objects; this means dataframes with identical
row and column labels. So, this method would work in our example,
and it is actually very straightforward. In my case, I will compare the updated dataframe
with the initial excel file, and I want to align the differences on the columns. After running this cell, we will get back
the following dataframe. Pandas shows us now all the differences. For example, in the updated excel file, the
name troi has been changed, as well as some of the arrival dates. The compare method also takes in additional
arguments, which I will show you in the following cells. For example, instead of aligning the difference
on columns, we can also stack them on rows. In that case, you want to set align axis to
0. After running this cell, the output will look
like this. Instead of only showing the differences, you
could keep the original number of rows and columns by setting keep_shape to true. Pandas will now return the following dataframe. NaN or NaT means that there is no difference. But at index position 99, the two dataframes
differ. And last but not least, you could also put
the data side-by-side by setting both keep_shape & keep_eqaul to true. In this dataframe, it is not very easy to
spot the difference, but perhaps you need this kind of overview. You could also export the dataframe by running
the 'to_excel' method, followed by the output file name. In my case, I will save the first example,
where we have aligned the differences on columns. After executing this cell, we will have a
new spreadsheet, which looks like this. When I created this tutorial, I thought it
might also be cool to take the updated Excel file, highlight the cells with a different
background colour, and even put a little comment in the cell that shows the old value. So, that is why I also came up with a different
solution. Instead of pandas, we will be using the xlwings
library. Let me walk you through the code. I can use xlwings to create a new Excel instance,
which I will call 'app'. By setting visible to 'False', Excel will
start in the background. I will open both workbooks within our Excel
instance, so the initial version and the updated one. The data with the arrival dates is located
in the first worksheet in both files. OK, and here comes the trick. With xlwings, we can get back the used range
of our worksheet. So, this will return a range object, with
the range A1 to C101. Within that range, I can now iterate over
each cell. So, starting from cell A1. As the shape of both spreadsheets is the same,
I can also get the old value from the initial workbook by using the current cell row & column
number. For cell A1, this would translate to the first
row in the first column. Now I can compare the existing value with
the old one. If the values are not matching, we have identified
a difference. In that case, I would like to add a comment
to the cell. In the comment, I will write the name of the
initial workbook and the previous cell value by using an f-string. But please note that adding comments is not
yet fully implemented in xlwings, which means I am using the underlying API here. This line of code should work fine on Windows,
but I have not tested it on a mac. If you are on a mac and getting an error,
you might want to comment out this line. With that said, let me also change the background
colour to a light red. We could now overwrite the updated version,
but I will just save it as a new spreadsheet and call the file 'difference highlighted.xlsx'. After executing this cell, we have our new
workbook. All the differences are highlighted in red,
and additionally, we have got a comment with the previous value. OK, now let us move on to a more complicated
example. This time, we want to compare two datasets
with different numbers of rows. The spreadsheets are located in the folder
'Different Shape'. Let me open with the updated version to show
my modifications. On top of the previous changes, I have also
inserted a new blank row and also a new entry with the arrival date of Peter Parker. We can also compare the shapes of both dataframes
to validate that they are indeed different. Therefore, we cannot use the pandas compare
method. Instead, we will merge the dataframes to show
the difference. But before merging the dataframes, let me
reset the index from the updated dataframe. This bit of information will be very important
later. I will show you why in just a moment. So, let us now merge the two dataframes. If you have worked with SQL before, you might
be familiar with the different merging methods. We could perform a left, right, outer or inner
merge. I have created the following image to illustrate
the different techniques. In our case, I want to perform an outer merge
so that we keep the rows of both dataframes. The trick is to use the indicator argument. This new column will indicate whether the
row only exists in the right or left dataframe or if it exists in both datasets. If we only want to show the difference, we
could now just filter out the entries which exist in both dataframes. Now the reaming table will show us the difference. As an example, this entry does only exist
in the initial dataframe. In the updated version, Troi was written with
an 'i'. Even though we now know the difference, I
think this table might not be handy, as we will still need to check what has been changed
manually. Therefore, I came up with a solution to highlight
also our Excel rows. For this, we need to know the row number,
which includes a difference. So, let me query our dataframe and only show
the ones that exist in our updated version. Those are all the entries where we differ
from the original excel file. This dataframe also contains the initial index. Let me convert this index column to a python
list. By default, pandas will convert the values
to floats. However, I would like to get the integer values. We can do this by using a list comprehension. After executing this cell, we will get back
to the following list. However, this list does not yet represent
the excel row numbers. The pandas' index starts from 0. And in the excel sheet, our data without the
headers starts from row number 2. So, we need to add 2 to every value in our
list. As before, I will use list comprehension to
do that. As a result, we have a list with all the row
numbers we want to highlight in Excel. And to highlight the rows, we will be using
xlwings again. After initializing the Excel instance, I will
open up the updated version. As before, I can get the used range from the
first worksheet. This time, the used range is from A1 to C103,
as I have added two rows. We can now loop over every row in the used
range. If the row number is in our list, I will apply
a red background colour to that row. And last but not least, I will save the workbook
as a new file. After executing this cell, we can see the
new workbook in our folder. And if we look at the file, we will notice
that xlwings has highlighted every row that has been changed. OK, guys, and that is all for this tutorial. As always, thanks for watching and see you
in the next video.