Hey guys,
In my folder, I have over 300 Excel files. Each Excel file represents the sale records
by category for a specific date. The format within each Excel file is the same. For example, in column A, I have the different
Product categories, and in column B, I have the corresponding sales amount for the respective
date. I also have a master file that contains for
each month a dedicated worksheet. I intend now to copy and paste the values
from the daily sales reports into this master workbook. If you do it manually, you need to open up
each Excel file and copy and paste the values over to the master workbook. Let us see how we can automate this task using
Python. To show you the final outcome, I will execute
the Python script, which we will write in this tutorial. The execution of the script takes only 2 seconds. As a result, we will get a new Excel File. If I open up this workbook, we can see that
all values from the Excel files have pasted into the master file. We could also validate this by opening up
one of the daily sales reports and comparing the figures. Ok, and without further ado, let us get started. To read and write values from Excel, we will
be using the openpyxl library. You can install the library by typing' pip
install openpyxl' in your command prompt or terminal. For the sake of explanation, I have created
the following Jupyter Notebook. However, you will also find the download link
to the complete source code and all the sample excel files in the description box. Before diving into the code, let me explain
my approach to this automation tasks. I have clustered the solution into three steps. As the first steps, we would need to list
all excel files from the given source directory. Once we have the excel file paths, I will
iterate over each file. From each worksheet, I will then retrieve
the values and store them in a dictionary. The dictionary will look like this, where
the key is the date of the Report, and the corresponding sales are held in a list. Once we have the dictionary with all values
from the daily reports, I will iterate over all worksheets in the master workbook. For each worksheet, I will then iterate over
the defined cell range and compare the date with the key of our dictionary. If we have a match, I will paste the values
into the corresponding row of the master file. Ok, now that we know the overall approach,
let us get started by importing the libraries. From pathlib, I will import Path, and from
openpyxl I will import load_workbook & workbook. Pathlib is a standard python module, so no
additional installation is required. Next, let us move on to the first step. To get the Path to all excel files, I am using
Path and glob. In my case, the source directory, which contains
all the daily sales reports, is called "Daily Reports". As my Jupyter Notebook is in the same directory,
I only need to type the folder's name here. However, you can, of course, also change this
Path to any other folder. Now that we have the excel file paths in a
list, I can iterate over each file. A short remark regarding the performance. In general, if you are iterating over a thousand
or millions of items, you want to use a python set or dictionary instead of a list. A dictionary is a hash table, so it is really
fast to find the keys. However, we only have 365 files for this particular
example, so there is no noticeable impact on the performance. Having said that, our intention to loop over
the excel file was to create a dictionary with the date as the key. Therefore, I will use the pathlib method 'steam'
to return the filename without the xlsx extension. For further cleanup, I will also replace Report
with nothing. After our cleanup, we will have our report
date, which will be later used as a key to match the date in the Masterfile. As the next step, I will open each excel file
in memory. The date we are interested in is located in
the first worksheet in the cell range B2:B19. Unfortunately, it is not very straightforward
to get the values from the given cell range. Because if I iterate over the cells, openpyxl
will return me a tuple, which looks like this. To access the cell values, I would need to
loop over those tuples again. Once I have the cell values, I will append
them to our list. As we have now the list of values and the
corresponding reporting date, I will add this information to our dictionary. The dictionary will look like this, where
we have the date as a key followed by the sales records. Executing this piece of code took only around
2 seconds. And to validate, I could also print out the
length of the dictionary, which is 365. So exactly the number of files I had in the
source directory. Ok, now that we have the values stored in
a dictionary, let us move on to the last step. First, I will open up the Master template. Within the master template, I will iterate
over each worksheet. I want to check in column B within the worksheet
if those values, so our dates, are inside our dictionary. Therefore, I will first determine the excel
cell range I want to check. The data is in column B, and the first date
is in row number 3. Depending on the month, the last row might
be different. Therefore, I will check the length of column
B. This line of code will return the last row. For January, this would be 33. So the range will be B3:B33. As before, I can now iterate over this cell
range, and once again, I will only get back the tuple in the first step, so that is why
I need to loop over the cells again to access the values of the cells. If we run the code until here, we will see
that we got back the dates from each worksheet. I will now check if this date is available
in the dictionary we have created in step 2. If so, I want to write the corresponding sales
values, so our list, in that specific row. The problem is that I can not simply write
back the complete list into one cell. I actually need to iterate over our python
list. And each value inside the list needs to be
inserted into a different column. To do so, I am using enumerate to have access
to the current index while I am iterating over the list. So, let's imagine we are finding January first
2021 in our dictionary. The corresponding list for that date looks
like this. Now I am iterating over that list, and for
each iteration, I am offsetting the column. So, the starting column was B, which means
in the first iteration, I am adding 1 to the column so that openpyxl will paste the value
in column C. The next item in the list will be pasted into column D and so on so forth. Once that is done, I can save the Masterfile. After executing this line, we will get back
our new Excel File. This video was just one automation example. Your specific use case might be very different,
but I hope you got an idea of how to solve your particular task. Last but not least, I want to mention that
there are, of course, also other Python libraries to interact with Excel. I have the same example also coded out by
using xlwings. However, openpyxl has the advantage that the
content of the Excel files is loaded directly into the computer memory without creating
an Excel instance. That is why the execution took only 2 seconds. Whereas with xlwings, you would need to open
up each workbook to access the values. This approach might be not very practical
if you want to retrieve values from multiple excel files. Ok, guys, and that's it for this tutorial. As always, if you have any questions, please
let me know in the comments. Thanks for watching and see you next time.