Python: Copy Data From Multiple Files to Master File | Read/Write Closed Excel Files Using Openpyxl

Video Statistics and Information

Video

Captions Word Cloud

Captions

Hey guys, In my folder, I have over 300 Excel files. Each Excel file represents the sale records by category for a specific date. The format within each Excel file is the same. For example, in column A, I have the different Product categories, and in column B, I have the corresponding sales amount for the respective date. I also have a master file that contains for each month a dedicated worksheet. I intend now to copy and paste the values from the daily sales reports into this master workbook. If you do it manually, you need to open up each Excel file and copy and paste the values over to the master workbook. Let us see how we can automate this task using Python. To show you the final outcome, I will execute the Python script, which we will write in this tutorial. The execution of the script takes only 2 seconds. As a result, we will get a new Excel File. If I open up this workbook, we can see that all values from the Excel files have pasted into the master file. We could also validate this by opening up one of the daily sales reports and comparing the figures. Ok, and without further ado, let us get started. To read and write values from Excel, we will be using the openpyxl library. You can install the library by typing' pip install openpyxl' in your command prompt or terminal. For the sake of explanation, I have created the following Jupyter Notebook. However, you will also find the download link to the complete source code and all the sample excel files in the description box. Before diving into the code, let me explain my approach to this automation tasks. I have clustered the solution into three steps. As the first steps, we would need to list all excel files from the given source directory. Once we have the excel file paths, I will iterate over each file. From each worksheet, I will then retrieve the values and store them in a dictionary. The dictionary will look like this, where the key is the date of the Report, and the corresponding sales are held in a list. Once we have the dictionary with all values from the daily reports, I will iterate over all worksheets in the master workbook. For each worksheet, I will then iterate over the defined cell range and compare the date with the key of our dictionary. If we have a match, I will paste the values into the corresponding row of the master file. Ok, now that we know the overall approach, let us get started by importing the libraries. From pathlib, I will import Path, and from openpyxl I will import load_workbook & workbook. Pathlib is a standard python module, so no additional installation is required. Next, let us move on to the first step. To get the Path to all excel files, I am using Path and glob. In my case, the source directory, which contains all the daily sales reports, is called "Daily Reports". As my Jupyter Notebook is in the same directory, I only need to type the folder's name here. However, you can, of course, also change this Path to any other folder. Now that we have the excel file paths in a list, I can iterate over each file. A short remark regarding the performance. In general, if you are iterating over a thousand or millions of items, you want to use a python set or dictionary instead of a list. A dictionary is a hash table, so it is really fast to find the keys. However, we only have 365 files for this particular example, so there is no noticeable impact on the performance. Having said that, our intention to loop over the excel file was to create a dictionary with the date as the key. Therefore, I will use the pathlib method 'steam' to return the filename without the xlsx extension. For further cleanup, I will also replace Report with nothing. After our cleanup, we will have our report date, which will be later used as a key to match the date in the Masterfile. As the next step, I will open each excel file in memory. The date we are interested in is located in the first worksheet in the cell range B2:B19. Unfortunately, it is not very straightforward to get the values from the given cell range. Because if I iterate over the cells, openpyxl will return me a tuple, which looks like this. To access the cell values, I would need to loop over those tuples again. Once I have the cell values, I will append them to our list. As we have now the list of values and the corresponding reporting date, I will add this information to our dictionary. The dictionary will look like this, where we have the date as a key followed by the sales records. Executing this piece of code took only around 2 seconds. And to validate, I could also print out the length of the dictionary, which is 365. So exactly the number of files I had in the source directory. Ok, now that we have the values stored in a dictionary, let us move on to the last step. First, I will open up the Master template. Within the master template, I will iterate over each worksheet. I want to check in column B within the worksheet if those values, so our dates, are inside our dictionary. Therefore, I will first determine the excel cell range I want to check. The data is in column B, and the first date is in row number 3. Depending on the month, the last row might be different. Therefore, I will check the length of column B. This line of code will return the last row. For January, this would be 33. So the range will be B3:B33. As before, I can now iterate over this cell range, and once again, I will only get back the tuple in the first step, so that is why I need to loop over the cells again to access the values of the cells. If we run the code until here, we will see that we got back the dates from each worksheet. I will now check if this date is available in the dictionary we have created in step 2. If so, I want to write the corresponding sales values, so our list, in that specific row. The problem is that I can not simply write back the complete list into one cell. I actually need to iterate over our python list. And each value inside the list needs to be inserted into a different column. To do so, I am using enumerate to have access to the current index while I am iterating over the list. So, let's imagine we are finding January first 2021 in our dictionary. The corresponding list for that date looks like this. Now I am iterating over that list, and for each iteration, I am offsetting the column. So, the starting column was B, which means in the first iteration, I am adding 1 to the column so that openpyxl will paste the value in column C. The next item in the list will be pasted into column D and so on so forth. Once that is done, I can save the Masterfile. After executing this line, we will get back our new Excel File. This video was just one automation example. Your specific use case might be very different, but I hope you got an idea of how to solve your particular task. Last but not least, I want to mention that there are, of course, also other Python libraries to interact with Excel. I have the same example also coded out by using xlwings. However, openpyxl has the advantage that the content of the Excel files is loaded directly into the computer memory without creating an Excel instance. That is why the execution took only 2 seconds. Whereas with xlwings, you would need to open up each workbook to access the values. This approach might be not very practical if you want to retrieve values from multiple excel files. Ok, guys, and that's it for this tutorial. As always, if you have any questions, please let me know in the comments. Thanks for watching and see you next time.

Info

Channel: Coding Is Fun

Views: 36,133

Rating: undefined out of 5

Keywords: coding is fun, automate excel, automate excel using python, automate excel with python, combine workbooks, combine excel files, consolidate excel files, consolidate excel data, merge excel files, merge excel workbooks, read write excel files using python, openpyxl tutorial, python excel tutrial, excel automation idea, python automation idea, python tutorial, read excel in python, write to excel in python

Id: WlQ7t0fC5Gs

Channel Id: undefined

Length: 7min 37sec (457 seconds)

Published: Sat Jul 17 2021