Combine Data from Multiple PDF Files into a Single Excel File

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
let's just say that I'm working in this particular folder which is where I have three PDFs sales 2003 and 2004 sales and 2005 sales if I quickly open one of the files you're going to see that the data looks something like this pretty standard columns I have the date column sales rep customer amount profit and the region and this data obviously belongs to 2003 the column names are going to remain the same in all the PDFs however I'd like to combine the data from all of these PDFs and put them together in a single Excel file at one point in time you feel like hey perhaps I could just do that manually because it's going to be easier so what you do is you select all of this data press contrl C on the data and when you go to your Excel and you start pasting the data the data kind of appears in a very crappy format and you kind of give up on the hope that you can ever get this data into Excel in this video I'm going to show you an awesome trick to get the data from all of these PDFs into Excel no matter what columns you have and no matter how many pages of PDFs are there all right no further Ado let's go check it out all right fellas I'm in Excel and I have closed all the PDFs they are there in one single folder and I will use my Excel to connect that folder how do I do that I'm going to go to the data tab from there I'll say get data from a file from a folder and that is my sales data PDF so I'm just going to click right here it doesn't show any of the files that I have but the files are right there so I'm going to click on open and now I get to see the preview of all the files there are at the moments let's just click on the transform data and this will land us inside of the power query window once we there inside of the power query window you can take a look that the content column actually shows you that these are all the PDFs so this is 2003 4 and 5 the first order of business is to get the PDF from the binary that you can see it right here into a readable format so that we are able to access the data and the function to do that is nothing but pdf. tables let's just use that so to begin with I'm going to go ahead in the add columns Tab and make a custom column and I'm going to feed the binary which is nothing but the PDF into that formula so I'm going to say something like P BDF do tables uh note that P and the t is uppercase and then inside of that it's asking you for the BDF which should be as a binary sure enough we have three binaries right here they are there in the content column and that is going to go right here so I'm going to feed the content column right here close the brackets and press okay and now what we get is nothing but a table if you peek into the table the table is going to show you that here are all the pages that I was able to get in the PDF so the first PDF has got three pages and here is the data of those three pages there is also one table that it has found and this this is the second PDF third PDF so in the preview you're going to take a look at all the pages there are from here let's just expand this so I'm just going to click on expand um uncheck the name prefix all the columns are good click on okay and now we have been able to expand that now if you just go ahead to the far right you're going to see that we have a column called data which is where I will take a look at the data of that particular PDF so Page by page it is going to show me the data that it has been able to capture now one very important thing to understand is that the PDF should contain the data in the form of some sort of table not a picture if your data is in the form of picture I don't believe that pdf. tables function can read that so that's one thing that you need to keep in mind all right now once we've been able to get the data so if you take a look at the first PDF which is where is that U right here 2003 PDF and it has got four different rows and if I just go to the far right you're going to see that the first three rows are the pages of the PDF that means the data from the page and the last row is the data from the table the way it works is that obviously you understand this is the data of the first page the second page the third page if power query has the opportunity to combine all the data of the three pages and read it in a single table it is going to do that so here if you take a look at the table the table is actually reading the data off of all the pages that means page one to page three and that's the data combined to you in a table so should you want to read the table directly you can do that if you want to read the pages directly you can do that but don't do both because you're going to create duplicacy so for now I'm going to read the pages so I'm just going to go ahead in this column and I'm going to apply a filter that hey keep only the page data click on okay and that's what we get now these are all the pages from all the PDFs at the moment the problem is that the headers of these um PDFs are not promoted to the first row and that's exactly what I'm going to do so I'm going to go ahead and probably create another custom column and I'm going to say something like table do promote headers and I'm going to start the bracket and I'm going to feed the feed the formula feed actually the column which is the data column I'm going to feed it right here here are the tables for which you need to promote the headers I'm going to do that just click right here click on okay and now what we get is a table but the difference is that here the headers are promoted and here the headers are not promoted all right before we expand all of these tables that we have in the custom column let's just remove the columns that we do not need at the moment so I'm going to click on custom and I'm going to click on the name of the PDF which is what I want along with the data right click and I'll say remove other columns and that's what we get now if I just click on the expand button right here you're going to see that I get all the names of the columns sure enough but if I click on okay all the names of the columns just got hardcoded within the formula so date got hardcoded sales rep got hardcoded customer got hardcoded all of that if your column names are not likely to change this is absolutely fine and you can go ahead with that but if you believe that the column names are going to change then this approach is not going to work because should the column names change your query is going to break and you certainly do not want that so we going to delete that step and make our column names Dynamic so what I'm going to do is I'm going to get rid of this particular step which we will do it in a bit again now from all of these tables right here which we have it in the custom I would like to extract all the names of the columns and then feed that in my formula so I'm going to click on the FX button right here which refers back to the previous step which is right here which which is remove other columns from remove other columns I'm looking to combine all the tables in the custom column and that's what I'm going to write so square bracket I'll write custom and then I will uh wrap this in the table. combine function start the bracket and close the bracket in the end now this is going to give me all the tables combined but hey I don't want the tables to be combined I just want the names of the columns so we need another wrapper function which is nothing but table do column names which is going to get the names of all the columns of the combined tables press enter and here are all the possible columns in all the PDFs on all the pages that we have this is incredible now let's just call this particular list as column names so I'm just going to call this as column names and now is when when we use the expand features right so I'm just going to go back to the previous step right here and I'm going to expand this particular Uh custom column right here so I click on expand all of the columns click on okay sure enough the column names are hardcoded you you can see that all the column names do get hardcoded but now we have a step which is nothing but column names that is actually giving us the columns and they are not hardcoded so what I'm going to do is I'm going to quickly move this step up and expanded custom uh goes down and here is the list of all the columns and let's just fix this particular step so in this step the table that I'm trying to work with is this particular table and let's just reference that so I'm just going to say remove other columns the column that I'm trying to expand is this column which is custom which is also what is hardcoded in the formula which is fine but I don't want this hard coding to which I will just delete all of these column names and I will just reference the column names that I have just created right here nothing obviously is going to change but now that we have the column names Dynamic and should the column names get added or deleted in the PDF it is going to start to reflect in our combined data as well and that's been it I just click on on the Home tab and I say close and load the data and the data gets loaded into my Excel from all the PDFs that you had that's been it let me know how did you find this one should you have any questions or any peculiar cases while combining data from the PDFs please do put them down in the comments below i' be glad to take a look at them in the end a big big shout out about my M course on power query this is the exact kind of problems perhaps even more difficult kind of problems that we solve in the course I teach you the fundamental logic of thinking about M and then building on that knowledge to solve more harder more difficult data cleaning problems using of course the M language in power query in case you're interested the link is there in the description of the video that's been it thank you so much for watching this video and bye from London I'm going to catch you guys in the next video [Music] bye
Info
Channel: Goodly
Views: 14,415
Rating: undefined out of 5
Keywords: goodly, chandeep chhabra, power query goodly, combine pdf files into one, bulk combine PDF files, import pdf to excel, combine PDF files, Excel, Power Query, data extraction, PDF data, data manipulation, data cleaning, data processing, bulk combine PDF, Microsoft Excel, PDF to Excel, data merge, data formatting, data analysis, spreadsheet, Excel tips, PDF conversion, document management, combine multiple files into one file
Id: l6HoZC16Xng
Channel Id: undefined
Length: 8min 59sec (539 seconds)
Published: Wed Feb 21 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.