How to "automatically" extract data from a messy PDF table to Excel

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
so one of my clients emailed me this PDF and you can see that it has these pricing tables multiple of them on the page and I wanted to get all of this data into Power query for further analysis but the problem is when I try to load this page into Power query it treats this entire two tables as one table combining all of them into a giant data set and in this video we are going to use power query to solve this problem you don't want to miss the ending because I'm going to show you how we can actually create a custom function in power query so that next time you have to cut a table in the half you can use that function instead of repeating all these steps let's go I'll open a brand new Excel file and go to the data ribbon get data from file from PDF let's point to the PDF and import that and you can see the issue here this table table 002 is the combined table that has all my prices but everything in one place and I would like to split it after column 6 so that we can take all of these items and put them underneath here so I'm going to go to transform data and from here we want to split at column 7 now unfortunately power query doesn't have any ready to use solutions to solve this problem so we are going to do a little bit of power query gymnastics to fix this issue and build our custom function let me show you par first up let's transpose this table so that we can turn rows into columns so you can go to transform and hit on transpose and that will give you a table like this and you can see the problem Clearly Now first six rows are one table next six rows are the next table at this this point you can use the FX button here to add a custom step you do need the formula bar for this so if you don't have the formula bar go to the view ribbon and enable the formula bar so if I disable it I don't have it but if you enable you can see the formula bar in power query so tap on the FX button and we can use the table do split function to split the table that was part of the previous Step at six rows so that is the size so every six rows will become one table this is going to create two mini tables each having one pricing list so you have got table one and table two the problem with these pricing lists are they are still transposed so we need to now individually transpose these back to their original format so first up as this is in a list format I'm going to convert this into a table and now add a column custom column and this custom column we don't have to name it here I'm going to say transpose so table. transpose and then add the column one as the parameter for that so the originally cut tables are now being transposed back so we'll get corrected tables here this custom column has everything that we need we just need to now Stitch everything together to create that giant table as we no longer need the original column I can right click on it and remove that and now using this expand button here I can just expand out all the columns so we'll get everything we still have one problem though which is the headers are repeated so first up I'm going to go to the home ribon and take the first row and make it headers using use first row as headers button Now using the filter I'll remove any subsequent headers by unchecking that option in the filter list that's it my table is ready I can load this into Excel now as promised I want to build a power query function that can do all of these operations for me automatically On Any Given table so let's figure out how to do that you can go back to the query editor by using the query ribbon and edit we're going to leave this query as is but we are going to create a new query which will then become a function for us so I'm going to right click on this query and duplicate this because I don't want to repeat all the steps again and this is the one that we are going to now turn into a function the first step is to rename this it's uh we are going to call this as function table split and now I'm just going to go to The View ribbon and click on Advanced editor this makes it easier for us to write the function and customize the code as you want so here you can see all the steps that power query has done to split the table in the middle and then bring it down and append and remove the headers we can see clearly that it kind of hard quats a lot of values so for example here it is actually getting the data from that specific PDF as well as splitting after six columns or six rows so above this I'm going to open bracket and this is how you can develop a function by changing the signature so we'll call two parameters one is my table as table and size as number and this function is going to run these steps on those two parameters so it'll take a my table as input and the size as input and then run these steps as this will automatically give you a table you no longer have to read the PDF so we can delete that step and the table will be actually my table so I'll just say my table here and then when we are transforming or transposing the size is six so here I'm just going to type size so those are the only changes that we are making we took the original query took away that PDF line and kind of gave it the parameter as the table and then the size as the other option that's it when you click done this query is now going to be converted into a function your screen will Flicker and you'll now see there is a FX available to you called function table split which takes two parameters one is a table and another is a size and it's going to do its thing now how are we going to test this this is really simple I'm going to for a minute there close this thing so that we save the whole setup into power query and now let's uh go to data get data from file from PDF so this time again we are going to connect to the same PDF and we would like to run that function on table 002 so to make that happen we would need to see the table actually not the raw date of it so I'm going to select the entire PDF itself rather than any specific table and then go to transform data this will show you all all the tables on in that PDF in the data column here as we are only interested in table 002 I'm just going to filter this down to that and that is the table the raw data looks like this you can see you can see it here it has all the 12 columns and we want to split it at seven and bring it down so we would like to run our function against this table so here you can go to add column and then click on invoke custom function and uh you can just leave the column column name the function is function table split the table has to be in the data column it automatically picks it up because there is only one table values here and the size is going to be six and when you click okay the function will do its thing and it will generate another table as the output you can expand this out and you will get the result query here this is your concatenated table of course it's going to leave all these original columns as well I can just remove them and then I'll end up with my concatenated table nicely given to me here automatically through my function and we can close and load that into Excel and continue with our work but this is only part of the problem this PDF is actually a part of a 108 page PDF that has multiple pricing tables in different formats so I would like to attempt how to combine all of these pricing tables into one giant data set using power query I'm still working out the specifics of that power query setup but if you would like to see that video comment yes right now I'll make a video on that and upload it on the channel in future so if you're new here you may want to consider subscribing to the channel so that you don't miss out on that video and meanwhile if you would like to see some of the other power query tips and tricks check out the playlist that shows up on the screen I'll catch you there bye
Info
Channel: Chandoo
Views: 20,390
Rating: undefined out of 5
Keywords: chandoo, chandoo.org, Excel, spreadsheets, reshape pdf data, get data from pdf to excel, advanced powerquery, advanced power query editor, how to create custom functions in power query
Id: kN9aTCx1TvE
Channel Id: undefined
Length: 9min 40sec (580 seconds)
Published: Tue Jan 30 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.