Import Data from a PDF to Excel

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
office and Microsoft 365 users can now import data from a PDF to excel using power query it's the same functionality that's been available in powerbi desktop for a while now it's super handy for importing data tables inside PDF documents because power query can locate them for you ready to import and if the data is a little messy we can use the power query transform tools to clean it before loading it into Excel or powerbi let's take a look this is the PDF I'll use for this example you can see it's mostly text and a few charts and then on page seven here we have a table and again on page eight another table and there're the two tables I'm going to import into Excel let's pop over into Excel I've got an empty workbook here so on the data tab of the ribbon we're going to get data from file and then from PDF here I need to navigate to to the PDF that I've saved on my PC it's this one so click the import button it's going to open the Navigator window on the left we have the list of tables that it's found and all of the pages I can click on the tables and get a preview in the Navigator Pane and this is the second table with the various Columns of data we can also import the pages themselves and then pause that data I'm going to check the box to select my multiple items and I want the two tables now if they're in the perfect layout ready to load to your worksheet or into the power pivot model you can click load and it will deposit the data there I'm going to click transform data to open the power query edit to window because my data's a little messy we need to do a bit of work before we load the data into Excel the first table here doesn't need too much tidying up the First Column is just an index number which I don't need so I'm going to press the delete key and get rid of that the next thing I noticed is that it's not in the perfect tabular format which I need for working in Excel to be a tabular layout it should have one column for the number of simulations one column for the model and one column for the cmip number so let's quickly fix it I'm going to start by duplicating the query so right click duplicate the first query is going to contain my cmip3 data and the second query is going to contain the cmip5 data so let's start with the first query this is simulation data so I'll call this one simulations 3 and therefore I want to remove the cmip 5 data so selecting those two columns just hold down shift or control to select them both and then press the delete key now I need to add a column that identifies this data as cmip 3 so we'll give this column a name cmip and the number is is three click okay so there's our cmip number let's rename this model and this query is done let's repeat those steps for this version of the data remember this is my simulations 5 and therefore want to delete the two columns for the cmip3 data let's add the custom column for the cmip number this one's five click okay now we've got our identifier there I can Ren name this model and let's just expand this column you can see here it's had a one appended to it I'll just double click and remove the underscore and the one from the end and that will just make this column name match the column name here so that when I append these two queries they're going to stack the columns on top of one another so we're ready to append them on the Home tab I'm going to append the queries now I'm just going to append them as new just to keep it clean primary table is simulations 3 and the second one is simulations 5 and click okay and now we have one table of data stacked on top of one another containing both cmip3 and cmip 5 so let's change the data type for this to whole number and we'll give this query and name simulations and with that I'm ready to close and load so let's take a look at the second table now it's a bit Messier because the column labels are spread over multiple rows and some of them are also merged across two columns if we go back to the PDF you can see here columns B through G have one column label for two columns so that's going to cause us some problems but we can clean that up with power query so let's do that back in Excel you can see we have 1 2 3 four rows of column labels and then if we look at column B you can see it's got the column labels but the second column for column B hasn't got any labels and likewise every second column except for this one here we'll come to that in a moment is missing the column labels so to fix that we can go into the transform tab transpose the table and now we have our four rows of column labels displayed as four separate columns and with that we can merge them together before we do that though remember we've got some rows that don't have the labels so what we can do is with the four columns selected on the transform tab we can fill the labels down now we're going to have a problem with this one here this label should be on this row we'll fix that later obviously when power query's imported the PDF it hasn't been able to correctly identify that this column header also belongs to this row of data so with that I'm just going to merge the four columns together so with them selected right click merge columns I'm going to use the space separator because I want to space between each piece of text we're going to leave the column name as merged it doesn't matter what it is we don't actually keep it in the long run so I'll click okay now we have one column containing our column headers and the rest of the columns containing the data so I can go ahead and transpose it back there's our column labels now all I need to do is promote them to be the column headers so on the drop down here I can use the first row as headers and that promotes them up to the header level now I can just tidy up the couple of column labels that are incorrect so this one here we can see has a space in segment so let's just delete that space and over here this column should be called quti of R blahy blahy blah so I'm just going to double click and copy this one before I copy it I'm just removing the carriage return or the line break crl C to copy double click this one crl V to paste and I just need to replace three on the end with five press enter and there is my table of data it contains the quantile information so let's name this query quantiles now it's important that you give your queries reasonable names because this name becomes the table name in Excel and in power pivot if you load it to the data model so make sure you give your queries a name that's going to be useful now before we load the data by default if you have tables that run over multiple Pages for example credit card or bank statements p query will automatically combine the tables into a single query table however in my experience it doesn't always detect them correctly and if this happens we can specify a page range that we want imported from the PDF into a single query table if we look at the source step in this query you can see in the formula bar it uses the pdf. tabls function now this function has optional parameters that lets us specify the start page and end page so what I'm I'm going to do is just copy this line and we'll create a new query right clicking in the queries pane new query from other sources and then at the very bottom blank query in the formula bar I'm just going to contrl + V to paste in the source it's already got my PDF file path so it's just saving me some time and we can replace this parameter with the start page and end page parameters so I want to start from page seven and end at page eight I just want two pages and with that we just press enter and you can see there it's listed the pages and the tables on those pages and I can filter it further so for example if I only want the tables I can select just the tables and now I have two tables I can click on them and get a preview in the bottom and when I'm happy I can click on the double-headed Arrow to expand the columns now it doesn't make sense to merge these two tables into a single query table because they contain different data this would really only apply where you have tables that span multiple pages so it's the same data continued across multiple pages in this case we have different tables of data across multiple pages but just keep in mind that's how you specify a page range so we'll give this query a name just so you know what it refers to and we're ready to close and and load so I'm just going to go to the Home tab and close and load to and in the dialogue box I'm going to choose only create connection because I don't want to load all of the queries remember the simulations 3 and five are really only staging queries so I'll click okay you can see they're all loaded as connections I'm going to right click on the ones that I want to load to the worksheet so right click and go into load to here I'm going to load this one to a table and it's going to pop it on sheet one and there's our quantile data and then again for the simulations one this is the final appended version I'm going to load to a table and we'll put this one on a new worksheet and there's the data for my simulations well I hope you found this technique useful you can download the files for this lesson from the link here and if you like this video please give it a thumbs up and subscribe to my channel for more and El share it with your friends who might also find it useful thanks for [Music] watching
Info
Channel: MyOnlineTrainingHub
Views: 132,277
Rating: undefined out of 5
Keywords: power query, import pdf to excel
Id: Xkew2GrXu9c
Channel Id: undefined
Length: 11min 10sec (670 seconds)
Published: Wed Aug 12 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.