Import Data from PDF to Excel Tables Using Power Query (Quick and Easy)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello and welcome to another Excel tips video I am Sumit bunel and in today's video I will show you how to connect your Excel file to any PDF file and fetch the data from PDF into Excel and this can be done using the inbuilt power query feature so you do not need any paid software or any website or third party addin you can do that using power query which is already there in Excel now with power query you can connect your Excel file to any PDF file and then get the data from that PDF file into power query and once you have that data in power query you can transform that data get into nice tables or in whatever format you want and then put it in Excel so this has an added benefit because compared to any software which would simply convert your PDF into Excel and the format might be very messy and dirty with power query you can actually convert it into a better format in a usable format and then get the data in Excel another benefit of using power query to do this is that it automates the process so for example today if I get a PDF file and I want to fetch a specific table into Excel I can do that and next month if I get a newer version of that PDF file I do not need to repeat the process I can simply refresh my Excel and it would instantly fetch the data from that new PDF file as well so let me take you to my computer and show you how all of this works so let me start with a simple example where I have this PDF file where I have these airport codes and I want to get all these airport codes into Excel as a table now I cannot copy paste this because when you you try and copy paste data from PDF into Excel the formatting is usually messed up see what happens when I try and copy paste it it shows me in one single line so what I'm going to do is I'm going to use power query to connect to this PDF file and then fetch the data as a table into Excel so I'm going to go to the data tab here then click on the get data option and here in from file I'm going to click on from PDF and when I do that it opens the import data dialog box where I need to locate the file so it's in my downloads folder and here this is the file Airport codes. PDF and now I'm going to click on the import button and when I do that power query is now connecting to that PDF file in the back end and here it opens this Navigator dialog box where it shows me all the objects it could find in that PDF file so it could find one table which spans across five pages and these five individual pages so I can use any of this data if I click on this table it will show me the data in that table and if I click on these Pages it is going to show me the data in these Pages now because power query has already identified that there is this table across five pages in The PDF I can just select and get this data in Excel so I'm going to select this table now I can load this directly into Excel but there are uh a few changes that I want to make so I'm going to first click on transform data when I do that Excel is going to open this data in power query where I can do the transformation so you can see this is power query editor where this is my query table 001 page 1 to5 so this has taken the name from the object name itself and I can change it here here so let me call this airport codes and now uh one thing I want to do is I want to promote the header so here in the first row I have the header and as of now you can see the headers actually are column 1 column 2 column 3 this is not what I want I want this to be the header so what I'm going to do is go to the transform tab here and then click on use first row as headers and when I click on it it is going to promote the first row as the header so APC City and name becomes the header and I have the rest of the data in the table now this is fine if you have a data set that has different types of data in this case all I had is textual data but in case you have a data set that has let's say numbers or dates or percentages or currency then you can make sure that you change the data type in this case it has already used this step called change type so it has already identified that this is all data type this is all Text data type but in case you have different data types then you can make sure that either power query automatically identifies it as the correct data type or you manually do it so you can click on this icon here and you can change the data type in this case it has identified it correctly so I'm going to leave it as is now I can go to the Home tab and here I can click on close and load now when I do that the default behavior is to insert a new worksheet and then put this data as a table in the workbook what I'm going to do is click on this option here and then click on close and load to and when I do that it opens the import data dialog box and shows me a couple more options in this this case it shows me whether I want to get this data as a table in the existing worksheet or in a new worksheet I can also insert it as a pivot table or a pivot chart or I can only load it as a connection I can also load this as a data model if I want in this case I want it as a table and I want it in the existing worksheet so I would select this and then now when I click okay it is going to get the data from Power query and bring this data as a table in Excel so this has become very easy something that was not possible through copy paste is now being possible with power query but let me show you one big advantage of using power query so let's say I go back to the folder where I have the files and let's say I have this new file called airport codes extra so now I have five pages here but in this file I have six pages so I want this new data also to be fetched into Excel so I don't need to repeat this power query steps all over again all I can do is go back to my power query and change the data source and connect it to this new file instead of the older file so what I'm going to do is come here double click on this query this is going to open the power query editor and here in these steps that has been applied I'm going to go to the source step and I'm going to click on this settings gear icon and when I click on it it opens this PDF dialogue box where I can select the file path so I'm going to click on browse option and I'm going to go to the folder that has the new file and now I'm pointing my query towards this new file and now when I click import and click okay what is happening is in the back end you can see see it has added this table which is Pages 1 to six now in the back end what has happened is power query has now connected to this new file and repeated all these steps automatically so now if I come here and I click on close and load you'll notice that instead of 245 rows it is now 288 rows because now it connected to that new file fetched the data did all the transformation promoted the headers and got this data here in Excel so this is one really easy way of doing it but as I mentioned I showed you a very simple example where we had a continuous table but what if I do not have a continuous table for example I have another file here which is airport codes with brakes and in this case I have these table but I also have pages that are not table in between so these are blank pages or it could be pages with any other text which means that I have tables but these tables are scattered in the PDF it is not one continuous table so how do I combine these let me show you how it works let me start by first connecting power query to the PDF file from which I want to fetch the data so I'm going to go to the data tab here then I'm going to click on get data from file and then from PDF this is going to open this import data dialog box so I'm going to go and locate the file which is this one airport codes with brakes and now I'm going to click on the import option when I do that power query is now going to go back to that PDF file and give me all the objects in that PDF file so you can see here there are four tables and if you remember in the previous example I had just one table that was spanning across five pages because it was a continuous table in this case these are tables which has blank pages or some other pages in between so these are four different tables that are not connected so I need to somehow combine these tables and then get all the data as one single table in Excel so if I click on any of these tables it will show me the data and it also has all these eight pages now I want to combine these four tables and there are two ways of doing it first is I can select these tables manually and then go into Power query and append them so that it combines all the data and the other method is I open all of these objects within the power query itself and then I combine them in the power query using these object names so let me show you both of these methods first I'm going to show you how to combine them by selecting these tables so I'm going to first check this option called select multiple items when I do that it is going to make these check boxes available in front of all these objects so I'm going to select these four tables and now I'm going to click on transform data and when I do that it is going to open each of these table as one separate query in power query editor so you can see I have now table 0 1 02 03 04 so all these four tables have now become four separate queries now I can combine these queries and to do that I have this option in the Home tab called append queries within the combined tab I have this option called append queries now within append queries also I have two options append query and append queries as new if I use the first option what it's going to do is it is going to take the first query and then combine the data from the remaining three queries into this query itself so I would have uh table 001 that would have the combined data from all the queries and if I use append queries as new it is going to insert a new query and then that new query would have the combined data from all these tables so it is neater and more it's better if I have a new query because I don't want to temper with the original raw data so I'm going to click on this option append queries as new and here it is going to ask me uh how many tables do you have so in this case because I have four tables I'm going to select three or more tables and then I need to put the tables here in the tables to append box so here table 001 is already there I'm going to select these three tables so hold the shift key and then select the first one and the last one so it selects everything in between then I'm going to click on the add button here and then click on okay and now when I do that it inserts this append one one query here and then combines the data from all of these tables you can see it shows me 249 rows now one thing that I need to clean here is that there is this row which is the header Row in each table it is being repeated every time it is combined from the table so you can see it is there in number one but if I scroll down you would see that it repeats at number 51 because this table had 50 records and then when it combined with table 002 it started again with the header row and then it would again repeat at I think 150 which is here so you need to make sure that you remove these because this is the data this is these are duplicate records which I do not want now I can remove it from any of these columns but I would not try and do it from the First Column which has the airport codes because this is APC and it might be possible that this is also an airport code so I don't want uh it to remove uh an airport code because I've used this column as the one to remove duplicate similarly city is uh the column where city names would repeat so I cannot use this one but I know that this is going to be a unique column because the airport name is going to be one single name for each airport there cannot be two airports with the same name so I would come here on column 3 I right click then go to remove duplicates when I do that it removes all the duplicate records you can see it also records these steps here in the applied steps column and now I can promote this as header the first row so I would go to the transfer tab here and then click on use first row as header and when I do that it promotes this as header and now I have 245 records and now I can load this in Excel also note that here all the steps are being recorded and it has the last step as changed type because power query tried to identify what type of data is there in each row and then set the data type accordingly in this case as I said this is a very simple data set I only have Text data but in case you have different types of data set such as numbers or dates or time or percentage or currency then make sure sure that the change type that power query has identified is correct if it is not you can remove the step and you can manually change these by clicking on these icons and then setting the data type in this case it is fine so I'm going to go to the Home tab here now if I click on close and load at this point in time it is going to load all these five queries because whenever you create queries in one session and you load them into Excel all these queries are loaded as table so if I come here and I click on close and load at this point it is going to insert all these in separate worksheet so I would have five worksheets and all these tables would be inserted and I do not want this I only want the append one table so what I'm going to do is first I'm going to create all of these as connection only queries and then I'm going to load only append one table so let me click on close and load two option and here I'm going to click on only create connection and click okay so what it's going to do is only create connection for all the five queries and then I can load this one so I right click on this one and then say load to as a table in the existing worksheet and click okay and when I do that it is going to load the data in this table here so this is as I said one way of doing this you select the table that you want to combine from the PDF file and then you combine them now this method is good if you want to do it once in a while but if this is something that you want to automate for example let's say if in the future you expect that there is going to be a new PDF file and you may have to combine tables in that PDF file then this method has one drawback which is that we we hardcoded what tables we want to combine we manually selected table 1 2 3 and 4 so now in the future if your PDF file has six tables then this method may not work so let me show you another method that would work so let me open a new Excel file and show that to you there again I would go to the data tab click on get data from file from PDF and I'm going to select the PDF uh which is airport codes with brakes and I'm going to click on import this is going to open the power query Navigator and here it shows me all these objects now instead of selecting these tables what I'm going to do is right click on this name which is the name of the file and say transform data and when I do that it is going to bring all these objects into Power query where I can then play with them so see what happens when I click on transform data here it is going to open the power query editor and it gives me all these names all these file names here instead of just giving me the tables or just giving me the pages are just giving me the data in those tables so what it has done is it has created one single query with the name of the file and it shows me all the objects which is the page or the table now in this case I only want to combine tables I don't want to combine pages so what I can do is I can come here and I can uncheck the page option and click okay so it only gives me these tables and now I do not want these I just want to combine these tables and you can see in each of these records in this column the data column I have the table and if if I click on this white space here you can see that this cell actually holds the entire table so this is amazing thing about power queries that it can hold the entire table or an entire file in just one single cell so I do not want any other uh column so I'm going to right click on this and say remove other columns and now I'm going to combine these so when I click on this see what happens it shows me this this option where I can select the columns that I want to show in the combined data so I want all these columns but I do not want the original column name as prefix so now I can click on okay and now when I do that it has combined all this data here for me and now I can make the changes so I don't know if there are duplicates or not let me check and I can see that yeah there are duplicat so first thing I would do is I would remove duplicat so I would come here and I would say remove duplicates then I'm going to promote the first row as header so I go to transform and then use first row as header and now I have the right data you can see in the status bar I have two 45 rows which is the same as the previous method and now I can load this into Excel so I would go to home close and load to and here I'm select I've selected table and then I would say existing worksheet and click okay and it instantly loads the data here now one benefit of this over selecting the tables manually is that in this case let me open the power query here in this case I had the source file and I only selected the tables from it which means that in future if my PDF file changes and it has let's say six tables or 10 tables or 100 tables this method would still work it would still combine those files but if you compare it with the previous method where we manually selected the tables it is not going to work if your tables increase but this method is foolproof it is going to work in all the cases so if it's once in a while situation you can use any of the methods but if you want to automate this process then it is better that you use this method now let me show you how to combine multiple PDFs that are there in a folder here I have this folder called codes that has these six PDF files that I want to combine so let me go to Excel and what I'm going to do is connect power query to the folder and then combine all the files in that folder so I'm going to go to the data tab here then click on get data from file and this time instead of from PDF I'm going to select from folder and then select the folder that has the file so I'll go to this folder called codes that has the PDF files and I now I'm going to click on open and when I do that it is going to connect to that folder and give me the details of all the files in the folder as of now I only have six PDF files that I want to combine so let me click on transform data and this is now going to show these file informations in power query editor so it has added this query called codes and it has all these records the one record for each file now in this case it's fine because I just have these six PDF files and these are all the files that I want to combine but what if you want to automate the process for future but there is also a possibility that there could be files with different extension in this folder for example such as Excel files or word word files or PowerPoint files or any other file in that case you want to Future proof your power query in such a way that it is only going to combine the files that you want even if there are more files so what I'm going to do is I'm going to filter this column here because in case let's say there are different types of files those would show up here but because I only want to combine PDF files I'm going to filter this here so I would click on this go to text filter and say equals. PDF and when I do that it inserts this new Step here although it doesn't change anything here but it has actually added a step where if you have any other extension file here it would be removed and it would only show you the PDF files similarly let's say if you have other PDF files that you do not want to combine you only want to combine files that have airport codes in the beginning of the name then you can further future Pro this by filtering this as well so I would go to text filter begins with and then I would say airport codes and now when I click okay again doesn't change anything here but we are future proofing it in such a way that in case there are more files which are PDF files or non-pdf files it would not impact our query in the future now I'm going to come here and in the content column you have these binary cells and if I click on the white space here you would see that it each of these cells holds the individual file that I want to combine so binary refers to a file type and within each of these it has the file airport codes 1 and two and three and so on so I only want this column I do not want these these other columns so I would come here right click and then click on remove other columns now I'm going to click on this combine files option and when I do that it is going to combine all the files that are there in this column which are these six PDF files so see what happens when I click on this it is going to do the magic in the back end and now it opens this combined files dialog box now what it's asking me is it's not going to ask me this for each file it just picks up the first file and it asks me what do you want to combine because as of now it doesn't know it asks me whether you want to combine the table or the page so I'm going to select table one and it only picks up this from the first file here so if if I click on this you can see it says first file and then it says airport codes 1.pdf it's very hard to read because it's in gray color but you can see in the tool tip it says airport codes 1.pdf which means that it is going to now take table one from this file and then go to each file and then pick up the table and combine them so now when I click okay it does the work where it has this query codes which is which was my original query and now it has combined all these steps you can see it has recorded all these steps here but at the same time you would also notice that something else happens it inserts these additional queries here so you can see there are these three helper queries and this transform sample file query now this is something that is going to happen automatically in power query whenever you are combining files whether these are PDF files or Excel files or any kind of file now this is something that you do not have to worry about power query takes care of everything in the back end you only have to think of this query codes which has the data but in case you do want to understand how this is working let me know in the comment section and I can create a detailed video where I'll explain how each of these things work and what is happening in the back end now I want to just get this uh data so I'm going to just work on this query which is codes and again the same thing here I would have to remove duplicates and then I have to promote header so I would come here to column three right click and then click on remove duplicates then I would go to the transform tab use first row as headers if you want to change data type of any of the columns you can do it manually and then I would load this back into uh Excel so I would go to the Home tab here close and load to and here I would say I want a table in the existing worksheet and when I click okay it is going to load it now one thing to know is that it is not going to load any of the helper queries that were created although they were created in the same session these helper queries that are created uh when it's trying to combine files are not going to be loaded they are only loaded as connection it is only going to load your main query so now we have everything in the back end now if I go back into that folder and I remove a file let's say I remove this file airport code 6 and I just want to combine these five files I can just come here right click then click on refresh and this is going to go back connect to the folder and combine the files so this is now a process that we have automated you can add more files to the folder you can remove files whenever next month next week you get new files you put it in the folder and then you just simply refresh the query so it is going to go back to the folder connect to all those files and combine the data so this is something that can save you a lot of time now so far I have shown you example where the data was very neatly organized within the PDF file itself but in most cases that is not going to be the case so let me show you an example where I have this file here this is the data this is the file of Spotify that I got from their website and I want to get this data from this PDF file now you can see this is not very clean there is this text here it is also not very consistent so how can we do this using power query so let me show you how this can be done so so the first thing again I would do is I would connect to the PDF file so I'm going to go to the data tab here click on get data from file from PDF then I'm going to select the PDF file which is this file here Spotify Q3 2023 and now when I click on import power queries connecting to that file and then it is going to show me all the objects it can find in that PDF file so it opens the Navigator dialog box and here you can see it finds all these tables and all these pages in this PDF file now I I have a specific need to extract this table here so I can identify which table is this which I already know is table 002 on page three so this is the third page and this is the table on page three that I want to extract now although you may think that this looks like a proper table when power query is looking at this data it's actually not looking this as proper table it's also considering this as part of the table so this is something that I would have to clean but the first step is that I need to identify what table I want to extract which is this one so let me click on transform data button so that this this opens in power query and where I can clean this data so the first thing I want to do is I want to remove this column so I would select this right click and then click on remove and this is my data this looks better now because this is now only data here in this table now this is my header row so I would promote this one here which is the first one so I go to the transform tab here and then click on use first row as header then I have these blank rows that I do not want so I can come here and then I can uncheck null and click okay and now my data is in a pretty good shape you would also notice that all these steps are being recorded here on the right in these applied steps field uh one thing I want to mention here is about the data type now in this case power query has identified that my data type is of Text data type because I have null I have numbers I have uh percentages I have these negative percentages and I also have these uh dashes so power query identified that there there is no one single data type here and it can be only classified as text so it has put it as Text data type and this is the case for all of these now if you specifically want a data type in this case you can work with it but because there is a mix of data in this case I do not want to change any of this all I really want is to get this data into Excel so I'm going to leave the data type as is but one thing I would want to show you is the change type step in the change type step one problem that you would often see is that it hard codes these values so in this case it has hardcoded the column name user and financial summary and then it has hardcoded Q3 2022 now if I'm trying to automate my work and in future I want to use the same query to fetch data from some other file it is going to give me an error because I've have hardcoded these values so you need to be cautious about this in this case let me keep it and let me show you the next time when I connect it to some other file I'll show you how if this causes a problem but in this case let me keep it but in most cases it is a good idea to remove change type step if if you do not want it so I can rename this query let me call this sptify data and now I can go to home close and load two and I can load it as a table in the existing worksheet so now when I click okay it loads this query called Spotify data and gives me the data so now this in itself is quite amazing because I have this file here and I instead of if I had copy pasted it it would have given me the data in a really horrible format but using power query I was able to connect to it identify a specific table remove all the garbage stuff and get the data in a nicely formatted table now while this is good let's say I have another file that I want to now connect this query to and fetch the data from it so as of now I fetched the data from Spotify Q3 but now I want to fetch the data from Spotify Q4 so I'm going to go back to my query and change the source and this is where the magic is going to happen so I'm going to go to uh go to this step called Source click on this gear icon and then change the file path so I would click on browse and now I'm going to select the new file from which I want to fetch the data which is Spotify Q4 and click on import and now when I click on okay it is going to connect to that file and fetch the data from there but as I mentioned there is going to be a problem here if I go to the last step you see there is this error and this error is because of the change type step if if I click on go to error it is going to take me to change type which is actually causing the error and the error here is because it out coded the value Q3 2022 but when I connected it to the new file it did not have this column so I would have to remove this step and make sure that there are no hard-coded values as I said I can easily remove change type step it's not very useful so now when I remove it you'll see that my data works perfectly and now I can load this in Excel so because I've already loaded this in Excel I can just come here and I can click on close and load and you'll see that it connects to that file now and it now loads loads the new data which is for Q4 2023 so this is again quite wonderful uh you can connect to any PDF file and try and clean the data as much as possible as I mentioned getting data from PDF is in itself very very difficult but Power query makes it easier because it allows you to transform that data in power query itself and then load this data and it also then allows you to connect to any other file and you do not have to repeat the process you can simply refresh the query so this is how you can use power query to quickly fetch the data from PDF files or combine multiple different files in a folder that's it in this video I hope you found this useful also if you're liking these videos please subscribe to this YouTube channel and click on the Bell icon so that you never miss out on any new Excel tips video I come up with thank you and have a nice day
Info
Channel: TrumpExcel
Views: 8,264
Rating: undefined out of 5
Keywords: convert pdf to excel, pdf to excel, power query, import pdf to excel, convert pdf to excel without losing formatting, how to convert pdf to excel, pdf to excel converter, how to change pdf to excel, import pdf into excel spreadsheet, how to convert pdf file to excel, import table from pdf into excel, advanced excel, excel, excel basics, excel tips, excel tutorials, learn excel, sumit bansal, trump excel, import pdf to Excel table
Id: fMH8gxGBTfE
Channel Id: undefined
Length: 29min 17sec (1757 seconds)
Published: Thu Apr 11 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.