Scrape Data from Multiple Web Pages with Power Query

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in this video i'm going to show you how you can automatically scrape data from multiple web pages with power query using a custom function now this approach requires urls that have a predictable structure like a page number or like this one that has a reference for the start number let's take a look this is the web page that i want to scrape the data from and i want to get the book title author the book type and publisher information now you can see this page is showing me the first 10 results of over 23 000 records which equates to over 2 300 pages of data now with power query we can automate the gathering of this data and you can use either power query in excel or power bi i'm going to use power query and power bi because this web page doesn't have the data structured in a proper html table and with power bi i can use the new add a table using examples tool which isn't available in excel yet we start on the home tab get data from web and i just need to paste in the url for the page and click ok in the navigator i can choose from the html tables available on the page and if the page that you're wanting to get data from shows it nicely formatted in one of these tables listed here then you simply click it to select it and then either load or transform data and this works exactly the same in excel another option we have in power bi are suggested tables and if we look at this one table 14 and make it a bit wider you can see it has almost all of the data that i want but it also has a load of data i don't need so i'm going to use add table using examples this opens up another window let me maximize it you can see at the top it's a preview of the web page and at the bottom we simply type in our columns and examples so the first column is the book title and the first book title on this page is crime and personality so you just start typing it in brings up a list to choose from and it's this first one here so i'm just going to double click to select it we'll give it another example and there it is there the structure of human personality so double click and with that it's worked out that i want all of the book titles and it's completed the list for me i'll click on the asterisk to add the next column and this one's the author the first book author is by h j isaac so there it is there at the top double click to select it and with that it's worked out the field that i want let's add the next column this one's the type and the type for the first book is ebook double click and that's enough to complete the rest last column is publisher and the first publisher is london and it's the second one here double click and with that it's completed the rest so i simply click ok and then if i just wanted this one page of data i could click load but remember i want to get data from multiple pages so i'm going to click transform data this will open the power query editor i'll just make it a little wider and we can see all of the data there next i need to convert this query to a function and we do that via the advanced editor which is available from the home tab and at the top i just need to add a line of code that includes my variable name so my variable name is page start and it's text then we just enter an equal sign and a greater than sign and that's my variable name let's copy the variable name because what we want to do is replace this one for the start number with the variable name so i'm going to enter a double quote and then an ampersand paste in the variable name another ampersand and then i have another double quote already there so we're replacing the one with the variable i'll click done and that converts my query into a function let's give this function a better name i like to start my functions with fx just to differentiate them if i have a long list of queries it's easy to tell what's what you'll also notice the function names are italicized now for this website i need a series of numbers for the page start variable and i can generate them using power query but other websites might use text strings for their variables if so you can import a table that contains your list of text string variables i'm going to create a new query and we just want a blank query because we want to generate a list of numbers between 1 and 21 610 it's a list so it's in curly braces and the two dots instruct power query to give me a list between 1 and 21 610 simply press enter now i need to convert this list to a table you just click the to table button and at the dialog box click ok now remember i only need the page start number and there are 10 records per page so i need 1 11 21 and so on so i'll add a column for the modulo and the value here is 10. i want every 10th record and then i can use this modulo column to filter and just retain those rows that have a one so you can see there they correspond to the page start numbers that i want so select one now modulo has done its job so with it selected i'm going to press the delete key now i have over 2 000 page start numbers and loading that many web pages to scrape the data will take a very long time so for the purpose of this example i'm just going to filter this list and just keep numbers that are less than or equal to 41. that's just going to give me five pages to scrape so there's my list now i need to convert this to data type text a url is text so i can't put numbers in it it needs text and i'm ready to invoke my custom function we can do that by the add column tab invoke custom function and i'll select it from the list i only have one column so it's defaulted to that and that's perfectly fine and click ok now if i click in one of the table fields you can see a preview at the bottom so this is a list of books on the first page this is the books on the second page and so on we can click on the expand button to expand those rows out into one table i don't want to use the original column name as the prefix that's the original column name there wouldn't make any sense so click ok and if we maximize the window you can see the books for those first five pages of the website now this column's done its job so i'll delete that let's give the query a name now this is going to be the name of the table in my power bi model so it's important that you give it a sensible name and i'm ready to close and apply so on the home tab i can close and apply and if we take a look at the data view there's my table of books now one of the great things about power query is that it maintains a connection to the web page and this enables us to refresh this connection and get updates if needed you can do that via the home tab and then click refresh well i hope you found this technique useful you can download the file for this lesson from the link here and if you like this video please give it a thumbs up and subscribe to my channel for more and why not share it with your friends who might also find it useful thanks for watching you
Info
Channel: MyOnlineTrainingHub
Views: 45,234
Rating: 4.9796181 out of 5
Keywords:
Id: VkDpwSGWPmk
Channel Id: undefined
Length: 8min 26sec (506 seconds)
Published: Wed Aug 05 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.