Excel VBA Introduction Part 48 - Web Scraping Across Multiple Pages

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to this wise L excel vba tutorial this video covers web scraping across multiple pages and it's a direct follow on from the previous video we showed you how to do the same thing with just a single page so as the name suggests we're going to look at scraping multiple web pages in a single procedure it helps if you've seen the previous video but will give you enough information to follow along with this one even if you haven't so start with a really quick recap of the HTML and XML object libraries and how to reference those and then how to generate HTTP requests and create new HTML documents from the results we'll look at using a web browser to inspect HTML elements just to help you to work out what it is you need to process and then we'll talk about writing code to refer to those HTML elements either by class name or by tag name we'll then look at a couple of techniques that allow you to loop over those collections of HTML elements before we finally start concatenating URLs and looping over multiple pages the final part of the video is going to cover how you can write out the useful data to some worksheets in Excel which is really the scraping part of web scraping for the final flourish will then add some hyperlinks back to the source data just so you've got some kind of useful information to work with as a final result so it's quite a long detail video help you up for it let's get started let me start by explaining the example we're going to create in this video so it's a web page you might be familiar with already it's the YSL home page of course from the top menu bar I can choose from the resources section videos and that takes me to a list of all the videos we've ever produced so that the same as the videos you've got to upload it to YouTube of course they're just them embedded on these are puzzle pages on the white cell site we've got the videos organized by category so you can see down the list on the left hand side there's various different categories showing videos of different types so if I clicked on the excel vba category that takes me to a page showing me the first 20 excel vba videos now each video is represented by one of these little gaudy boxes and there are details included in there like the title and the author and the description and of course some video categories are so large that not all the videos fit onto a single page so there are some paging controls down on the bottom so what this video is going to show you how to do is write a system that is going to scroll through all of the individual video categories on in this list and it's going to go to each one in turn it's going to go to each page for that category in turn and then write out a row in a worksheet showing information about the video so I have a separate worksheet for each category the starting point for all this is well it's in a blank Excel workbook so if I head into Excel I've got a brand new blank Excel workbook I've already saved it as a macro enabled workbook so that means I can just head straight to the Developer tab head to the visual basic editor and here's the starting point we'll meet a new module of course so let's insert that now I can right-click and choose insert module and I'm not going to bother renaming it before I start writing any code I'm going to reference a couple of object libraries these ones that we covered in the previous video so I won't go into too much detail here so I'm going to go to the tools menu choose references and then I'm going to scroll down first of all and find the Microsoft HTML object library so I'm going to ignore the the Microsoft Internet controls library in this video yes and in the previous video without using Internet Explorer just for efficiency reasons I'm not going to use Internet Explorer in this example which is going to find the Microsoft HTML controls so it's sort of like on HTML object library the way that we're going to connect to the website itself is using Microsoft XML so there's a Microsoft XML library in here somewhere there's a few of them in fact I'm going to go for the latest one that's installed on my machines that's Microsoft XML version 6.0 that allows us to make HTTP requests without having to use a web browser which makes life a lot but it makes the code much more efficient doom easy doesn't have to deal with opening up the web browser and all the complexities of controlling that other application now we can start writing the subroutine so let's create a new sub I'm going to call mine something like get a video page and what this one is going to do is send an HTTP request the Wisel video home page and then return the HTML and generate a new HTML document from that so we're going to need a couple of variables to achieve that let's declare our first one I'm going to call mine XML rec or short for request as a new so it can be an auto instancing variable msxml - that's the name of one of the libraries we just referenced and then an XML HTTP in my case 60 so this is got the numbers 60 or 6-0 after it because it's version 6.0 it may well be that if you've referenced a different version of the library if you didn't have version 6.0 in your machine that could easily be 5 0 or it could just be XML HTTP but in my case it's definitely 60 will also need somewhere to store the document that we create so let's have another variable I'm going to call mine HTML doc as a new so again another auto instancing variable ms HTML and dot HTML document next we can open up our new HTTP request so to do that I'm going to say XML req dot open and there are several parameters I need to fill in so the first two the first one I've got to fill in is the method I'm going to apply which in this case is get the second parameter is the URL I want to get the information from so in this case it's in fact I'm just going to go and copy and paste out from the wise L wise our page so I'm just going to browse back to the videos home page and then just copy the URL from the address bar just to avoid typing it in the last thing I'm going to do is fill in the third parameter which is technically an optional parameter but I'm going to make sure that this request is run asynchronously so it when single is essentially that means it's going to wait for a response before it precedes with the rest of the code once you've filled all those details in we can send the request so again I can say XML req dots send the next thing I quickly like to be able to do is check that the request has succeeded and we can do that by checking the status so let's say if XML rec status icon status is not equal to the number 200 that indicates that the request has succeeded and we've browsed to a page well not browse to it but we've returned a page so if that condition is not true or if the status is not 200 then I'm going to print out a simple message box so let's say message box I'm just gonna print out a single message that says problem and then I'm going to concatenate to that a VP newline character and then I'm going to print out the XML req dot status so we'll see what number the status actually is and then I think it would also be helpful to have the status text as well just so we can see what's actually going on so XML req status text we'll make sure that we exit the subroutine at that point so we don't just carry on blindly if we haven't succeeded to return the page that we were looking for and then we'll have an ENDIF to finish off the block if at that point is probably worth I'll just quickly testing that the request succeeds so with the correctly formed URL let's just try to run that one we shouldn't see anything happen at that point it should just run and no problems reported if we had a slightly malformed URL so we've said something like video rather than videos and then we executed that again this time we're going to see a problem and so dreaded 404 error so the webpage was not found so if I click OK that tells me that I haven't successfully returned the request and then this ability in just ends so I switch that back to videos and make sure that it is going to work and then let's carry on and actually get the HTML document generating the new document is pretty straightforward all we're going to do is say HTML doc dot body dot inner HTML and we're going to make that equal to the response text from the XML request so XML req dot response text so what that will do is generate a brand new HTML document formed of the text sent back via our HTTP request at that point what we then need to do is start thinking about how we process the individual items on the page and the best way to start with that is to have a quick look at inspecting elements on the web page now we use this technique a lot in the previous video to find out what items we had on the web page to work with so I've got the Y's l video of page open up in Internet Explorer but you can also use Google Chrome or Firefox or Safari whichever browser you happen to use the same technique will work I'd like to find out some information about these video categories first of all so I'm going to do is right click on any one of them and choose inspect element what I should do is open up the document object model Explorer and highlight the exact bit of code that's been used to generate that single specific item so what this is telling me is that the particular item that I've clicked on is has a tag of a which represents a hyperlink with an attribute called href which is the link itself and some in a text it's also a member of a list item tags in CLI there who's got a unique class attribute which is wo menu item that's all part of a ul ur an unsorted list which is part of a div tag so I got a variety of ways I could start to get access to these individual items in the previous video it looks quite a lot at using getting items or getting a reference to a set of elements by their tag names we use things like a and div and so on to get a list of items by their tag what we're going to do in this video a little bit more is use the class name so that's a technique we didn't see in the previous video so in this case I can see that each of the items in this list has got a unique class of wo menu item so my first job is to get a reference to all of those items using that class name so I'm gonna head back into the VB editor and I'm going to start by declaring a couple of variables to help me out with this I'll call the first variable something like vid cats short video categories and the type of that will be an MS HTML dot I HTML element collection so that's of course just a collection of elements from the HTML document to set a reference to that what I'm going to do is after I've got a reference to the document itself once that's being created I'm going to say set vit cats equals HTML doc dot then there's a method of the HTML document called get elements by class name so yes they will look to getting elements by ID we got a look at them getting elements by tag name in this video we can have a quick look at how to get elements by class name so if I say get elements by class name open up some parentheses and some double quotes then the name of the class was W a menu item and if I close it up quotes and close the parenthesis that now stores a list of all of the elements with that specific class name just as we're going along and testing out code it'd be useful to print out some information about the various elements we're getting access to so I could say debug print vit cats dot length just to find out how many items are actually in the collection and if I displayed the immediate window by pressing ctrl + G and then should execute the subroutine at that point I'll find that there apparently there are 19 elements in that collection if I want to find out any information about the individual items in the collection I can loop over the vit caps element collection and I can do that in a couple of different ways a nice convenient way to do it is actually using a for each loop in which case I'll need another another new variable so I'm gonna say dim fit capped represent representing a single category as an MS HTML dot I HTML element so having done that I can scroll back down and then after I've printed out the length of the collection I can say for each vid cat in vid cats then next vid cat and then I can think about what information I would like to print out about each individual video category for each of these elements then what I'd like to be able to see is the type of tag that is some will say debug print vid cat dot tag name and that also like to see the inner text property so fit cat dot inner text having done that I'm just going to run the subroutine again by pressing f5 and this time I'm going to say a whole range of different values printed out into the immediate window if we have a quick look at what's in the immediate window then you can see that it's actually everything that's been printed out is a list item tag and you'll see the various different names of the items there the Unitec so you can see that some of them corresponds to video categories but it also got some of them correspond to the video authors as well and then a little bit further down they can correspond to the years in which the videos are published so just going back to Internet Explorer and having a look at what we've got here we're seeing not just videos by category but we're also getting access to the authors and years which is something I don't want to do so I'd like to look at how we can sort of change that so rather than looking at all of the Y's our menu items the problem that we've got here of course is that these things correspond to the same class so they have the same class w a menu item if I expand this class menu title inside the Dom Explorer I'll find another menu list and then inside there another unsorted list with items in there those classes W a menu item and they correspond to the authors and then another one further down for the years what I'd like to do is prevent that from happening so here's what we're going to do we're going to look for items in the page first of all whose class is wo menu list now there are several of these in the page already so you can see that there's many lists for the authors and a menu list for the years as well we can guarantee that on this page the first item whose class is menu list is the one that contains the video categories so my first job is to get a reference to an element whose class is double your menu list and make sure that it's the first element in the page with that class so let's head back to the VB editor and let's think about how we can do that I'll start by declaring another variable which is going to hold the video category list let's just add that one just above the vid cats so I'm gonna say dim vid cat list as an MS HTML dot HTML element so it's not an element collection it's you can refer to a single element in the page before I set my vid cats reference then I'm going to say set like I'm going to say set vid catalyst equals HTML doc dot get elements by clasp name open some round brackets and then the particular class name that I'm looking for this time is wo menu list rather than menu item now currently this line of code is going to try to get a reference to a collection of elements which is no good because I haven't declared the variable as an element collection is just a single element I'd like to return just the first item that belongs to the collection of elements returned by this method so to do that I can kind of almost cheat at the end and open a new set of parentheses and refer to the first item in the collection now the element collections are indexed from zero so the first item has an index number of zero so having done that that'll give me a reference to just that single individual menu item or menu list folder video categories whilst I can now do is attempt to get just the individual video categories from the category list so one approach that you might think about trying in order to do that would be to try to use the get elements by class name not for the HTML document but for the vid catalyst element so I save it cat lists get elements by class name that will appear to make sense but if I try to run that one at this point it's going to fall over if I click the debug button it falls over on the line that I've just modified I can't use the get elements by class name method on another HTML element this method is only applicable to the HTML document now fortunately for us there are other techniques we can use to to make this works and it's going to stop running the code at this point and then instead of trying to get elements by class name what I'm just going to do is refer to the property of an HTML element called children so vit catalyst dot children so children refers to the element collection which is all the direct descendants of that particular element in the HTML code to an our case our if I can find it in here really reasonably quickly this first wo menu list has a range of direct descendants here which are all list items so if I were to go back to my VB editor now and I'm just going to clear the contents of the immediate window click into it press ctrl-a and then delete and then click back into my code and hit f5 to run it I'll find that I've now got a list of just the video categories and not authors or years now what I've just done here leaves me with another tiny little problem because it wasn't actually the list items that I was interested in getting a reference to these are the direct children of the video category list but if I go back to the Dom Explorer here it's not the list items that I'm interested in it's the children of the list items so effective it's the grandchildren of our wise our menu list is the a tag is the the hyperlinks that I'm really interested in because these are the ones that hold the URL that I'd like to browse to next as well as a sensible name for the category and an indication of how many videos there are so rather than getting access to the children of my sorted list what I'm going to do instead is refer to just the a tags inside that item so going back to our VB code again rather than saying vid catalyst children I'm going to save it catalyst dot you get elements by tag name now at this point you're thinking hang on your intelligence doesn't show that you can run into the same problem that you run into earlier on by getting elements like last name fortunately not the intelligence is a little bit misleading here although you can only apply the get elements by class name method to an HTML document you can apply the get elements by tag name method to any HTML element whether that's an entire document or any other individual element so what I can then do is open up some parentheses then and some double quotes and then look for a tags and then closer double quotes and close the parentheses and then what I'm going to do is click back in civvy immediate window and hit ctrl a and delete and then I'm going to run the subroutine one more time just to see what I now get okay so the immediate window here shows me that I'm definitely just getting the a tags and I'm showing that the inner text is the same as it was before so it's the video category names and I'm still not getting the authors or the years which is things that I didn't want so what I can do now is make sure that I get the H ref attribute so I'm not bothered about printing out the tag name anymore let me just get rid of that at that point what I really want to do is get the H ref attribute so after vid cap dot in a text I'm going to type in a comma and then I'm going to say vid cat dot gets attribute so this allows me to revert to any attribute of a tag so these are the items were just switched back briefly to the Dom Explorer these items in this particular browser are displayed in red text here so these are things like class in the H ref etc so back into the VB editor the particular attributes that I'm after here is H ref and by closer double quotes in closer parentheses clear the immediate window one more time and then run the subroutine one more time now get to the urls of the various video categories now I do just have one final problem with the range of items I'm looping over and that's to do with the very top item in the list is the videos homepage I've already browsed the videos home page with the very first request that I opened and I don't want to browser the home page again so I want to eliminate this item from the list one way to do that I suppose would be to use an if statement somewhere inside out for each loop and say if that was the videos home page then ignore it and move on to the next one instead but it's slightly more elegant I think if we eliminate that one from the loop altogether now sadly I can't do that with a for each loop what I need to do is change this into some other kind of loops I'm gonna use a for next loop to do that choose a full neck loop I'm going to need to declare some kind of number variable a counter variable that allows me to count through the individual items so I'm going to declare a variable called vid cat ID as an integer now if I wanted to use this type of loop to process the video categories I could say instead of for each vid card I could say for vid cats ID equals and then type in the value for the lowest number I want to start looping up so if I was going to loop over the entire collection I've wanted to process all 12 items in the collection I would say 0 because as we saw earlier on the indexing of element collections begins at 0 and then rather than in I would say to vid cats dot length minus 1 we could spell length properly that would help so vidcast at length minus 1 so what that would do is loop 12 times and print out the information about each video category or it would at least if I've actually set a reference to the vid cat so I would do inside the the loop then oh as well I need to change the next statement to a next vid cat ID and then the first thing I could do inside the loop is say set vid cat this is the actual HTML element variable equal to vid cats open some parentheses and then say vid cats ID so that will use the number that the variable is currently holding each time through the loop so having done all that if I just clear out the contents of the immediate window again and then if I run that one we'll see we once again get all 12 items now having done it this is much easier to eliminate one of my items from my list if I know that the videos home pages element at number 0 all I need to do to make this work is begin my loop at one instead so one last time if I just clear the contents of the immediate window from the subletting one more time this time you should find that I only get 11 elements background or silk they're still a count of 12 of course the the count won't have changed the length that the the element collection but I've eliminated the videos home page from less I'm only actually printing out 11 of those what I need to start thinking about doing now is converting each of the href attribute values into a proper fully-formed URL so a little bit more like what we've put in earlier on to open up the HTTP request in the first place so essentially I need to concatenate this part of the URL with this part of what the href attribute returns to create a complete fully formed URL to help us out with that I'm going to declare a couple of extra variables and a constant as well we're going to be referencing our Y's l videos home page quite a lot so scrolling right up to the top of the module I'm going to declare a constant which is going to be called something like wall vid URL as a string and I'm going to make it equal to what I've already typed in so in fact I'm just going to cheat and copy and paste or in fact cut and paste this one I'm going to cut that out and paste it in to the the constant value what I can now do is just modify my open method by referring to wall vid URL so that makes that code a little bit tidier I'm also going to declare a couple of extra variables for handling each of the individual href attributes I don't necessarily need both these variables but it will just make life easy to break down the individual steps of what we're about to do so I'm going to declare something like dim next H ref as string and dim next URL as string and I'll put a capital L on that as well and then I can go back down to my for next loop and work out how to construct a fully-formed URL out of what we've got so far so I'm just going to start by giving myself a bit of space in between the set vid cap line in the debug print line and what I'm going to do next is say next H ref equals and then I'm just going to pick out the href attribute from the big cat so I can just copy and paste that from the debug print statement this will just make it easier for the next line of code where I'm going to try to concatenate everything together but extract a part of the value of the H ref so I'm going to say next URL equals we've got the wall video URL and I want to concatenate that with the results of the mid function so you might have used this in previous videos the mid function allows you to extract bits of text from a longer string so there are several parameters here the first thing you've got to specify is what string of text you're getting your your values from so I've got that stored currently in next H ref which just makes my life a little bit easier I don't have to refer to vid cat get attribute href then I can say which character position to start at now the bit that represents the useful part for me is everything after the videos for slash so I've got to get to the part where I found where the start of the actual category begins that's everything after the video is in the forward slash there there are several ways I could actually make this work I think one reasonably consistent with you that I guess for the way the wise our web page is set up we know we can identify the colon character fairly easily in the string and find out its position that's one two three four five six and then from there we know that it's going to be followed by a full slash the word video is another four slash so from position number six I can count one two three four five six seven eight nine characters from there so if I wanted to do it that way what I'd need to do is establish how I could find the colon character inside the string so to do that I'm going to use the in string functional in astro as it's actually spelt so I M STR but I can open up some round brackets now I've got an optional parameter to star with the start position just going to begin at position number one if I don't specify otherwise so in fact I can just jump straight to next H ref followed by a comma and I'm looking for inside there a colon character and then I can close the parentheses for that so that will return the the number one two three four five six then I said that I wanted to add one two three four five six seven eight nine characters to that so I can say plus nine and then close the parentheses and that should return the correct URL just to give that a quick test what we're going to do is debug print not vid cap in a text and the href I'm going to save it catch to inner text comma next URL excuse me next URL spell that properly eventually there it is having done that I'm going to clear the contents of the immediate window and then just run this one one more time just have a quick look at what we actually get so that's all looking reasonably good so you've got all the various video categories with nicely formed URLs at this point and we're essentially ready to go now and start browsing to those category pages now because this ability is getting quite longer than will the already I'm going to separate out the next part into a different procedure into which we can pass some useful information so we're going to pass in the inner text as the video category name and the URL that we've just constructed so that means that I need to declare a new subroutines let's just go back to the code down below our previous end sub and I'm going to create a new one here called sub list videos on page and what I'm going to do there is declare two separate parameters so the first one is going to be called I'm going to call it vid cat name as a string followed by a comma and then I'm also going to say vid catch URL as well so let's call this one vid cat URL also as string my closer parentheses at that point in hit enter a couple of times essentially what I'm going to do next is rather than debug print the inner text and the next URL I'm simply going to make a call to my list videos on page procedure so let's just get rid of that debug print statement and instead say list videos on page if I tighten the space they're just CV in the little tool tip pop up we're passing the inner Texas the category name and the next URL has the video category URL now the first part of this procedure is actually very very similar to what we did to start our original procedure so just scrolling back up towards the top I'm going to make one little addition to the code we've written in our first procedure once we've retrieved the response text from the HTML request sorry the HTTP request there's absolutely no reason to keep that around that object has done its job at that point so currently it's going to be still available for the rest of the procedure which we don't really want so what I'm going to do there is just set XML wrap equals nothing just to tidy up the code a little bit what I'm not going to do is again just to cheat ever so slightly is copy and paste a few bits of code from our original procedure so let's have a quick look I'm going to copy the two variables that refer to the HTTP request in the HTML document we'll need both of those I'll paste those back in down to the second procedure I'll also need the code that tries to browse to or open up the new request so I'm going to copy and paste the the open and send a little if statement that checks whether it returned success or not and then the last two lines there that returned the response text to generate the new document and then ends the XML requests or sets it to nothing so let's just copy all of those lines and paste them in and the one single thing that I need to change now is rather than trying to browse to the wall of video URL all I need to do is insert browse to the vid cat's URL so let's just copy and paste it over the top of that and that's the basics of the beginning of that new procedure okay so just to see where we should be imagining with that we actually have this web browser open and we see also what pages we Proust's to so imagine I clicked on Oregon to the first category excel vba and that would be the page that we'd be currently looking at so the next job that I'd like to do is to list through or loop through all of the individual videos displayed on this page so I'm just going to inspect the elements I'm going to right click on this little gray box and choose inspect element and then see what this is actually called so we can see that that's actually an href so it big one is an a tag it's a hyperlink tag with an href attribute and that's going to be quite important to us we'll need to pick that out to construct the URL it's also got some inner text here which which refers to the video name so we're going to need to pick up the inner text here as well you can see that it's part of a paragraph a P tag with a unique class name W a video list default series title is also they're part of a div tag with a class of WL video listro so just collapse the P tag there let's just have a look at what we've got so we've got a classical video list row which has three children one of which has an another child which is an a tag which has the actual values that we want we've also got a second P tag which gives us information about this case this the video author so they can see there's a link there so who authored the video and what date that was uploaded on or post it on and then there's a final P tag there as well which doesn't have any children but it just describes or provides the description for the video so that gives it a little bit of information as to what to work with I think the approach we're going to take as each video is represented essentially buy one of these wo video list row classes we're going to use the gets elements by class name first to get a reference to all of those and then within there I think what we will do is get a get all the child elements by tag name and look for the a tags so that's the approach we'd like to try to take let's head back to the VB editor and try to put that into practice let's start by working out how to loop over the individual video rows first of all so we'll have a couple of variables to do that let's say dim vid row as ms HTML dot HTML element then we can say dim vid rows so we're almost always constructing a singular and a plural the same variable for this these individual loops so let's say Ms HTML dot I HTML element collection okay so if those two variables declared what I'd like to do now after we've successfully got the HTML document I would like to loop over or so first of all get a reference to the video rows using the get elements by class name so I'm going to say down here set video rows or sorry set vid rows equals HTML doc and dot gets elements by class name we can use that here because you're allowed to apply this method to the entire document just not to other individual elements and the name of the class was wo video list row I've got to open up the set of double quotes there and now in closer parentheses at the end and that will get us access to that collection of elements the loop then is just the same as we did earlier on is a for each loop so we can say for each Vidro sorry Vidro in vid rows next vid row and then we can work out what sort of useful information we might want to print about that let's see I'd like to print out I think how many children each video row element has so let's have a debug print statement that will say Vidro dot children dot length again the intellisense kind of fails at that point sadly but we know that the children property returns an element collection so that must then have the same properties and methods of an element collection so we can revert at a length property so having done that I'm just going to clear the contents of the immediate window and we know that we can't run this subroutine from within comrade singly anymore because it has a couple of parameters it requires values to be passed into it we have to either call it from somewhere else or we can also just run a subroutine that already calls it so let's head back up to our first subroutine click somewhere in there and then run that one and we'll get a create big long list of the number three and the reason getting a long list of the number three as well as the number 12 which is the the silver debugger print in the original subroutine appear somewhere I think I'm going to just comment that out at this point just to avoid confusing the whole great big long list of number three's it is to do with the fact that if we switch back into the intent Explorer window and the Dom Explorer we can see that each of our wo video list row class items contains three individual children okay so having established that we can get access to each of these individual video boxes let's work out what the useful bit of information will be so inside each video list row we've got three paragraph tags one of which contains an H ref or sorry an a tag with an href attribute containing the videos link the URL that's definitely import an important piece of information the inner text of that a tag as well contains some useful information it's the name of the video the other bits of information in the other P tags aren't really that useful in this particular case I'm not particularly interested in who posted the video or the link to that person's profile and the wise I'll page or what date it was it was created so I'm not going to ignore that P tag and I also don't care about the video's description either what I really care about here is getting access to the URL video and the videos title so that's contained within a tag within the ulis row there is also another a tag of course the link to the videos author which I don't care about so what I'm going to do is get a reference to all of the elements by tag name with an a tag and then just pretend the first element from that collection so back to the VB editor and we'll need to declare a new variable to help us out with that I'm going to call this new variable something like vid linked as an MS HTML which I'm so getting bored of typing by this point I HTML element get that spelled correctly you would have thought I could have spell that correctly the number of times I've typed it in so far so that's just a single element normal element collection the technique that I'm going to use then of course back in the for each loop is to say set fit link equals Vidro dots and get elements by and again you'll see that the intelligence fails at this point but get elements by tag name open some parentheses and double quotes the letter a closer double quotes in the parentheses and then open some more pretend the sees and refer to the first item in the collections at the same thing that we did earlier on now rather than printing out or debug printing the length of the children collection what I'm going to do instead if I just move that statement down so I'm going to print out vid link dot in a text which will hopefully show me the name of the video for thy comma and then also something like vidlink dot and get attribute opens in parentheses hate ref closer double quotes and parentheses and that should give us some actual useful information okay so if I just clear the contents of the immediate window yet again and scroll back up to the first subroutine in the page and run that one again this time I want to see long lists of different videos from various different pages so that looks pretty successful to me for their starting point so you can see all the excel vba videos or at least the first page of excel vba videos and then the Excel Power bi videos and then some more popular videos and so on and so on excel vba userforms so we're definitely successfully getting different video categories the problem that we've got left and the one big one to solve is how we get this to work for each individual page so where there's a video category which has too many videos to display on a single page how do we get around that issue as always the best way to start working out how to solve this sort of a problem is to inspect the HTML of the page so let's head back to our web browser which I still have open and I'm already on a page which is page 1 of 3 so that should mean that if I scroll all the way to the bottom of the page there should already be eventually when it catches up things go a little bit strange when I'm running my screen recording software so here we go so I've got some paging controls down at the bottom of this page so I right-click on any one of these and choose to inspect the element I should see what that is made up of and fortunately for us someone's been very very thoughtful and giving these paging elements a sensible class name wo paging item that means there should be three of those on the page if the in this particular case there's three different items so indeed there they are in a list so each list item in an unsorted or unordered list there are a tags in there but the great thing about this is they got unique class names now we do have to be slightly careful about this because it's quite possible we could be on a video category which doesn't have any paging controls so for instance if there was should I bother scrolling all the way back up to the top if there was a video category the very few videos in it so there indeed there's one there's only one single video in there that's my attempt of making a video game in Excel see how you can push excel it's like for naturally so you can see there are no paging controls in this particular page so we have to be slightly careful with exactly what code we write anyway if I switch back to the Excel VBA cat we just so we've got something to reference again and just make sure we're happy with what the names of the paging controls are right click and inspect that element one more time so that was a wo paging item so back to the VB editor and let's work out what code now we need to add to get this part to work well we should definitely start by declaring at least one variable this one's going to hold an element collection so we're going to return all the elements by class name matching the one with the we're just looking for the paging controls so let's say dim vid pages as ms HTML dot I HTML element collection to set a reference that's what we're going to do is once we've got the page loaded and before we get the video rose we're going to set a reference to the paging controls to work out how many pages we're going to be working with so at that point our going to say set vit pages equals HTML doc dot and get elements by class name maril is open some parentheses and some double quotes and the name of the class was wo paging item spelled out properly that's quite important paging item okay so there we go so we've got reference to all of the video pages elements at this point we should probably just loop over this collection to see what information we can obtain so this one's I'm actually going to just comment out the vid Rose loop and then let's declare another new variable somewhere up near our vid pages variable I'm going to call this one dim I'll just call it dim I as integer so I'm going to use a simple for next loop rather than for each loop in this case you'll see why shortly thought going to do is we're going to say for I equals 0 to vit pages dot length minus 1 then we'll say next I and what I'd like to be able to do is say debug dot print vid pages open parentheses I close parentheses dot get attribute and then open some more parentheses and double quotes I'm going to say H ref closer double quotes in closer parentheses so at that point I'm just going to execute this ability from the first one and then over I should clear out the contents of the immediate window there first and if I were to run this one this time what I ought to find out is there we go and there's all of the individual paging controls so for Excel VBA we've got three pages for sequel we've got two pages for a point services we've got two and it looks as though everything else has just got one single page now the way this loop works it's going to cause us a couple of small problems because essentially the next step is to shift this section of code here inside our for next loop so in fact let's just do that right now I'm going to move my next I statement down to below my next video statement and then under going to indent all those lines of code inside there one space now currently the way this is set up this isn't going to handle it's not going to deal with pages or categories where there's only one single page of videos so if you imagine that we looked for vid pages or we tried to set of eight pages to all the paging items on a category that didn't have any paging items that would mean that the length of the category would be 0 or sorry the length of the vid pages collection would be 0 so if we subtract 1 from 0 of course we get negative 1 so you'd have a loop that was going to try to loop from 0 to negative 1 so unless you actually added in a separate step minus 1 statement that loop wouldn't process categories where there was only one page so to get around that small problem first of all what we're going to do is add an if function so rather than always subtracting one we're going to subtract if so you may not have used this previously in in VBA it's exactly the same as the Excel worksheet function is just a cell with two eyes rather than just a single one so I believe that makes a pronunciation if rather than just if but anyway if vit pages dot length is greater than zero followed by a comma then I'd like to subtract one followed by another comma otherwise I just like to subtract a zero so this will always loop from at least zero to zero so the loop will always process once for every category of videos this debug print statement is not much good here either if we're referencing an item in the vid pages collection well that collection only exists if the page or the category has paging controls in it so if I try to print that for let's say the c-sharp video series where there's only one page of videos then that would fail miserably because there is no eyes into reference in addition to that we don't want to reference the first page of our multi-page categories anyway because we're actually already on that page if we have this bit of code reactivated anything uncommented then we'd already be looking at this first page so I only want to make this work for items where the length of Vic pages is greater than 0 or federally for where I is greater than 0 so let's add in another if statement this I'm not an if function and if statement we're going to say if I equals of BEC button I group I is greater than 0 then we're going to do a whole bunch of things to make sure we generate and process every single page in all the paged items such as had in the end if and then we'll work out what extra code we need to add in to make this part work the approach we're going to take here is pretty much identical to the one that we use to generate the URLs for each video category so in fact let's scroll back up to our previous procedure and we've got a couple of variables in there next page ref and next URL which I'm just going to steal and pop back into this procedure we've also got a couple of lines of code that work out how to generate the next page repin next URL so let me just copy and paste those two lines from the first procedure we'll need to adapt them ever so slightly but I'm going to copy and paste those inside the if statement that we've just written so if I just pop those in I'll need to make a couple of small changes to the code we've just stolen so of course we're not referencing the video category anymore I'm referencing vid pages I so I'm going to copy and paste that over the top of it cat the next thing we need to do is make sure that we're not printing out the href attribute anymore I want to check that I am in 4d to printing out the value of the next URL so that's first of all move the debug print Simon down and then just tidy up the code a little bit get rid of some extra blank lines that we don't need and then rather than printing out the href attribute let's print out the next URL okay so just to make sure that we are actually generating the URLs correctly for each page in the categories with multiple pages let's scroll back up to the first subroutine one more time and give that one a quick little test so there we go looks pretty successful to me we've got some nicely formed URLs pointing to each of the individual pages or the categories that have multiple pages now of course just printing the URL isn't particularly useful what we want to do next is take that URL and send a request an HTTP request then get access to an HTML document so that we can process all the video rows on that page so to do that again we can choose to essentially copy and paste from in fact in this case from the existing routine so let's get everything from XML record open we can copy that and we'll have a little if statement that determines whether it succeeded or not and we'll get the inner HTML and set xml rut equals nothing i'm just going to copy all of those lines and then inside the if statement just below where i've printed out the next URL I'm going to paste all that in I'm going to do a little bit of tidying up with the indenting otherwise I'm not going to sleep tonight if I leave it looking like that we also need to make sure that we're not getting the video category URL this time I'm getting the next URL so I'm going to cut and paste or copy and paste that and then I'm going to remove or at least comment out that debug dot print statement okay so at this point the core part of the code is pretty much there if I can just uncomment the video rows section that we commented out earlier on this is still doing the same job as previously it's just working with whichever HTML document is provided to print out the individual details for each video in all the video rows the cool thing that we've done now at this point is we've added in the code that will change which HTML document is pointing to based on whether the page hasn't was also whether the category of videos has multiple pages or not and also of course we had in the code earlier on from our first subroutine which changes the page for each category so all the remains at this point is just give this a little bit of quick test let's just clear the immediate window one more time and I'm going to click into the first subroutine and then just from the entire thing and see what we get so that's looking pretty successful didn't take too long to carry out either so let's just scroll up and see what we've actually got so it looks like we've got Excel I think we've just chopped off there are too many videos printed out this in there was about 200 and 210 213 videos all together so chopped off some of the videos at the top but we've definitely got videos pass Excel VBA part 20 so the 20 videos per page if you remember so we're getting all the VBA videos at this point and all these are the ones and then we've got all the power bi videos which is on a single page anyway and then we should also have the VBA forms videos and we've got all of those who have fewer than 20 of those anyway sequel queries again there are fewer than 20 though is sequel programming and there were fewer than 20 those sequel Server 2016 and reporting services so reporting services should be one that's multi-page as well so we've got more than 20 videos there so that's looking pretty good we're getting all the videos that we actually wanted whether it was a single page or a multiple page and from every single video category so what that remains really is to work out how to get this useful information into some cells in the worksheets so getting the new worksheet to be created for each video category is pretty straightforward all we need to do is make sure that we create a new sheet before we start looping over the individual pages so in the second subroutine before we start looping over the pages even maybe before we set the video pages all we need to do is create a new worksheet we need a worksheets dot add and then I might want to set range a1 on that worksheet to be equal to the video category name so I say range a1 dot value equals vid cat names that was one of the two parameters we passed in earlier on I'm just gonna for convenience set the selected cell to range a2 just to make a life easy and then we can think about how we can process up or pass out the values that were interested in inside the loop into the cells in the worksheet as well so inside this for each loop that's slipping through each of the video rows rather than debug printing things all I'm going to do is change debug debug print sorry to active cell a dot some value equals bit link dot inner text so that's going to give us the name of the video then what I'm going to do is take the rest of it down to the next line I'm going to say active cell dot offset 0 comma 1 dot value equals bit link get attribute href so we'll just it with the h4f attribute just for the time being really importantly because of the way this is working we need to make sure that we move the active cell so we can say active cell dot offset like button offset 1 comma 0 dot select in the previous video we looked at how we could use cells row numbers and column indexes to populate the cells so it's slightly different approach in this case we're physically moving down the worksheet somewhat less efficient approach this one actually it's just too Lesko for me to write at this stage so you want the more efficient technique then go back to the previous video I guess this one is just for convenience we should really do something with a video link here as well rather than using the href attribute we ought to generate the full complete URL just like we did earlier on so what I might do actually is just declare a new variable or a couple of new variables up towards the top where all of our other variables are let's say dim let's say vid HOF our string and dim vid URL our string and we'll just use exactly the same approaches we've used already a couple of times to generate a fully formed URL so if I just copy and paste from earlier on the next href and next URL lines that won't take too long to do then scroll back down into the main part so after we've set the vid link up and we've got the reference to its tag what we can say here if I paste those two lines in is we can say next sorry back one that should be vid href of course the one that we've just created so that will be vid href would be equal to vit link dog get attribute href and then video RL will be equal to while video RL of course and then mid vid height ref and in string vid H ref and then rather than putting H ref attribute into the cell we could simply say vid URL okay so that's pretty much all the code written I think at that point last thing that remains is to give it a quick test so let's scroll back up to the very top of our code to the first procedure that we wrote and then we can just execute everything from this get video page procedure I just hit the Run button we ought to see a great big long list different worksheets appearing and if we switch back into excel will hopefully find that we get all the courage valleys in there as well so a few archived old videos everything should be there the names of the video is the URLs that we've formed apart from the formatting of course they're changing column widths and maybe even applying hyperlinks to these cells rather than just having the raw text in there the important thing I'm looking for of course is making sure that the categories whether more than 20 videos more than one page that we've still got all of the videos that we'd expect to see so that's kind of important for the VBA category of course whether were 52 videos in total getting quite to be quite a big series at this point so maybe just for a tiny little bit of tidying up just a little bit of frivolity we could maybe do a little bit of formatting for the column headings and there may be also apply hyperlinks to the cells that we can actually click on them to view the videos how about that so one last thing we'll get back into in fact before we do that let's just delete all of these worksheets first so I'm just going to quick quickly do this by selecting the first sheet with data on its then hold down the shift key and the last sheet with data in it I can click on that as well and then once I've done that second right-click on any one of those and choose delete to get rid of them all at the same time and then head back into the VB editor scroll all the way down to the very very bottom just to add a tiny little bit of extra code I can close down the immediate window at this point just to give myself a little bit more space and if we find the part where we're just adding in the new worksheet so we change the value of a1 to the video category name I'm just going to do a similar thing to range b1 so let's just copy and paste a little bit and say range b1 we'll make equal to just a literal string video url okay and then we can do a tiny bit of formatting there as well I suppose we could say range a1 and b1 dots interior duck color equals something like I know let's go for RGB cornflower blue something on that and same thing again I'll just copy and paste that let's just change the font color as well if you're getting bored at this point do feel free to stop watching we're not going to do anything more exciting or interesting apart from adding the hyperlinks to the air the cells says a tiny little bit of formatting there maybe we should make the font bold as well so last little thing sorry I can never resist doing this so let's have the font in bold as well should've used a with block for that I didn't really think that I was going to do quite so much fun doubled equals true okay scrolling down a bit further than where we've actually put in the values in the cells particularly the video URL what I want to do next is before I move down to the next row assign the hyperlink to the cell based on the contents of it assigning cable links to cells is relatively straightforward we can refer to the same cell we've just added that the URL to so that's active cell door set 0 comma 1 dot hyperlinks dot ad so there's a couple of compulsory parameters the first thing is the anchors of the object the hyperlink belongs to well that's the same object that I'm handing the hyperlink to so I can just copy and paste active cell that offset 0 comma 1 followed by another comma and then we need to pass in the address so the address is the same as the value of the cell which is again the same as the video URL so I could just here reference and vid URL ok the sort of thing once I've finished looping through all of the cells for that particular table so before I go back and then start adding in a new worksheet what I can do is at the end of that loop say range a1 dot select just to put myself back up in the top left-hand corner of the sheet excuse me I'll spell select properly and then finally I'm going to say active cell dot current region so current region refers to the entire table of data that that cell belongs to and then entire column dot autofit so I should just make sure that all the columns become the correct width to display all the data stored in them so with those few changes let's have a quick look now running this subroutine kind of I scroll all the way way back up to the top and then click into the first subroutine let's just run the whole thing one more time we'll see all those worksheets appear again but this time if I switch back into Excel once it's all finished we should see something looking a little bit neater so that looks a lot better if I just as you mean so you can see that clearly so there we go um somewhat more long-winded video perhaps in yesterday and we looked in things in a little bit more detail hopefully if I were just to click on one of these hyperlinks hopefully if they actually appear there we go so we should be able to see now if I click on let's click on yesterday's video shall we scraping web pages just to make sure that's still working and so there we go um hopefully you've enjoyed that one as much as you enjoyed the previous video sorry it was so long but hopefully got some really useful ideas out of that and you can adapt to that again to your to your own needs thanks for sitting through it thanks for watching see you next time if you like what you've seen here why not head over to the wise our website where you can find those more free resources including these videos some written blogs and tutorials meeting some exercises that you can download to practice your skills thanks for watching see you next time
Info
Channel: WiseOwlTutorials
Views: 59,262
Rating: 4.9381046 out of 5
Keywords: excel vba, visual basic for applications, web scraping, web crawling, scraping html, html scrubbing, html, elements, getelementsbyclassname, getelementsbytagname, wise owl, xmlhttp
Id: sGw6r5GVA5g
Channel Id: undefined
Length: 60min 7sec (3607 seconds)
Published: Mon Nov 14 2016
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.