Excel VBA Introduction Part 47.3 - Internet Explorer vs XML HTTP Request

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to this wise owl excel vba tutorial in this video we're going to compare two techniques used to scrape websites using Internet Explorer and using XML HTTP requests we'll begin with a quick recap of controlling Internet Explorer using VBA we'll explain how to navigate to a web page check that Internet Explorer is ready having finished loading the page and then how to capture the document of the Internet Explorer object in an HTML document variable will then explain how to use the XML library and how to open an XML HTTP request we'll show you how you can send that request and then check that it's sent a valid response how you can capture the response text in an HTML document variable and then we'll show you how to pass an HTML table and write the contents out into a new excel worksheet so let's get started one of the main reasons for creating this video is to help answer a question that one of our viewers posted on a previous video in the series so in part 47.2 scraping website tables and clicking links chap called will booth I hope he doesn't mind me pointing this out asked the question about scraping a specific website all to do with betting odds for a golf tournament so the website we're gonna scrape is is this one here called odds checker and I thought making a video is a nice simple way to answer the question that we'll asked another main reason thought creating this video is to have a sort of formal reference point for comparing two main techniques for scraping websites in VBA we have covered the two techniques in previous videos in this series so part 47 point 1 and in part 48 and 49 as well cover both using Internet Explorer and also using XML HTTP requests now we've mentioned the use of the two different techniques in those videos but we haven't got a formal reference point for it so this video is going to help create that that video that I can link people back to when they ask the question in the future so to get started what I've done is created a brand new blank Excel workbook and what we can do is head straight into the Developer tab and open the visual basic editor open up a or insert a new module then we can start by creating a new subroutine called scrape odds using IES let's start with the Internet Explorer version if you've watched any of our previous videos which use this technique you'll be familiar with heading to the Tools menu and choosing references and then setting a reference to two extra object libraries first of all to control Internet Explorer we can insert a reference to the Microsoft Internet controls object library so you can find that in the list and then check a box next to that set a reference to that library the other library I want to set a reference to is the Microsoft HTML object library so this is the one that will allow us to loop through the various HTML elements and find the bits of the page that we want to write out into an Excel worksheet having checked those two boxes we can click OK then we can declare a couple of extra variables so let's start by declaring variable to hold a reference to an Internet Explorer object so dim ie as I'll make this an auto instancing variable so that I don't have to determine when I need to create a new instance of Internet Explorer I can get my code to do that for me so dim ie as new SH doc view dot Internet Explorer then I can declare a new variable which will hold a reference to an HTML document so I'm going to call this HTML doc as MS HTML dot HTML document just so that I can see what's happening when I run my code I'm going to make the Internet Explorer application invisible so I can say I e dot visible equals true and then I'd like it to navigate to the page that I pointed out earlier on so I'm going to say I e dot navigate and then I'm going to pass in the URL by heading back to my previous browser which is Chrome as it turns out and then I can just copy the URL from the page I showed you and then head back to my VB code and then just paste that in inside a set of double quotes next I'd like to capture a reference to the document of the internet explorer and store that in my HTML dock area below so to do that I want to make sure that the Internet Explorer application has finished loading the page first and there's a few different ways we can ensure that but let's go with a do-while loop I will say do-while ie dot already state is not equal to a ready state complete or ie dot busy it's a bit sort of belt and braces but there seems to be a fairly reliable technique and then we'll just continue looping around that until basically Internet Explorer has finished loading the page when we've done that we can say set HTML doc equals ie dot document now I need to find out how I could potentially reference the items I want to write out into a worksheet so let's return to the web browser I'm looking at this page and it doesn't really matter if you're not using Chrome you can find an item within the page that you're interested in so if I find one of these cells in the table that I want to capture you can right click on that item and choose an option similar to inspect it might say inspect or inspect element or something along those lines if I do this in Chrome I get a new panel opening on the right-hand side of the screen which immediately highlights the item that I've just right clicked on and inspected so you can see hopefully if i zoom in that this is a TD tag a table cell it has a specific class and a few other properties there that TD tag is part of a TR tag so nested inside a TR a table Row one of the nice things about inspecting in chrome is that when you hover the mouse over an individual item it actually highlights that in the left-hand part of the screen as well so that table row belongs to a table body which is one of two sections in the table as also a table head you can see if I flick between the two it highlights those in the left-hand window and those table sections are parts of a table with a class called event table if I just very quickly go up a little further the table itself is part of or nested within a div tag with a unique ID and this is something that I've been looking for and unique IDs are great for referencing items in a webpage so I know there's a div tag called odds table container I can reference out by its ID I know that there's a table nested inside there some of its children as a table with a class called event table so knowing that I can work out now how to get references to those two things let's head back to the visual basic editor and declare a couple of extra variables first of all dim HTML div as an MS HTML dot I HTML element so that can hold a reference to one single HTML element I'll then say dim HTML table as Ms HTML HTML element as well to capture a reference to the div tag the one with the unique ID I can say set HTML div equals HTML doc and then I can use a method of that called get element by ID so if I open and close open a set of round brackets and double quotes the name of the item or the ID of the item was called odds table container and that will get me a reference to the div which contains the table that I'm interested in now the table that I'm interested in doesn't have a unique ID it does have a class which is called event table the problem with that is that there might be lots of elements in the same div which have the same class there could be multiple event tables within that div the table also obviously has a way to identify it's using this tag name which is just a table but again there might be lots of tables in the same McDivitt tag so if I'm using the class name or the tag name there isn't a simple way to get a reference to a single element there are methods I can use to get a reference to all of the elements with the same class name or the same tag name if I can just briefly show you with a quick bit of sample code if I say HTML doc dot and then get elements by class name or get elements by tag name I could use those techniques I can also apply those techniques to the HTML div as well rather than to the entire document I can narrow this down to just the div tag that I've gotten a reference to the annoying thing is that these methods don't appear in the intellisense unfortunately so let's say for example I wanted to get a reference to everything bytes happening just use the name table I can say set HTML table equals HTML div dot then get elements and again you can see here that I don't suddenly get the addy until he sends help but get elements by tag name they need some round brackets and some double quotes I can get I can enter the name table and then close double quotes and close around brackets you may be able to spot the obvious problem with this the name of the method is get elements by tag name and I have a variable which can only hold a reference to a single HTML element if I wanted this line to succeed what I would need to do is change the type of this variable to an eye HTML element collection which returns multiple elements of as the name suggests however there's a quick way we can shortcut this rather than returning all of the possible tables in the div tag and then looping over all of the tables to find one we're interested in what we can do is treat the results of this this method as a collection or an array if you prefer and tell it to return a reference to the first item in the collection though it returns in a similar way to referencing worksheets or workbooks or other objects in VBA by their index number so for example I could say worksheets and then rather than saying for example sheet 1 using his name I could say worksheets 1 to return a reference to the first item in the collection now in the the HTML library of the elements or the collections I should say are index from 0 rather than 1 but if I simply open this out of parentheses after the get elements by tag name and then enter the index number of 0 closer parentheses that will get a reference to the first table in the collection of elements returned with the table tag at this stage all I'd like to try to do is write out a bit of sample information about the table just to see that this is actually working so let's use a debug print statement and I'm going to refer to HTML table dot class name I'll then make sure I can see the immediate window by heading to the View menu and choosing immediate window or pressing ctrl + G and then if I were to run the subroutine bear in mind this may take a little while it's going to launch Internet Explorer it's going to try to browser that page and wait until it's already that might involve loading some annoying adverts and pop-ups but eventually I get to the point where I've extracted or captured a reference to my event table and I can see its class name written out into the immediate window if I just close the Internet Explorer window here and then just switch back to Chrome just so that we can see that I have indeed captured a reference to the table that I'm interested in its class is event table so that's one technique we can use to connect to a web page and get information from it but it's not necessarily the most efficient it involves waiting for Internet Explorer to load and then connect to the web page and then present it with for any pop-ups and adverts etc to appear before we can actually capture the information we want a potentially way more efficient technique to use is to connect to that web page without using a web browser at all using something called an XML HTTP request before we do this if you're interested in the documentation for this and there's quite a lot of it I should warn you but this is the page you're probably interested in starting at the MS XML library so I'll just zoom in so you can see the URL there so if you want to do out to look this up yourself afterwards this is the place to go now there's an awful lot of information in here I'm certainly not intending to cover absolutely all of it what we are going to do what the bits we are interested in are the XML HTTP request object so it's a bit of basic documentation about the the object itself you can see a little bit remembers its methods and properties and events so what we're going to do is we're gonna create a new XML TTB requests we're going to use the open method which will allow us to configure the URL we're requesting information from and tell it that we're using a get request and then we're going to send that request once we've opened it and wait for a response and then we're going to use the response text property and capture all that response text in a new HTML document object so that will effectively at that point give us access to the content of the web page in the same way as using Internet Explorer just a lot quicker and there's a few other specific pages as well you can see the actual open method help page and the send method help page and the response text property etc so all that's available to you if you need to look that up or you feel like having a bit more of a read and the slight downside to this is that none of the examples are in VB it's all in either JavaScript or air or C C++ syntax you have to do a little bit of working out how to get this to work properly for Visual Basic anyway let's head back to the visual basic editor first of all and then solve with I'm just going to copy and paste the entire subroutine we've written so far the basic structure is still the same and it'll be nice to be able to see what things we need to change if we just copy and paste the entire subroutine and then modify the name of it so it says scrape odds using XML HTTP next we'll need to set a reference to a new object library so let's head to the Tools menu and choose references and then in the list we can find Microsoft XML so let's scroll all the way down to the Microsoft section and then a little bit further and there we go Microsoft XML now you may well find several different versions available and I'm gonna go with the latest version available the version 6.0 and I'd recommend doing the same thing depending on which version of Windows you're running you may see some slight differences here but I'm gonna with the latest version available to me having done that I can click OK and then in my new subroutine I'm going to replace first of all my reference to an Internet Explorer variable with a new one I'm going to call it XML request just for short and then I'm gonna make that or declare that as a new msxml or in my case XML - dot XML HTTP 60 6.0 sus there the version number that I've just just referenced if you've referenced a different version number of the library than I have then you may see something slightly different but in my case this is clearly the one to go for as we're not going to be using Internet Explorer we can happily delete the line that tries to make it visible and in fact we should do that otherwise the the code won't work we don't have an internet explorer variable anymore let's reuse the navigate line though let's not just delete that entirely we'll need the URL of course but instead of using the navigate method of Internet Explorer we're going to use the XML request dot open method so there are several parameters of the open method which you can read all about this I've just pointed out in the help pages the first parameter is what method you're using so what we're going to try to do here is try to get some information so we need to say - what method the request will use we're going to rightly get we then need to put in the URL which is what we already have in there so that's what we're going to do here we're going to use the the same URL as we used in Internet Explorer and then we're going to use the third parameter called VAR async and this is to make sure that we are the we wait until they are the center calls completed so that we've got the fully formed web page returned to us rather than just returning immediately since here the default is true which means it returns immediately potentially before we've returned any useful information so let's head back to the VB editor and to specify all these things we will open the request using a get method we will pass in this completed fully formed URL and then we'll set the third parameter the VAR async to false to make sure it waits until the add the full page has been formed next we'll need to send the request we've opened and then we should check that the the request has sent a valid response so that it hasn't returned some kind of error let's get rid of the do-while loop we don't need to wait now using a do-while loop for Internet Explorer to be ready we can then say XML requests not send and then we can check the status of that request there is a statist property if I just switch again quickly back to the documentation the status property shows you what values can be returned by an HTTP request so what we're going to look for is number 200 which indicates that everything is ok there are many many other possible return values here for the status things you'll recognize you probably recognize 404 errors when you've not found the URL you're looking for and there are many others as well so you can use this table here to determine what to do in the event of some kind of failure what we'll do is we'll just check for the status of 200 and as long as everything's ok we'll proceed if anything goes wrong we'll just show a basic message showing what has gone wrong using the built-in default status text property so to make that work we can simply say if XML request and status is not equal to 200 then we'll display a basic message box which says XML requests lots they just so we'll get the number and we'll concatenate that with the well a separator character just a hyphen and the XML request dot a status text or we're over there say just text there it is we'll then make sure that we exit from the subroutine and then we can say and if and at that point we'll have the the request sent and opened and hopefully succeeded so all we need to do now is return the response text of our request into a new HTML document so let's just make a quick modification to the HTML doc there evil lets me get an auto instancing variable by declaring it as new and then rather than setting HTML doc equal to the document property of Internet Explorer we're going to say HTML doc dot body . inner HTML equals xml request dot response text from that point on everything else should be identical to what we've done with internet explorer previously so to test this let's display the immediate window again i can go to the View menu and choosing media window or as always press controlling G let's just clear the contents of the immediate window and then if we run our scrape odds using XML HTTP procedure we should see the same end result but this time an awful lot quicker without the overhead of having to open up and control Internet Explorer okay so that should give you an idea of the key differences in terms of the code you have to write to switch between using Internet Explorer or XML HTTP requests you'll certainly see much better performance using the second technique although it's a little more tricky to work with us you don't see anything visible on screen it's a little more abstract but I agree but I think the performance.get benefits are are worth it anyway all that remains I suppose is to help will out and show him how to write out the contents of that web page in that table that we've got a reference to into a new worksheet so I guess at this point unless you're really desperately interested in betting odds for golf tournaments you might want to just stop watching all we're going to do now is write a new procedure which loops through the various sections and rows and cells of this table and writes them out into a new worksheet so let's head back to the visual basic editor and I'll close down the immediate window at this point and then let's create a new subroutine which we can call from either of our other two procedures again just to demonstrate that they four at this point everything else is identical it's just how you get access to the web page itself so let's have a new subroutine called right table two worksheet and I'll have a parameter for this method as well let's have a parameter called the table to process and I'll allow us to pass in a reference to an MS HTML dot I HTML element so that'll be the table or the reference to the table that we captured earlier so in order to call this one just to demonstrate this from our scrape odds using XML HTTP we will say write table to worksheet and then we'll pass in our HTML table to our table to process and parameter just to make our life easier and the code a little more readable let's declare a few variables let's have one called table section as Ms HTML not I HTML element and then rather than write that's out many times let's just copy and paste that a couple of times and we've got table section let's have a variable called table row and another one called table cells so all of those will be I HTML element objects I'd also like to keep track of the row number and column number we're on to make it easier to write the information into the worksheet so let's say dim row num probably eventually dinner onnum as long comma column as long finally just one extra variable for convenience let's say dim output sheet as worksheet again I'll spell that correctly eventually there we go we'll use the final variable we declared to create a new worksheet in the workbook and then capture a reference to it so we'll say set pallet sheet equals this workbook dot worksheets dot add what we're then going to do is start looping through the various elements of the table so I'm going to begin by saying for each table section so there are three sections to a table there's a header and a body and a footer so I'm going to say for each table section in table to process dot children so the children are the direct descendants they're the items the elements that are nested directly within the table tag so again if I just switch back quickly to the the Memorial Tournament we can see that within the table and the next level down are the sections at the table so table then T head T body T foot touching nothing in the footer of this table but this will effectively loop three separate times so we'll say for each table section in Table two processed our children and then next table section within there we want to loop through the individual rows in each section so we can now say for each table Rome and again I'll spelled correctly eventually table Rome for each table row in table section dot children so again the direct descendants of a table section are the table rows say next table row again just to quickly switch back to the web page and show you the within a section let's go for the let's go for the body section just for demonstration purposes and then within there we have the table rows so TR tags TR TR TR now within each table row we have a number of columns or TD tags so that children of a row are the table cells so you can probably guess where we're going with this we can say for each table cell in table row dot children and just close that loop off next table cell and there's our basic loop structure for processing each individual cell in the entire table now each time I go to a new row and through a new cell I want to increment my row numb and column them variables appropriately and they'll be initialized to 0 by default if it makes you feel better you can explicitly say Romanum equals 0 and column equals 0 or that will be the default anyway it's a good practice to be explicit about things I suppose in programming then for each new row that we visit I'd like to simply increment the row number variable so let's say within the for each table row loop will say Romanum equals row num plus 1 will do the same thing for the table cell loop for the column number variable so inside the for each table cell loop we can say call num equals call num plus 1 the important thing about the looping through the table cells leaping through the columns is that when we begin a new row we want to reset call num back to 0 so after we finished looping through the cells in a single row we can say : num equals 0 before we move on to the next row in the table at this point all we need to do is write out the appropriate property of each table cell into the cell in the worksheet indicated by the row number and the column number so to make that work let's refer to the output sheet variable and then the cells property and the cells property allows you to reference a range object by its row number and column number so we have nicely conveniently named variables row 'm and call num and we can make the value of that cell equal to the inner text property of the table cell element is a table cell dot in a text at that point all we need to do is give this a quick test so let's just quickly scroll back up to the scrape odds using XML HTTP bear in mind there be absolutely no difference other than performance there's no difference whether you do this with the xml httprequest or whether you just want to copy and paste this into the Internet Explorer version as well just one last thing you might want to do if you're using the Internet Explorer version is to say ie Quint to make sure that the application closes down at the end anyway I'm going to go to the scrape odds using XML HTTP and I'm simply going to run that subroutine and we should be able to see that we end up with a new worksheet although is labeled as a workbook there we go I've got my new worksheet and if I quickly click back click back and I've got all the contents from that table all listed out not particularly beautifully formatted I will freely admit there's quite a lot of text in there maybe a little bit more than than expected but we can clearly see that we're getting details from each row of the table all the odds from the various different bookmakers for all the players in that table if you wanted to compare performance or just prove that the same result will occur with their internet explorer feel free to head back to the VB editor and then run the scrape odds using ie routine this will take a little longer to run because we have to fire up internet explorer and wait for it to load wait for the adverse and the pop ups and so on and there it goes eventually nearly still running still running still running it will get there eventually you can clearly see that the benefits of performance using xml httprequest just from this simple demonstration there we go finally Internet Explorer has finished and there we go we have the same output using Internet Explorer they just took an awful lot longer to get there the form I think is ever so slightly different here as well I got some some rap text options for some of the cells that I didn't have with the XML HTTP requests but nonetheless we've got the same actual data just took an awful lot longer to get of course one bit of information that is missing from this output whether you're using XML or Internet Explorer is the name of the bookmaker providing these odds if I look back at the actual tournament table in the header section I can see I've got these these lovely little logos at the top in the table header section but if I hover the mouse over the logos although I can't really get the logos themselves or at least I can't think of a quick simple way to do that I could get what is shown if I position the mouse over the cells this little pop-up text which shows me the names of the bookmakers so if I find out where that row is I should be able to see in the table header section that we have a row here there it is the event table header row so as is a specific class name so if I know that that's that table row has got a class name event table header what I can attempt to do for the cells in that row I expand a single one of them this is the bet365 but the bookmakers are available I've got a little a site tag I think it's a new html5 tag I'm not that familiar with the sides but I am interested in here is the a tag stick inside the female there whole to hyperlink basically so we should be able to see in here there's a title property which contains the name of the bookmaker nuts in the little pop-up text so I could check that the row I am on in the table header section is has a class name of event table header then I could capture the first a tag from that table cell and then return the title property to the cell itself so bearing all that in mind let's head back to the visual basic editor and let's make a few changes to our write table to worksheet method let's declare an extra variable then that will be another eye HTML element and we'll use this to capture a reference to alia the bookie link so that's a dim bookie link Ms HTML dot I HTML element HTML element their yoga is spelled correctly eventually now what I would like to do for each individual cell is first of all check which whether the the table row I'm on has a class name of event table header so after incrementing my column number variable I would like to check if table row docked class name equals event table header I remember comparing strings in VBA is case sensitive so it's important I spell that were the same case as is listed and in the web page there are functions I could use to convert the case of the class so I've got L case and view case and then I can make that essentially case insensitive but for now I'll assume that I'm spelling it in the correct way so the table wrote class name is event a table header then what I'd like to try to do then is capture a reference to the first a tag within that table cell so in a similar way to the way we captured the first table within the elements returned by the get elements by tag name property earlier we can say set booky link equals and then we can refer to the table cell variable and then refer to the get elements by tag name again we'll have to spell that out unfortunately the intellisense doesn't work and then in some round brackets and double quotes make sure you spell this one correctly the tag name is a that's a by the way and then you can open up some round brackets and then enter a value of 0 to return the first a tag within that table cell there should only be one so we can we can guarantee that we can refer to that with the index number 0 now it may be the case not not every single cell in that event table header row has an a tag associated with it so what we should then do is check if not sorry if not bookie link is nothing then what we'd like to do is write out the contents of the title property of the bookie link into the appropriate cell so rather than writing out the ad the table cell in a text what we'd like to do instead is say output sheet of cells romanum column value equals bookie link dot title we can then end that if statement we can then stick the other code that writes out the the inner text in the else clause of the main if statement and then in that if statement as well finally what we should do is make sure that we set bookie link equal to nothing before we move on to the next cell in that row so assuming that it isn't that we've earned that we've captured a reference to the bookie link we want to make sure that next time we go round that variable is empty so in this same if statement here we will say set bookie link equals nothing okay so having done all that let's just give the whole thing another quick simple test if I were to run my scrape odds using XML HTTP routine again just for convenience and speed shouldn't take too long and we'll get there eventually a little bit more work going on because we've got all these if statements going on but if I switch back to excel now we should find you I just changed the height of some of these rows here we now have in row number four of the table we've got the names of the bookmakers offering these odds okay so of course there's plenty more scope for formatting the output of this we're getting rid of rows we don't need and changing column widths and changing colors and fonts colors etc just to make it look readable but I think at this point those things are fairly trivial and it's not worthwhile spending any more time in the video showing you how to color things in we've done that in plenty of previous videos and it's relatively straightforward to work out anyway and the key points here in this video of course were working out how to use internet explorer versus XML HTTP requests and seeing the performance advantage of doing that and then of course for will himself how do you get the data from that particular table written out into an Excel worksheet so hopefully that's made him happy thanks for sitting through this one you've made it to the end where well done hopefully you'll get some I'll find some uses for some of the techniques we've seen there and it gives us another nice example of scraping data from websites using VBA so thanks watching see you next time
Info
Channel: WiseOwlTutorials
Views: 15,008
Rating: 4.9800997 out of 5
Keywords: vba, excel, internet explorer, http, xml, request, get, website, scrape, tutorial, course, microsoft, wise owl
Id: R0xpDLzVcuw
Channel Id: undefined
Length: 36min 29sec (2189 seconds)
Published: Fri May 31 2019
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.