Excel VBA Introduction Part 57.2 - Basic Web Scraping with Selenium and Google Chrome

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
welcome to this ysl excel vba tutorial in this video we'll explain some basic web scraping techniques using selenium and google chrome so we'll start with a quick example of opening up google chrome and navigating to a website then we'll explain the various ways you can reference and interact with elements on a web page and that includes typing text into an input box and clicking buttons to navigate to different pages we'll have a quick section on how you check if elements exist and later on in the video we'll look at looping through collections of elements doing something to each one in turn and we'll use that technique to write tables from a website into an excel worksheet towards the end of the video we'll also explain how you can extract hyperlinks from a web table and then assign those hyperlinks to cells in the excel worksheet so quite a lot to do in this one but it's a bit of fun so let's get started here's a basic example of the type of thing we'll aim to create in this video it's basically a search system which allows you to hunt for videos on the wisel website so you can type in a search phrase into cell c2 so i've typed in what i think is an appropriate one for this video scraping websites and then when you click the search for videos button it fires up an instance of chrome to use the search feature on the ysl website to return the list of videos matching your search phrase so it returns it to a table and it includes the hyperlink so you can just directly visit those videos by clicking the appropriate link the way it works is a little too quick to describe as it's actually happening so what basically happens is very much like manually going through this process but doing it much more quickly so it navigates to the ysl homepage in an instance of google chrome it types something into the search box here so i'll type in the same thing scraping websites and then it clicks the go button to return a set of search results we might find stuff to do with scraping websites in various different categories so it will loop over the range of categories that it finds and if there is one called video tutorials it will click on that one to expand that section and locate the table in there and then it loops through all the rows and columns on that table or in that table i should say to return all the various values including the hyperlinks as you've just seen it's not the most sophisticated example you'll ever use selenium to create but it's a nice one to demonstrate all the fundamental techniques that you will pretty much always use so if you haven't used selenium before then this will give you the the basics for doing pretty much anything else you can think of if you do want to follow along of course then you will need to make sure you have selenium basic installed and for your convenience i've created a video which explains exactly how to get it set up um it's one of my shorter videos you'll be pleased to hear so it's uh it's only 12 minutes long and probably five minutes with that is just it's just me uh rambling so um yeah the the good stuff in there you can find fairly easily using the the chapters to get the the software installed there is a bunch of links i only posted that today look at how many views guys thank you so much for this i always appreciate all the views and the comments and the likes even if i don't get a chance to reply to all of them it's always much much appreciated um so there's a bunch of links in that video's description so that you can get all the relevant software you'll need um i am going to be using google chrome for this example but it doesn't have to be chrome the beauty of using selenium is it genuinely doesn't care which web browser you're using um it doesn't judge you if you want to use microsoft edge instead of chrome so feel free to use whichever browser you like assuming you have got the um selenium basic package installed and you've got your driver installed for the browser you're going to work with we'll open up a brand new workbook in excel and then we'll get started with adding the reference to selenium so i've closed down the example i had open and in the new blank workbook i've just created i'm going to head to the developer tab and then choose to open the visual basic editor from here the first thing i need to do is set a reference to the selenium library so i need to head to the tools menu and choose references and then in the list which appears i can type in the letter s to jump down to the section beginning with the letter s scroll down a bit further and find the selenium type library and then check the box next to that then i can click ok and the next thing i'll do is insert a module so i can right click into the project explorer choose inser module and then i'll set up my basic subroutine which will be called what should it be called let's call it scrape wise owl videos there we go the next thing we need to do is create a variable which can hold a reference to an instance of the web browser we'd like to use for my example i'm going to use google chrome so i'm going to declare a variable called ch as and then i'm going to look in the selenium type library and look for a class called chrome driver of course if you were using a different browser then you might want to use an edge driver or a firefox driver or whatever your preference is providing you have the driver for that browser installed but as i say i'm going to use chrome so chrome driver and then i'm going to create a new instance of that driver so i'm going to say set ch equals new selenium dot chrome driver and then i'd just like to to start the application so i can see it running so i'm going to say um ch dot start having done that i'm going to run a quick little test to see what happens so i'm going to run this subroutine and i'll see chrome does indeed pop up um but then it also instantly disappears which is a little bit frustrating particularly as i would like to be able to show you things that happen um in the web browser as we go along through the video the problem is that when the variable goes out of scope at the end of the subroutine it automatically destroys any instance of that browser a simple way to get around that would be to declare the variable at the top of the module so i can remove my declaration of my ch variable so that it sits at the top of the module below option explicit but before any subroutines and then if i just do the exact same things i've done previously i can run the subregime and i'll see chrome appears but it doesn't immediately close itself down when the subroutine ends it's conventional when you declare variables at the module level rather than within a subroutine to use the words private or public so i'm just going to modify that to say private ch as dot chrome driver and then if i run the submarine again just to prove that it doesn't make any difference we'll see chrome pops up and then i can manually close it down myself the next thing i need to do is point the browser to a particular website so i can do that in a couple of different ways but a fairly common and simple way to do it is to use the get method of the driver object so i can say ch dot get followed by a space and then enter the url of the site i want to navigate to it's the only compulsory parameter so i can enter in some double quotes https colon forward slash forward slash www.wisel.co.uk i should know that address fairly well by now so there we go having done that i can run that subroutine again and we'll find that chrome opens up and when it has done so it will then navigate to the url you've requested so that's nice and easy and the technique works perfectly well and if that was the only url of the wisel website you were going to enter into your code then that's almost certainly the technique you would use however if you were writing a longer procedure where you wanted to explicitly navigate to other pages on the ysl website so for example blogs or exercises or videos or whatever that happens to be then you're going to find yourself writing out the same base url multiple times so if this was what you were trying to do there's an alternative technique you can use to save yourself a bit of maintenance let's close down that instance of chrome and then i'm going to go to the end of the start method and type in a space to see my two optional parameters browser and base url it's the base url that i'm interested in so i can get to that parameter in a couple of ways i could just type in a comma alternatively and i think it's better although a bit more effort type in the name of the parameter followed by a colon and an equals what i can then do is copy this url to the base url and then when i get a page i don't need to specify the entire url the page i want to get i can just tag on everything from the end of this base url so for example if i wanted to get to the home page i can just type in a forward slash and then if i run the subroutine again i'll end up eventually there we go on the yzl homepage and but if i wanted to get to the videos page then i can close that instance down and then i can say slash videos and then when i run the subroutine again this time i'll end up again eventually there we go on the videos page instead so um you again this this is useful if we wanted to navigate to lots of different pages on the wizal website but for this example we just want to navigate to the home page so i'm going to take it back to get forward slash and we'll proceed from there now that we've landed on the correct page we want to interact with some of the elements on it and in order to do that we need to know how to identify those elements one way to identify elements is to inspect them in a web browser so i'm going to head over to google chrome again it doesn't have to be the same browser by the way and then i'm going to head over to the wisell home page and i want to first of all interact with this input box to type in my search phrase i can right click on that search box and then choose to inspect it the name of this right-click menu option may be slightly different in different browsers i think it's inspect element in internet explorer it's inspect in the new version of microsoft edge but it should be fairly obvious from the name of the option uh which one to choose i'm going to click inspect and that's going to open up an extra panel on the right hand side of the screen which identifies the thing i've just right clicked on it's highlighted here in pale blue in fact let me just zoom in a little bit so you can see that a little more clearly so we've got several options for identifying this specific object i could use the name of its tag so the very first thing inside the angle brackets there is the name of the tag so it identifies the the type of object we're looking at so this is an input object it's nested inside a form tag which is nested inside a div tag etc now you can probably already tell that tag might not be a good way to identify a single specific object because there may be multiple items with the same tag you can see there are lots of div tags on this page an alternative way would be to use the class so i could enter the class name it's fairly unlikely there's going to be another element with the same class called search on this page but you can't guarantee that id would be the best one to go for if you could guarantee there were unique ids for the objects but this particular element doesn't have an id it does however have a name attribute so what we could do for this particular example is use the name attribute it's called what and then try to return a reference to that object using the word what so let's just zoom back out again and head back to the vb editor in order to do this we need to use one of the find methods of the browser so i can say ch dot and then i want to find an element and you can see that when i start typing in the word find i get a list of lots of methods for identifying elements by various different attributes so the one i want to go for would be the find element by name so i'm going to enter that then i'm going to open up a set of round brackets there's a single compulsory parameter in there which is called name well i already know what its name is it's called what so i'll enter the name watt as a string and then close the double quotes and close the round brackets now assuming that element has been found i want to do something to it just as a very simple example i'm going to send some key strokes to that element so there's a method you can apply to an element called send keys which does very much like what it says it does it send keys and then i can type in a space and then i can enter the um the keys that i want to pass in so just for a simple example in some double quotes i'll say excel vba close the double quotes and that's my code all right so let's run that subroutine and we'll find a new instance of chrome gets opened up and it should now type in excel vba into the element we've identified the next thing i'd like to do is simulate clicking on the go button to perform the search so let's just close down this instance of chrome and then i'll head back to the one that i already have open and i'm already inspecting objects or elements and you can probably tell already immediately below the input tag there's a button tag and if i highlight that with the mouse i don't need to click on it just move the mouse cursor over it you can see it highlights the corresponding item on the page if i wanted to be absolutely certain of course i could right click on that element and choose to inspect it and that that would highlight it in the inspect panel on the right hand side just to zoom in there you can see that this button doesn't actually have a name so we need a different approach to finding this element we could find it by tag name of course or by tag so we could find the button but again it's kind of likely there will be more than one button tag on the same page we have a class attribute which looks like it's something fairly specific again i doubt there's more than one element with the same class on this page so let's give this one a while i'm going to try to search or find an element by class search submit i can double click search submit in this elements panel and then just press ctrl and c to copy that to the clipboard what i can then do is head back to my visual basic editor and then i can type in ch dot find element by class and then open some round brackets and some double quotes enter the name that i've just copied so i can press ctrl and v close the double quotes close the round brackets and then i want to apply a different method to it i don't want to send keys to that object i want to click it so i can type in a full stop and then look for the click method and it genuinely is that simple very much simulating what a human being would do to interact with this webpage so let's give that a quick try then if i run this subroutine at that stage and i'll see it should fire up i might briefly see the home page flash up there it is and then it immediately lands me on the search results page so everything's working quite nicely so far it's all fairly simple isn't it and before we get carried away too much it's worthwhile considering what happens when things go wrong so let's say for example somebody changed the design of this website and perhaps changed the classes used or perhaps changed the names of the objects we're looking for we can't we have no control over that it could happen but we need to write our code in such a way that it doesn't just fall flat on its face when things go wrong let's just head back to the vb editor let's pretend that our we'd um looked for our search box by by a different name let's say we search for it by search uh search input or something along those lines and then after we'd use this code for a while somebody changed the design of the website and renamed that search input box as what so while our code would have worked previously if we attempted to run it at this point it now no longer does we'll see the web page fires up of course but nothing gets typed into that search box and if i close down that instance of chrome i'll find a runtime error sitting behind the scenes in the vb code that's not particularly helpful if only there was a convenient way to check if an element is present on the page um fortunately in selenium there is an incredibly easy way to do this um immediately before we try to send keys to that search input element i'll give myself a couple of blank lines and say if ch dot is element present who could have thought of a better name for a method or a function which returns whether or not something exists um there is one small problem with this a slight complication i should say rather than a problem in order to find the element i'm interested in i have to use something called a by object it's not a particularly meaningful name by but if you think about the methods we've used here so far find element by name find element by class that's kind of what the buy object is is dealing with so just to show you how one of those works i'm going to head up to the top of the subroutine just ignoring that my code has a syntax error at the moment and i'm going to declare a new object or a new variable called find by as selenium dot by i then need to create a new instance of that and i can do it in a couple of ways one would be to do what i've done with the chrome driver i could explicitly set find by to be equal to if i can type it properly i could equal to a new selenium dot by that would work perfectly well alternatively i could create an auto instancing by object or by variable so i could say dim find by as new selenium.buy so that would create a new instance of the of the buy class whenever i needed it to make use of that what i then need to do is head back to the line which is currently indicated with a syntax error and inside the is element present method i want to say find by followed by a full stop most of the methods in here relate to methods you can use in the find element methods of find element by name there's a name option find element by class there's a class option there's an id option there's a css there's a tag option etc the one i'm going to use here is the name method so find by dot name and then in some round brackets i can pass in the name of the thing that i'm trying to look for so let's pretend i was going to look for the search input i'll just copy and paste that rather than risk mistyping it and then i can close a set of round brackets for the name method a set of round brackets for the is element present method and then type in then a couple of blank lines and the end if now the is element present method returns a boolean value so that's either true or false and currently the way i've written my if statement is going to do something in this section of code here if the element is present so if this returns true i want to flip that around to check if the element is not present i can do that in a couple of ways i could check if the result of that expression equals false alternatively i could just say at the beginning here i could say if not ch dot is element present either will be either will work i both are perfectly acceptable choose the one you like the best when i've established that the element does not exist what i'm then going to do is first of all quit from my web browser i'm going to say ch.quit in reality you might like to leave it visible on screen so you can look at the state of the web page at that point in time and check what is going on but just for convenience here i'd like to show you how to quit the browser and then i'll show a simple message box which says could not find uh search input box or dutch input box as i've spelt it there let me just correct that and then i'll show a vb exclamation symbol on the message box as well and then finally i will exit from the subroutine okay so i know that this object doesn't exist so i know that this this if statement is going to be triggered so let's run that subroutine and we should see that when the web the front page loads it immediately closes down straight away there is no delay there almost at all from the home page loading to it telling me that the object doesn't exist i can click ok at that point and then this subroutine will end now you'll often encounter websites whose content is created dynamically often using javascript behind the scenes so when you first load the web page all the elements you can interact with might not actually be present at that time it could take some time for those elements to be generated so one of the nice things you can do with the is element present method is put in a delay give it chance to actually create the object you're looking for before it immediately says no it doesn't exist um i'm going away and quits the application so in the is element present method just just before the final closed parenthesis there i can type in a comma and you'll see there's an optional parameter there called timeout as long now the value you enter here is the number in milliseconds that you want to give the method chance to load the element on the page so let's say we wanted to delay this for three seconds if i enter a value of three thousand if i run the same subroutine again now you'll find that rather than immediately closing down when the home page loads it delays for three seconds giving the webpage chance to actually generate the elements before it then finally decides yep it's definitely not there so i'll close down and give you that message okay so now that we know we won't generate a runtime error if that element isn't there we can change our name that we're searching for back to the original what we know that does actually exist so let's change that in two places in the find by dot name method and in the ch dot find element by name method we should then do something similar to establish whether or not the search submit button exists this is one of the nice things about using a generic find by object if i just copy all of this code here and place it immediately above the line which clicks my search submit button so i've just copied the entire if statement and i'm just going to paste that immediately above the find element by class in the find by object rather than using the name method i can use the class method and then i could copy and paste the search submit button or class name i should say into that class method there's still a timeout property there for the is element present method so i've still got the three second delay i should maybe change my message box so i could say uh could not find a submit button or something like that there we go submit button and then we will avoid seeing a runtime error if the submit button is not there so having added all that extra code it does make it a lot longer of course but it's worthwhile doing this sort of stuff let's run that subroutine one more time and we should see that everything goes back to normal it finds the search box and the search button and ends up on the search results page okay so we're making progress what we need to do now is deal with the search results let me just close down this in instance of chrome and then head back to the one i already have open i'm just going to type in the same search phrase we typed in programmatically excel vba click go and then land on the results page so it's the video tutorials results section that i'm particularly interested in i'm going to right click on video tutorials and choose inspect and if i look over now in the elements list i'd normally get quite excited about this and i get excited about the saddest of things but anyway the the id usually an id is a fantastic thing to find because it's a unique way to reference a particular object or element on a page so if we knew their unique id of the video section accordion part 4 then we could ordinarily reference it however what if we search for a phrase where we returned different results sections different numbers of results sections and the accordion part for the videos wasn't part four let me just go back and search for something different let me let me search in um search for scraping websites we returned fewer results for that so scraping websites i do have some video tutorials but if i right click on this one and choose inspect this one sadly isn't called accordion part four it's called accordion part one so sadly we can't use this what would normally be a great thing the unique id we need to think of something different what we're going to do is capture a reference to all of the elements that have the class w-o form accordion part so we're going to gather all of those up into a single um single variable and then loop through the collection of accordion parts to find the one we're interested in let's deal with getting the entire set of accordion parts first i'm going to double click on one of those just to highlight its name and then copy that to the clipboard and then i'm going to head back to my vd editor and i'm going to declare a variable at the top of this page or at the top of this subroutine that can hold multiple elements so i'm going to say dim i'll call it result sections as selenium dot web elements you'll see there's a web element and a web element you can probably tell from the names that one can hold an instance a single instance of an element and the other can hold a collection of elements so in this example we're going to use web elements while we're here we're also going to declare a variable which will hold a single web element so i'm going to say dim result section as selenium dot web element then we can head down below just below where we've clicked our search submit button and we're going to say set result sections equals ch dot and then we're going to try to find elements by class so we've already used find element singular now we can use find elements by class if i scroll down far enough there it is and the only difference in between this method and the one we've already used is that this one can return a collection of objects so find elements by class open some round brackets i'm going to paste in the class name i've just copied from my instance of chrome so i can open and close some double quotes and then paste in that name that i copied and then close the round brackets and that will return all of the results sections to that variable okay now that i've captured the collection of results sections i'd like to loop over that collection to print out some useful information about each one to do that we can use we've got a few choices here for the various loops in vba i'm going to use a for each loop to do this so i'm going to say for each result section in result sections so fairly standard code to loop through a collection of objects a couple of black lines and then say next result section and then inside that for each statement i'm going to debug.print the result section dot text so the text method there returns the the textual representation of that element on the page now then if i were to run this subroutine we'll see that chrome opens up again of course we're searching for excel vba so we'll get quite a few sections for this one there we go and you'll see that the list of sections we've got all accordion parts as the class is named training pages eight publications one miscellaneous pages one etc and we can see those items reflected in the uh immediate window if you don't have that displayed it's the view menu immediate window option that you'll need and we've also got a question mark below each one as well so there's obviously another element in there which doesn't have a sensible textual representation and we could probably find out what that was if we looked at the various uh sub elements in each accordion part but i'm reasonably happy with what we've returned we're clearly looping through the collection of objects that we're interested in now we don't want to do something to every single results section we only care about doing something to the video tutorial section and now that we've got access to the text of each section we can use that to identify the one we're interested in so we can simply add an if statement above the debug.print statement inside our for each loop we can say if result section dot text and i don't want to use the equals operator because of course the the actual full text here that's returned will be different depending on the number of results so i want to essentially check whether the section text begins with video tutorials and the space in the open round brackets there as well so i'm going to just copy that from the immediate window and then in the if statement i'm going to use the like operator so i can say if the result section text like and then in some double quotes video tutorials open round brackets followed by an asterisk which is the wildcard character so that's essentially the same as saying if the result section text begins with video tutorials i'll add the then keyword to the end of that line indent my debug.print statement and then don't forget the end if at the end okay so having done that now if i just clear the contents of the immediate window and then run the subroutine again we should find this time it only returns the name of or the text of the video tutorial section so in the immediate window that's exactly what we've got the question mark by the way i think is related to the down arrow symbol that's presented on the section that you can click on so there's no way to represent that symbol in the immediate window so it's replaced with a question mark symbol instead we should just check that this works if we test for a different search phrase so rather than excel vba let's go for scraping websites and then even though the number of results will be different we should still return um the video tutorials section text so it's in a different position there's a different number of results but we still get that item printed out into the immediate window now of course it is possible that the search phrase we enter doesn't actually return any result sections at all if i go back to this open instance of chrome and i type in some complete nonsense text in there and click go i don't get any search results which means it's a little bit pointless trying to loop through a non-existent collection of objects in fact what would happen were there no sections to return let's close that down and let's find out so if we change what we're searching for to some completely nonsensical text and then run the subroutine again we'll find this time when it enters the search box types in that returns no results back in vba well nothing happens um it's a little different to trying to find a specific element by name we saw earlier on if you try to find a specific element by name or by class and that object doesn't exist then it returns a runtime error when you try to return a collection of elements so whenever using one of the find elements methods it simply returns an empty collection or a collection with a count of zero if that makes more sense which makes it a little bit more convenient for us to do something different we don't have to use the is element present method and the find by object what we can do here simply after we've captured all the result sections if there weren't any the count of items in that collection will be zero so we can simply say if results sections dot count equal zero then and i'll add an end if and then inside those two lines i'll do something simple like ch.quit will show a message box saying nothing at all was found and with a vb exclamation symbol and then we can simply exit from the subroutine so exit sub okay so i'll clear the contents of the immediate window again and then run that subroutine one more time and although we won't return any sections it will then tell us that that's the case and close down the instance of chrome and it won't therefore enter that pointless loop or do any of the other things that we'll do below this either okay so we've identified our video section what we now need to do is get at the table which sits inside that section so again let's head back to this instance of chrome and we can expand our video tutorial section so we've just clicked on that element and inside there there is a table if i right click on any cell in that table and choose inspect i'll find that i end up somewhere nested inside this table tag in the elements list so you can see that the table is nested inside the div tag for the form accordion part i want to find this specific table the one that belongs to the video section but it doesn't have a specific name it doesn't have a specific id or even a class it's just a generic table tag i want to find the one single table inside an existing element and one of the really nice things about the web element class in selenium is that you can apply the find by methods to it to find elements which sit within that element if this isn't making sense hopefully the code itself will so i want to find a table tag which sits within the element called wisel form accordion part for the video tutorial section okay so let's head back to the visual basic editor and then let's scroll up to the top and i'm going to declare a new variable i'm going to call it dim video table as selenium dot web element so it's another singular element if i now scroll down towards the bottom and just below this debug.print statement in fact let me just comment out the debug.print statement if the results section is the video section then i'm going to set my video table to be equal to i want to find an element by tag but i don't want to do that by applying the find element to the entire browser i want to apply that same method to the result section that i'm currently looking at so i'm going to say result section dot find element by tag and then in some round brackets and some double quotes the tag name nice and simple is table close the double quotes and then close the round brackets and that will capture a reference to the video table just a little bit of housekeeping next if we're looping through the results sections and we do find the video tutorial section then it's pointless continuing this loop afterwards if there were more sections below it so a very quick simple instruction to add in here after we've captured a reference to the video table we can exit the for loop to avoid processing the other sections unnecessarily another thing we should take into account is that we could return some sections but not include the video tutorial section so let's add after this loop we'll check if the video table variable is nothing and if so that tells us that we had some sections returned but no videos so we can add an if statement we can say if video table is nothing then a couple of blank lines and an end if and i will once again quit from the browser so ch.quit and then show a message box which says something like no videos were found and then a vb exclamation symbol and then simply exit from the subroutine before we do anything else uh we'll then need to change the search phrase to return something uh but not videos so our current search phrase is nonsense let's change that let's go for dax measures i know we have lots of material on the website about dax and measures but i'm pretty sure there aren't any videos yet so let's run that subroutine now and it will do the usual thing but tell us that no videos were found and just to prove that that does actually return something let's just go back to this instance of chrome and search for dax measures there we go so we get training pages miscellaneous pages blogs and newsletters but no videos well at least not yet um hint so there we go so we've handled a bit of done a bit of basic housekeeping to make sure that we uh return sensible messages according to the values returned okay so we're at the stage now where we can write out the results of this table into a worksheet in the workbook there are several different ways we can do this but there's one particularly quick and simple if a bit limited way to do this using selenium if we reach this stage of our code after this if statement we definitely have a video table in this variable so what we can do is say video table and then we're going to use a method of the web element called as table what we can then do is apply to that the method to excel we've got two choices here we could say data which returns an array of all the information in the table or use the 2xl method which will just write the entire table into a worksheet providing we specify where it should go so i'm going to use the to excel method and the target parameter it is optional but i'm going to specify i want my target to be the top left cell on a brand new worksheet so to do that i'm going to say this workbook dot worksheets dot add and then that returns a reference to the worksheet that has been created and i want to reference range a1 on that sheet and it genuinely is that simple that one single instruction let's go and enter a search phrase which we know definitely returns some results so we'll go back from dax measures we'll go to let's go for scraping websites this time and then if we run the subroutine we'll find some video tutorials nine video tutorials we can then have a quick look back in the excel workbook and find the layout isn't particularly nice i haven't really spent any effort on that of course yet but you can clearly see that we've written out the nine results including the column headers for those videos you can obviously see some obvious limitations here as well however i suspect um one being this slightly weird formatting for that cell let's change the row height for that cell it's done a wrap text thing in there and it's also not included the urls or the hyperlinks to the urls of the videos themselves this is fantastic when you have just basic raw table data you want to copy out into a new worksheet you can easily tidy up the formatting of this afterwards with some basic vba code the really irritating thing for us though is that it doesn't include the hyperlinks and if we want to get at the hyperlinks then slightly annoyingly we're going to need to loop through each table cell in this table and extract the hyperlinks and write them into the cells individually in the new worksheet okay i'll start by tidying up the workbook i'll delete this extra worksheet and then we can head back to the visual basic editor and to make this work i think i'll create a separate subroutine which we can use to process the video table i'm just going to comment out the line that we've just used to write the entire table out in one go and then below this subroutine write a new subroutine called process table and then i'll create a parameter for that which i'll call table two process as selenium dot web element a couple of blank lines in there and then we can start writing the code to process each individual row of the table followed by each cell in the rows we're looking at let's declare some variables then to help us work through the rosen cells in the table i'm going to start by declaring a variable called all rows as selenium dot web elements so this is going to contain the collection of rows or at least the collection of web elements with the table row tag and then i'm also going to declare a variable called single row as selenium dot web element so this is very similar to the structure we created for the results section and result sections similarly we want to be able to then loop through the individual cells on each row so i'm going to declare another variable called all row cells as selenium dot web elements and then dim single cell as selenium dot web element i'll spell that correctly eventually there we go okay the first thing i'd like to do then is to capture a reference to all the rows of the table just to show a little bit of information about the tags involved in a table you may have spotted these earlier on but if i switch back here to the um to the instance of chrome i have open and have a look at the elements list you can see that within the table tag we've got a table body tag and then each individual row is indicated with the tr tag within a table row you've either got a th tag when you're in the header row of the table or a td tag when you're in just a basic data row so we need to be able to handle all those situations whether we're on a header row or on a data row and we need to do this for every single row in the entire table no matter how many there are so to make that work let's go back to the vb editor the first thing i'll do is capture a reference to all of the rows in the table to process so i'm going to pass that in of course from the previous subroutine so the way this is going to work is i'm going to say process table and then i'm going to pass in my video table to this procedure so video table will come in via the table to process parameter and then i'm going to say set all rows i'll try that again set all rows equal to table to process and then i'm going to find elements by tag i want to find all of the table rows so find elements by tag okay i should have just written the entire thing out when i'd reached that stage the tag name that i'm looking for this time for table rows is tr so i can enter tr in some double quotes and some round brackets and that will capture all of the rows so from there what i can do is write a for each loop i can say for each single row in all rows give myself a couple of blank lines and then say next single row so that little loop there will loop through all of the rows in the table that we've passed in now on each row i want to capture all of the individual cells on that row and that's a little tricky because i've got two different tags to look for here it could be either td or th most of the rows will be td rows there's only ever one header row so what i'm going to try to do to begin with is to say set all row cells to be equal to single row dot find elements by tag and find that quickly find elements by tag and then i'm going to use the td tag to find the table data cells what i could then do is check whether or not i've found any cells with a td tag so like we did earlier on to find the count of result sections i'm going to say if all rowcells.count equals zero then i'll just add my end if before filling in the details so if i don't currently have anything in the all row cells variable i'm going to say set all row cells to be equal to and then it's just the same here single row dot find elements by tag so i'll just copy and paste this i should have copied and pasted the entire line really and then change the tag name from t d to t h instead okay so having captured the um all the cells in the row we can then loop through those and we can do that by saying for each single cell in all row cells a couple of blank lines and then say next single cell so there's the basic loop structure set up we've got the inner loop looping through the cells in the row and the outer loop looping through the rows in the table so just to establish whether or not we're finding anything we could use a debug.print statement to write out the text of each cell into the immediate window so let's say something like this let's say debug.print try that again debug.print single cell dot text okay looks good so at this point i'm going to scroll back up to the original procedure just check what we're looking for so we're looking for scraping websites which does indeed return some results so that's good so we can just run that subroutine by clicking somewhere within it clicking the run button or pressing f5 and it will find nine video tutorials and it will print out well apparently absolutely nothing although it has printed something look you can see the the scroll bar in the immediate window has indeed changed height just to prove that let's just clear out the contents of the immediate window so there's currently nothing there yet and if i go back up to the original subroutine and run it again you'll see that the scroll bar on the immediate window does change height as it writes out some values not the values it's writing out all the text that it's writing out apparently is blank now there's a good reason for this um it's because the text property isn't returning some actual text because the table isn't currently displayed so what we need to do in the main routine once we've found the video section we need to make sure that the table is exposed to the browser by clicking on the result section object so let's just do that it's a nice simple thing to add in so in the if statement which checks whether we're in the video tutorial section before we set the video table variable we can say result section dot click just like we did for the button earlier on okay so let's have a look at what difference that makes now if i click back into that subroutine and run it one more time we will see this time that you can see that in the background i hope already um first of all on the um chrome browser you can see that we have indeed expanded that section to expose the text and then in the immediate window you can now see that we have indeed printed out the contents from each cell into that window okay let's move on then to writing out these values into an actual worksheet in the workbook we'll scroll back down to the process table procedure and in here we're going to declare a few new variables i'd like a variable to hold a reference to the output worksheet so i'm going to say dim output sheet as worksheet then i'm going to have two variables to keep track of the row number and the column number so i'm going to say dim roanum as long and then on the same row i'm going to say call num as long as well okay so those two variables simply keep track of the row number and column number and then finally i'll have a target cell variable so dim target cell as range now all these will be helpful they're not absolutely necessary um but they will be helpful for for identifying where we're writing these values out to i need to create a new worksheet and i want to capture a reference to it so i'm going to say set output sheet equals this workbook dot worksheets dot add and then inside my two loops my two four each loops i want to keep track of the row number and column number that we're on so for each row number so for each single row that we encounter i want to add one to the current value of row num numeric variables in vba begin at zero or are initialized to zero so we can simply say here row num equals ronan plus one i want to do something similar in the for each single cell loop so each time we go to a new column on that row we're going to say column equals column num plus one we also want to make sure that when we reach the end of a row we reset the column number back to zero before we move on to the next row so in between the two next statements we're simply going to say call num equals zero what we can then do is capture a reference to the target cell so we've got a variable called target cell as a range and after we've calculated the row number and the column number i'm going to say set target cell equals output sheet dot cells open some round brackets and say ro num comma column and that makes it much easier to manipulate that cell so we can say on the next line we can say target cell dot value equals and then we can say single cell dot text so rather than writing it to the immediate window with dbook.print in fact i'll just comment that line out we can now write it out into an actual cell on the worksheet having done that let's head back up to the original subroutine click somewhere within there and then we can run that subroutine we're returning the uh the scraping websites list so that's nine rows plus the headers and if once that's finished we switch back into excel we will find that we get all those values written out into the worksheet so we're basically at this point at the same stage we were with our single instruction from earlier but now we've also got the ability to capture the hyperlinks which is what i wanted to do okay so we're getting closer the next thing we'll do is have a look at how to grab the hyperlink from the last cell on each row we have a few choices for how to do this but let's focus on the the important parts first of all we need to know how to capture the hyperlink itself we've got the text from the cell by using the the text method to get the hyperlink what we need to do if we navigate to a table which has some hyperlinks in it if i right click on one of those and choose to inspect it you can see hopefully over here we've got a tag with an a next to it so the the a tag represents a hyperlink there's a particular attribute of that tag called href which actually stores the the address the url that's the the thing links to and that's the particular thing that we need to capture now each cell in the last column of the row contains a single hyperlink except for the the header cell but the other cells on the same row don't contain any a tags so we need to be slightly careful about how we get a reference to the web element for me i think the safest thing to do rather than trying to check for each cell we encounter whether the a element exists i'm just going to capture all the a elements within each table cell and then if the count of those a elements is zero i won't bother doing anything if the count of those a elements is more than zero i'll get the href of the first a tag that i've captured and then assign that as a hyperlink to the target cell in excel and if all that sounded like a little bit too much to take in in one go let's break it down and do it step by step so the first thing to do is grab all of the hyperlinks sitting within the table cell we're currently looking at back in the vb editor let's scroll down into the process table subroutine and i'll declare a new variable that says something like dim video links as selenium dot web elements what we can then do is scroll down into the for each loop which is looking at each single cell and after we've written out the text into the cell let's add another line in fact let me just get rid of this comment i don't need that any longer i'm going to say set video links equals single cell dot find elements by tag so find elements by tag open some round brackets and some double quotes and look for the a tag close the double quotes close around brackets and that will do one of two things it will either find return a collection consisting of zero items when there are no tags to find or no a tags to find or a collection containing the one single hyperlink belonging to that last cell in the row so at this point we're going to check if video links dot count is greater than zero then we want to do something so we can head down a couple of lines and then say end if and then inside that if statement just to explain how to get access to the href attribute i first of all need to be able to reference the first item in the video links collection i can do this by saying video links open some round brackets and then just pass in the index number of the element i want to return the indexing in selenium for vba or selenium basic at least is in a has a base of one so the first element is numbered as one rather than zero as you sometimes encounter in different libraries so video links one and then i can refer to the attribute method and then i can specify the particular attribute by passing in its name so in some double quotes i'm going to say href and then close the double quotes and the round brackets just to show what that returns i'm going to add a debug.print statement to the beginning there then i'm going to clear the contents of the immediate window and then i'm going to head back up to the original subroutine and click somewhere inside it and then run it so again we're looking for scraping websites so we should return nine rows and therefore we should return nine hyperlinks and if we look in the immediate window that is indeed exactly what we get so we've successfully returned the href attribute of the a tags for the individual cells next we can assign the hyperlink to the cell in the worksheet so let's scroll down again into that same little if statement where we've just used the debug.print statement i'll comment that out and then i'll refer to the target cell variable so again target cell is the range object in excel so i can say target cell dot hyperlinks dot add and then there are two parameters to fill in the anchor object which will just be the target cell and then the address that we're trying to reference which is what we've just returned from the href attribute so i can just copy and paste video links one dot attribute href and that will assign a hyperlink to the text in that cell okay so we're pretty much there let's just head back up to the original subroutine again i'll just tidy up in excel as well i've got a couple of extra worksheets here that i don't need at the moment so let's get rid of those and then if we return to the visual basic editor select the previous subroutine or the original subroutine and run that one again once again returning the scraping websites table when i return back to excel i now have the hyperlinks added to the visit text in those cells so at this point we're basically there we've done all the hard work using selenium to get the data out into an excel worksheet including the hyperlinks apart from this all we need to do is the tidying up and this is just fairly bog standard excel code at this point so there's not much to see here but just for the sense of satisfaction of completing the entire example i'm just going to spend a couple of minutes making sure that their table columns are all the correct width turning it into an actual table so we can quickly sort it and then adding the basic interface to sheet1 so we can type in and then click a button to return some videos so back into the visual basic editor first i'm going to head down to the end of the process table subroutine and after we finish looping i'm going to refer to the entire region of the table that we've just created so i can do that in separate ways but output sheet dot range a1 dot current region and i want to refer to the entire column of the current region and then i want to apply the auto fit method to that to us to make sorry not auto filter i meant auto fit to make sure all the columns are the correct width what i can then do is convert that region into a table so i can quickly sort and filter it so i can say once again i can say output sheet dot list objects dot add and then i can set the source of this so i can say source colon equals and then i can say output sheet dot range a1.currentregion so i'll just copy and paste that from the previous line so having done this if i scroll back up to the original subroutine and then run it again i should find that the output is a little bit neater looking this time so back into excel and there we go there's a nicely formatted table with all the columns the correct width finally we could head back to sheet 1 and then we could add some basic interface to this maybe type in some text into cell b2 uh search for videos on and then we'll use cell c2 to enter our text so we can do some basic formatting here search for videos on i'll just change the background color and the font color nothing particularly complicated this is not my forte as i'm sure you'll be aware if you've watched any of my previous videos and then let's just type something into there let's type in let's go for excel or vba that returns quite a few videos and then let's add a button to run our main subroutine so i can head to the developer tab in the ribbon and head to the insert tool and then i'll add a basic form control so i'll click on the button and then draw this into the cells below if you want your button to line up exactly with the edges of cells just hold down the alt key on the keyboard the one at the left hand side of your spacebar while you draw your button and it will snap to the edges of cells which is kind of satisfying i'll assign the only macro scrape wisel videos click ok and then i can alter the text on the button i'll call it find videos and then we just need to modify the original subroutine to read the contents of cell c2 preferably checking that range c2 is not empty let's head back to the visual basic editor and scroll all the way up to the top of our original i'm just going to modify the codename of sheet1 here so it's got a more sensible name i'll call this one search sheet and then after my list of variable declarations but before i do anything else i'm going to say if search sheet dot range c2 dot value is an empty string then and if and then inside that if statement i'll simply exit from the subroutine why not i don't even need to display a message i think i can then scroll down a little further and find my send keys method and rather than writing in this explicit search phrase i can remove that and simply refer to search sheet dot range c2 dot value and that's the whole thing done so back into excel at this point we can type in any search phrase we'd like but when we click the videos or find videos button it will fire up chrome go to the weisel homepage type in our search phrase return all the results find the video section video tutorial section expand the table copy every single cell from the table takes quite a long time here because of all the rows there are 126 results here um we can speed that up a little bit actually by just switching some basic settings in excel so yeah but the end result is you get your table listed out onto a brand new worksheet so i think that brings us to the end of the video on basic web scraping techniques using selenium as i said at the start it's not the most sophisticated example you'll ever create but it does hopefully lay the groundwork for how selenium works and gives you the fundamental techniques you will always need to rely on whichever website you're trying to scrape so hopefully it opens up a whole new world of possibilities for you using selenium to scrape any website using any browser um thanks for sticking with it particularly particularly if you are still watching um i appreciate this one's been quite a long one um hope you enjoyed it thanks for watching see you next time
Info
Channel: WiseOwlTutorials
Views: 21,038
Rating: 4.9762845 out of 5
Keywords: excel, vba, visual basic for applications, selenium, web scraping, scrape, scraping, scraping websites, website, web browser, chrome, google chrome, chromedriver, webelement, webelements, loop, if, for each, copy table, tag, class, findelement, by, wise owl
Id: y7yWL0oCB3k
Channel Id: undefined
Length: 65min 39sec (3939 seconds)
Published: Fri Jan 15 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.