Web scrape dynamic content with Selenium

Captions
Hello everyone, today I'm going to show you how to web scrape dynamic content. By this point I hope you are familiar with web scraping traditional static HTML content. The reason we're moving on to dynamic content is that a lot more content on the internet these days is dynamic: the user has to click something in order to activate part of the web page, which is a lot neater and nicer than traditional static HTML. To handle these dynamic pages you need separate packages beyond the traditional Python ones, so that's what we're going to cover, and today I'm going to use a package called Selenium.

You've probably noticed that I've opened up a Selenium driver. I'm going to import the webdriver and use Chrome as the browser that Selenium will drive to simulate a user. This is the location where I have the chromedriver; you will need to change it to your own path. I pass that path to the driver, execute this first, and make sure everything is running and Selenium is ready to use. Now that I've run it, it has activated the chromedriver and opened up a blank browser window that Selenium is going to use for web scraping.

Next I need a web page to scrape. I'm going to use a Sephora product page and try to scrape its consumer reviews. Down here you can see that there are a lot of consumer reviews, and if you try to scrape them through the traditional static HTML approach you won't be able to. I can show you this by using the Ctrl+U shortcut, or right-click and View Page Source, to get the page source. You'll notice that I can't find the comment that starts with "Although it does feel": if I search for "although", it gives me zero results, which means the static source does not contain the dynamic content that the web page has.

To get at it, I need to use Chrome's Inspect tool instead. If I hit Inspect, I can see the area where this comment lives; if I keep clicking down, here we go, it's under the span with class BVRRReviewText, and now you can see "Although it does", which matches the "Although it does" that users see in the browser. So this is the dynamic content that Selenium has to get.

Down here, what I'm going to do is use Selenium's get function. First I need the URL, which is this URL right here, so I copy it. Remember the empty browser I had, this empty browser right here? Once I run this, that empty browser loads that same URL, and you can see it is now on the same page we are trying to scrape.

OK, so now the browser that Selenium is driving is open. The next step is to get these reviews, and you can see that the reviews are under the class BVRRReviewText, so I'm going to grab those elements with the webdriver's find_elements_by_class_name method.
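Below is a minimal sketch of the setup described so far, written in Selenium 4 syntax (the video uses the older Selenium 3 method names such as find_elements_by_class_name). The chromedriver path and the product URL are placeholders, and the class name BVRRReviewText is the Bazaarvoice review class as read off the Inspect panel in the video.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # Path to your local chromedriver binary -- change this to your own location.
    service = Service("/path/to/chromedriver")
    driver = webdriver.Chrome(service=service)

    # Load the product page whose reviews we want to scrape (placeholder URL).
    url = "https://www.sephora.com/product/some-product"
    driver.get(url)

    # The review text sits in span elements with the class read off the Inspect
    # panel; Selenium 4 uses find_elements(By.CLASS_NAME, ...) for this.
    reviews = driver.find_elements(By.CLASS_NAME, "BVRRReviewText")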
The class name was BVRRReviewText, I believe; if I go back here it says BVRRReviewText, yes, with the capital letters. Once I run this, reviews holds all of these consumer reviews. What I'm going to do next is write a for loop and start printing out the reviews, so I say for post in reviews and print each post as a test. Once I run this, let's see what happens. OK, it works: each post holds a review and prints correctly, and it's printing multiple reviews, not just one. You can see that the first post starts with "Although it does feel", which is right here, "Although it does feel", and the second post is "Got a mini one", and here we go, "Got a mini one" is the second post you can see on the page.

Now, it seems the first page only shows five reviews: one, two, three, four, five. There are a lot more posts than just five; we have about 4,200, and if you're interested in web scraping you want a lot more than five in order to correctly capture the consumer sentiment for this product. So what we need to do now is get the path to the remaining consumer reviews, and we can get that from these page links down here. You can see that only the first one is unclickable; the second, third, fourth, and fifth are all clickable, because the cursor turns into a hand when you hover over them.

Now we need the XPath of the page 2 link, and the way to get it is to highlight it and inspect it again with Ctrl+Shift+I. This seems to be the URL on the sephora.com server that serves the reviews, and it has a page parameter, page=2, so it goes to the second page. What we need to do is have our Selenium browser simulate going to that second page. So I'm going to ask the driver to find the element by XPath. To get the XPath, I go to the element for the page 2 link that is highlighted, right-click, and choose Copy XPath, which copies the XML path. Once I copy that, I paste it inside single quotes, because double quotes are already used inside the XPath, and then I call click. What this click does is ask the webdriver to click through to that page. If you go back to the web page Selenium is currently using, it is still on the first page, but once I run and execute this, you'll notice the page has moved to the second page: the consumer comments have changed, and now 1 is clickable and 2 is not; you see the hand on 1 instead. So this gets us to the second page through its XPath.
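Here is a hedged sketch of the print loop and the page 2 click described above; the XPath string is only a placeholder for whatever Chrome's Copy XPath gives you on the real page.

    # Print the first batch of reviews as a sanity check.
    for post in reviews:
        print(post.text)

    # Click the "2" pagination link to load the second page of reviews.
    # The XPath below is a placeholder -- paste in the value from
    # Chrome's "Copy XPath" on the actual link.
    page_two = driver.find_element(By.XPATH, '//a[text()="2"]')
    page_two.click()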
Now we can basically do the same thing over again that we did for the first page: whoops, OK, here we go, get the class name and get the reviews, except I'm going to call this one reviews2, just so we don't get confused. If I run this now, Selenium is going to get the second batch of comments. If I execute it, you now see a different set of comments: "This primer does not cover pores" and so on, which is different from the first comment, "Although it does" and so on. So now you have correctly web scraped two sets of reviews, both of them dynamic: the first one was reviews and the second one was reviews2. You can do the same thing for the third page and so on and so forth, because it seems the Sephora site just changes the page number; you can change this to three, or you can even put it inside a for loop (see the loop sketch just below this transcript) and keep going until, I don't know, maybe about a hundred or two hundred pages, whenever you think you have scraped enough text for the analysis you want to run on these consumer reviews. So this is how you can do web scraping for dynamic content: it's a little more interactive, it has to simulate human usage, and Selenium is a pretty good package for that. I hope this video helps you walk through some of the steps and the logic of how you can web scrape dynamic content. Thanks.
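As a rough sketch of the loop idea mentioned at the end, the same pattern can be repeated over several pages. The pagination locator, the fixed sleep, and the page limit below are illustrative assumptions, not details taken from the video.

    import time

    # Collect the first page, then walk through a few more pages the same way.
    all_reviews = [post.text for post in driver.find_elements(By.CLASS_NAME, "BVRRReviewText")]

    for page in range(2, 6):  # page limit chosen arbitrarily for illustration
        link = driver.find_element(By.XPATH, f'//a[text()="{page}"]')
        link.click()
        time.sleep(2)  # crude wait for the dynamic content to re-render
        all_reviews.extend(p.text for p in driver.find_elements(By.CLASS_NAME, "BVRRReviewText"))

    driver.quit()
    print(f"Collected {len(all_reviews)} reviews")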
Info
Channel: Jinsuh Lee
Views: 29,181
Rating: 4.9410319 out of 5
Keywords:
Id: O--WVte1WhU
Length: 12min 35sec (755 seconds)
Published: Wed Jul 06 2016