How to SCRAPE DYNAMIC websites with Selenium

Captions
Hello everyone, welcome, John here. In today's video we are going to be looking at more Selenium stuff: we are going to be web scraping dynamic websites using Selenium. This is part two of my little mini series. Part one was on the basics, so if you haven't seen that, go and watch that, and let's get going.

So in this video what we're going to do is scrape some data off YouTube. We're going to take a YouTube channel (we'll use mine for this example), sort by top videos, and use Selenium to go ahead and get that information: the top videos, the views, and when they were posted. Now, you can't do this using requests and Beautiful Soup, because YouTube dynamically loads all of its content and you won't be able to get that information out. So that's where we get to use Selenium.

So the first thing to do is import Selenium into our script, so we're going to do from selenium import webdriver. That's the only one we're going to need to import for this one. The next thing we want to do is set our URL, so url is equal to, well, I have this opened up here. This is my YouTube page, and you can see that I have gone into Videos and then done sort by most popular, and that has given us this URL at the top. This will sort it by the most popular videos and give us the top however many are loaded on the page, which we can then get the information off. So I'm going to copy that, go back over there, and paste it in here. Nice and long.

So the next thing you want to do is use our Selenium driver to access the page, so we need to assign that. We're going to do driver = webdriver.Chrome(). I prefer to use Chrome, that's all, I just do. And then we can do driver.get(url). So now if I run this, it should open up the URL that we just looked at in Chrome. There we go, it has worked, and we can see that it's got the most popular videos at the top in sorted order. So that's good.

Now we want to open our browser up again and inspect the elements to see where the information is that we want to extract. So if I open this back up and we go to Inspect (I'll make this a bit bigger and zoom in a bit so hopefully we can all see), we need to find out with our inspector where everything is and what the elements look like that we can get to. Straight away I can see that this is where the video title is, right here, because it's highlighted on the left-hand side. Now, if we wanted to just get this particular element and the information from it, we could just copy this and go ahead and run that. But we want to be able to loop through each one of our videos, so we need to look up a bit further and find something that's common, that we can get all of the elements off and then loop through.

So if we just go back up a couple: that won't work, that's not good enough, keep going, there. OK, this one. If we click on this one, we can see that it seems to have all the information of the video highlighted. This is a class that reads "style scope video renderer". I'm going to copy that for now, and if we go back down to another video, we can see that we've got it again here, "video renderer", and it highlights the next one. So this looks like a class that we can use to get all of the elements that match it and then extract the information from them. I'm just going to go and paste that in over here so we've got that saved.

Within each one of those we want to be able to get the title, the views, and how long ago it was posted. So we're still within the class that we've gone and found here. Within that, this is where the title is, so we can go ahead and copy the XPath and put that in. Then the views, let's copy that, and also the "two months" there, that's down here. That's behind me, but you can see it there. Let's copy that and close that, because we've got all the information we need.

So now we can start writing. Let's do videos is equal to driver.find_elements_by_class_name. Notice that's find elements, plural, because we want every single one of this class. And let's put this in here, so that's the class that we found that had all the information in for us.

OK, so now we want to loop through: for video in videos. Let's do title is equal to... Now, if we were to do driver.find_element_by_xpath, we would only ever get the first one. What we need to do is go within each individual element, which we've called video. So let's do video.find_element_by_xpath and put this in here. What this is saying is: we look for every single element that matches this class name, because it's a plural, and store those in videos. For each one in there, we store it in this variable, video, and then we look in that variable for the element that matches this, and that will get us the title out.

Now there's one other thing that we have to do for this: we have to put in a little dot here, at the start of the XPath. What this means is that the search is within this element as opposed to within the whole page. That's very important, because otherwise we wouldn't get the information out that we're actually after.

So let's do the same for views: views is equal to, again, video.find_element_by_xpath, and let's put the views one in there, again with our dot. And then for the posted time, we'll just call this when: video.find_element_by_xpath, put that in, and again our dot. You can get rid of these, we don't need these now. What we do want to do is put .text on the end of each of these, so we get the text information and not the object information or whatever else might come out.

So let's print that out now to see that it's working: title, views, when. Now if we run this, hopefully we've done nothing wrong, and this should give us all the videos. There we go, you can see it's just gone through each one of those. Let's close this out. We can see the title, the number of views, and how long ago it was posted. So that's basically the crux of it.

What we'll do is import pandas and store that information in a data frame, so it's easier to view and manipulate, or export to Excel or whatever. So we'll import pandas as pd. This is my favorite way. Although pandas is a very powerful library used for a lot of data analysis and data science, I find it really easy for small things like this just to import it and put my data into a data frame, so then I can export it to Excel or CSV or whatever I want.

So what we're going to do is create an item. I'm going to call this vid_item, because I've already used video. Let's say title is equal to the title that we found, views is views, and, let's call it posted, that's what we called when. That's going to store all of the information inside our dictionary here. Then, if we create a blank list, we can add that to it. Let's call this video_list (imaginative names), that's our blank list, and let's say video_list.append(vid_item). So all we're doing is going through each of the videos, like we just saw where we printed out the names and the views, storing each one in a dictionary, and then appending that to a list. The last thing we want to do is put that list into a data frame, so let's just do df is equal to pd.DataFrame and the video list, like that, and then we can print out our data frame.

OK, so there we go. We've managed to use Selenium to extract information from a dynamic website. We've used the YouTube channel, we've gone ahead and filtered it and sorted it how we wanted, copied the URL, and, by inspecting the elements and finding where everything is, we've extracted the title of the video, the views, and how long ago it was posted.

Thanks everyone for watching. Hopefully you've enjoyed this video. Give me a like if you have, leave me a comment, and if you like what you see, subscribe, there'll be more. This was part two of my Selenium mini series. The first one was much more of the basics, but we still did some cool stuff in there, so if you haven't seen that, go and watch that one. In this one we've used Selenium to extract dynamic information from dynamic websites, and in the next one we're going to explore how to run your Selenium scripts headless on a Linux server. That's really cool: we could take this little script that we've written here and set it up to run automatically every month, every week or something, and then give us the information. So stick around for that one. Thanks guys, bye.
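
The steps described in the captions can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen in the video. It assumes Selenium 4, where the find_element_by_* helpers spoken in the captions have been removed in favour of find_element(by, value) and find_elements(by, value). The class name and the three XPaths below are hypothetical placeholders, since the captions do not spell out the exact selectors; copy the real ones from the browser inspector. The names scrape_videos and run are also assumptions made for illustration.

```python
# Locator strategies as plain strings. These are the same values as
# Selenium's By.CLASS_NAME and By.XPATH constants.
BY_CLASS_NAME = "class name"
BY_XPATH = "xpath"

# Hypothetical placeholders: replace with the values copied from the inspector.
CARD_CLASS = "video-card"
TITLE_XPATH = ".//h3"
VIEWS_XPATH = ".//span[1]"
WHEN_XPATH = ".//span[2]"


def scrape_videos(driver):
    """Return one dict per video card on the page the driver has loaded."""
    video_list = []
    # find_elements (plural) returns every element that has the shared class.
    for video in driver.find_elements(BY_CLASS_NAME, CARD_CLASS):
        # Each XPath starts with a dot, so the search stays inside this card
        # instead of matching the first hit on the whole page.
        vid_item = {
            "title": video.find_element(BY_XPATH, TITLE_XPATH).text,
            "views": video.find_element(BY_XPATH, VIEWS_XPATH).text,
            "posted": video.find_element(BY_XPATH, WHEN_XPATH).text,
        }
        video_list.append(vid_item)
    return video_list


def run(url):
    """Open Chrome, scrape the page at url, and return a pandas DataFrame."""
    # Imported here so scrape_videos can be tried out without a browser.
    import pandas as pd
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return pd.DataFrame(scrape_videos(driver))
    finally:
        # Close the browser even if scraping raises an error.
        driver.quit()
```

Calling run with the sorted channel URL copied from the browser's address bar, then printing the result, should show one row per video with title, views and posted columns, provided the placeholders have been swapped for selectors that match the live page.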
Info
Channel: John Watson Rooney
Views: 49,930
Rating: 4.9590445 out of 5
Keywords: python, learn python, selenium, dynamic websites, web scraping, webscraping
Id: lTypMlVBFM4
Length: 11min 4sec (664 seconds)
Published: Wed Feb 12 2020