How I scrape JAVASCRIPT the easy way

Video Statistics and Information

Captions
In this video I'm going to show you another method for scraping dynamic content and JavaScript websites with Python. In previous videos I've shown you how we can use Selenium to load the page up and then go through all the elements and extract the text that we need, but this method uses Selenium just to load the page; then we return the HTML that it has rendered and put it through an HTML parser like BeautifulSoup. It's quicker, it's easier, and it works really well on some websites. So hi everyone and welcome, my name is John. If you're interested in more web scraping content, please consider subscribing to my channel; there's lots on there already and lots more to come. So let's get into it.

I'm just going to be working on one of the test websites today, the Quotes to Scrape site, the JavaScript version, just as a demo. If we look at the source code we can see our script tag: it starts up here, it's running jQuery, and down here you can see the loop it uses to create all of the divs with the quotes in them. If we go to the website and do inspect element we would actually see the HTML there, but we can't actually access it that way. So let's try it with requests: import requests, set our url, which I've just copied, with /js on the end, which is the JavaScript version of this website, then do r = requests.get(url) and print r.text. If we run that, we can see that we just get back what we saw in view source, so none of the actual data. We couldn't then parse this response with BeautifulSoup to get the divs and the text out.

But what we can do, as I said, is use Selenium to load the page up and then return the HTML from that page, so we only have to use Selenium once, and then we parse the result with BeautifulSoup. That's much quicker than going through and getting each and every individual element from within Selenium. I'm actually going to use Helium, which is a wrapper built on top of Selenium; I use it because it simplifies all the commands, and I suggest you do the same. If you want to use pure Selenium, that's absolutely fine, it all works the same way, and they share some commands too. So I'm going to do from helium import *, keep the same url, and call start_chrome with the url and headless=True. I'll save that and double check that there are no problems. Yep, good. Now, instead of just calling start_chrome, we want to assign the result to a variable, so I'm going to say browser = start_chrome(url, headless=True). It works the same way, but now we can do html = browser.page_source. So what we've done is use Helium's headless Chrome browser to load the page up, and then grab the page source, which is the actual rendered HTML. Now if I print html, the output is quite big, but if I scroll up right here we can see everything that has been rendered: if you look through it, it's all the divs with the quote class and the text. So then we could use BeautifulSoup to actually get that information out.
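A minimal sketch of the two approaches described above, using the quotes.toscrape.com/js demo site mentioned in the video; it assumes the requests and helium packages are installed and is illustrative rather than the exact code from the video:

    import requests
    from helium import start_chrome, kill_browser

    url = "http://quotes.toscrape.com/js"

    # Plain requests only returns the unrendered source: the quote divs are
    # built in the browser by jQuery, so they are missing from this response.
    r = requests.get(url)
    print(r.text)  # just the page template and the script tag, no rendered quotes

    # Helium (a wrapper around Selenium) loads the page in headless Chrome,
    # lets the JavaScript run, and hands back the rendered HTML.
    browser = start_chrome(url, headless=True)
    html = browser.page_source  # fully rendered HTML, ready for an HTML parser
    print(html)

    kill_browser()  # not shown in the video; closes the headless Chrome process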
I'll do a quick demo of that now. We do from bs4 import BeautifulSoup, remove the print we no longer need, and then do soup = BeautifulSoup(browser.page_source), specifying the "html.parser". Then we can print soup.title.text to check that it's worked, which it has. So we can do quotes = soup.find_all(), because we want every single one, on the div with a class of quote. If we print that out we should get all the HTML divs with the text in them. Okay, yep, there we go, that's worked. Now we can just do a for loop: for item in quotes, print out item.find() on the span tag with a class of text, and if I put .text at the end of that, so we just get the text of that element, we should get the text back for each and every single one, which we do.

So that's it. That's quite a cool way of just using the browser to get the HTML for you. It works in quite a lot of applications, and I've used it quite successfully before. So if you've got any sites where you're trying to scrape JavaScript, give this a go. Hopefully you've enjoyed this video; consider liking, commenting, and subscribing, it always helps, and I will see you in the next one. Thank you, bye.
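Putting the whole walkthrough together, a rough end-to-end sketch, again assuming the helium and beautifulsoup4 packages (plus a working headless Chrome setup) and the quotes.toscrape.com/js demo site, following the steps described above rather than reproducing the exact code from the video:

    from bs4 import BeautifulSoup
    from helium import start_chrome, kill_browser

    url = "http://quotes.toscrape.com/js"

    # Load the page once in headless Chrome so the JavaScript can render the quotes.
    browser = start_chrome(url, headless=True)

    # Parse the rendered page source with BeautifulSoup's built-in html.parser.
    soup = BeautifulSoup(browser.page_source, "html.parser")
    print(soup.title.text)  # quick sanity check that the parse worked

    # Each quote lives in a <div class="quote">, with the text in a <span class="text">.
    quotes = soup.find_all("div", class_="quote")
    for item in quotes:
        print(item.find("span", class_="text").text)

    kill_browser()  # shut the headless browser down when finished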
Info
Channel: John Watson Rooney
Views: 6,767
Rating: 4.9141631 out of 5
Keywords: python web scraping, scraping js websites, scrape javascript, selenium web scraping, python helium, web scraping with helium, render javascript, scrape dynamic sites
Id: onlQ7fL4ey8
Length: 5min 25sec (325 seconds)
Published: Wed Oct 21 2020