Check for THIS before Rendering that Javascript Site! Web Scraping with Chompjs


Captions
So for me, one of the most important things when you're trying to scrape a website is to actually do a good bit of recon and figure out where you're going to take the data from. If it's an HTML site, it's quite easy: you might just take all the HTML, parse through it and get out what you need. But when we're talking about JavaScript sites, it can be a bit more difficult.

The one I'm looking at here is the standard Quotes to Scrape site, the JavaScript version. What you might be thinking is that you need to render this page. But if you actually take a look at the source, you'll see right away that when we scroll down we have this script tag with the information in it, and it looks somewhat like JSON data. This is quite common, I've definitely seen it on a lot of sites, and you can just go straight to it and grab it, because what we see in the source code is what we'll get back when we hit the page with our request. A good way to test whether the information is there, if it's a more complicated site (which it probably is), is to copy a specific part of the text and see if it comes up in the source code, like this. We'll see it comes up right here, so we know that it's there.

So what we're going to do now is use a library called chompjs, which will allow us to easily get this data out. This is the GitHub page for it. It shows you some examples, how to import it, and it shows you a Scrapy example here, which is quite cool, although in this example we're not going to be using Scrapy, we're just going to be using requests-html.

So let's get started. As I said, I'm going to be using requests-html, so: from requests_html import HTMLSession, and then s is equal to HTMLSession(), which just allows us to access the page nice and easily. Then we want to say what our URL is. I'll just go and
grab that from here, the JS version. Then we want to do r, for the response, is equal to s.get, because we're using s for our session, and then the URL. Now if I go ahead and print r.html.html, we're going to get the whole HTML page back, and on this simple website we should be able to scroll up and see that we are indeed getting this script tag data within the response. This is important, because it means that we can use chompjs to get this data for us, and it will turn it into a JSON object that we can then manipulate.

If you haven't got chompjs installed, you'll need to pip install it. I've already done that, so I'm going to go ahead and import it: import chompjs. If we head back to the documentation, we can see that we need to use chompjs.parse_js_object. It also shows you here a script_css variable, which is just a CSS selector that lets us access the right script tag. I'll show you how to use that just now.

When we go to our source code, we can see that we have the script tag, and the data is preceded by var data =. We can use that to make sure we get the right bit of information. So I'm going to say that script_css, just like the example we just looked at, is equal to script, for the script tag, then :contains, which is a CSS selector, then open bracket, a couple of quote marks, and inside them var data. This matches what's in the page source, so it's important that it's reflected in our code.

What we can do then is say that our script_text (again, this is the part of the example that we're replicating, except we're not going to be using the regular expressions in this one because we don't need to) is equal to r.html.find, and we're looking for script_css, which is our CSS selector that we've put here. I've just split it out so it's a
bit easier to see, and first is equal to True so we don't get a list back. With requests-html, r.html.find always returns a list of elements, and we only want the first one in this case. If I run that to check that there are no errors, we can see that there's no output, but that's all good. So let's print out script_text and see what we get. This should return the element for us, and we can see that it has indeed returned a script element, so now we know we're in the right place.

The next line of code that we want is our chompjs.parse_js_object, I think it is. So come back to this code here. I know this is a Scrapy example, but it's okay, we're going to copy this bit just so I don't have to type it out. We'll put that in there and have a quick look at what it is. We're saying that our json_data variable (we don't need to import json, because chompjs will do everything for us) is equal to parse_js_object, and we're giving it our script_text. There's one difference between this example and the Scrapy one on the website: when we printed script_text just now, it returned the element for us, not the actual text. So we just need to put .text on the end here, because we want the text within that element, which is the data we were looking at earlier, as opposed to the element itself.

Now if I run print(json_data), we should get back all of that information from the top of the page, and we can see there it all is. What we've done is use chompjs to turn this script code up at the top here, the one that's got the data in it, into an actual JSON object within our Python code. That's really cool, it's really useful, and we can then manipulate it in any way that we want. It's all in lists and dictionaries.
So if we take a look at this one, for example, we can see that this is a big list and the next thing is a dictionary. If we follow that along, we can see that it ends about here and the next one starts here. So we can loop through all of this and just pull out the data that we're after, or we could take all of it if that's what you wanted. Let's do a quick for loop so we can tidy this up a bit: for quotes in json_data, and we will print quotes. I've forgotten what it is already, so let's just print that for now so we can find it again. Then let's just print the text. I think we should be able to get it just like this, because we can index it like a dictionary. And there we go, all of the text information.

So that's it, guys. Hopefully you found this useful. This is a really easy way to get data out of a website if it's presented to you in that format. Definitely check for that when you're looking at a site to scrape, just to see if it's in there, and then use this library to get it out. I'll put the link for chompjs down below so you can see that it does a few other things as well: you can work with JSON Lines, and you can do all sorts of escaping of characters. By default it pulls out the first square bracket or curly bracket, the first list or dictionary, so that's the best way to use it.

Hope you found some value in it. If you have, let me know down below, hit that like button and consider subscribing. I've got loads more content on my channel already for web scraping and similar subjects, and there is more to come too. So thank you very much, guys, and I will see you in the next one. Goodbye.
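The recon check recommended in the video (copy a piece of visible text and see whether it shows up in the raw source) can also be done in code. The sketch below uses only the standard library. The names ScriptCollector and locate_snippet, and the sample HTML, are hypothetical and for illustration only. In practice raw_html would be the unrendered response body, for example r.html.html in the video's setup.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> tag in a document."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def locate_snippet(raw_html, snippet):
    """Return 'script', 'markup' or 'absent' for where the snippet is found."""
    if snippet not in raw_html:
        return "absent"  # not in the raw source, so rendering may be needed
    collector = ScriptCollector()
    collector.feed(raw_html)
    collector.close()
    if any(snippet in script for script in collector.scripts):
        return "script"  # embedded data: a candidate for chompjs
    return "markup"  # already in the HTML: ordinary parsing will do


# Made-up page with its data embedded in a script tag.
SAMPLE = """<html><body>
<h1>Quotes</h1>
<div id="quotes"></div>
<script>var data = [{"text": "First example quote."}];</script>
</body></html>"""

print(locate_snippet(SAMPLE, "First example quote."))
print(locate_snippet(SAMPLE, "Quotes"))
print(locate_snippet(SAMPLE, "not on the page"))
```

A result of 'script' suggests the data can be pulled straight from the script tag without rendering the page, which is the case the video walks through.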
Info
Channel: John Watson Rooney
Views: 1,986
Rating: 4.9669423 out of 5
Keywords: chompjs, data in script tags, json in script, scraping dynamic websites, web scraping, python, learn web scraping, web scraping techniques, data extracting from websites
Id: VKI69VF8Exk
Length: 8min 30sec (510 seconds)
Published: Wed Feb 17 2021