Python Web Scraping: JSON in SCRIPT tags

Video Statistics and Information

Captions
Hi everyone, and welcome. Today's video is going to be about how to extract JSON data from inside a script tag on an HTML page. So this is the page that we're going to be using: it's student accommodation, and I've picked London. If we have a look, you can see that the accommodations are all loaded up like this, and if you go to view source you will see that there is no usable HTML data for us to actually scrape out using BeautifulSoup. But what we do have is this long list here of what looks like JSON data. We can see that it's got lots of the information that we're interested in: the address, the area of the city, postcode, picture links, etc., and the value. So what we want to be able to do is use Python to extract this information and parse it with json, just to get out what we want or to make our own data set.

So the first thing we need to do is get over to our text editor and copy the URL, so I'll get this. We're going to need to import a few different libraries: we're going to import requests to go out and get the information, then we're going to import json because we're going to need that to work with the data, and then we are going to do from bs4 import BeautifulSoup, because we're going to use that as well to parse the HTML. That's all good. Right, so we'll set the URL, that's a nice long one, there we go. What we need to do now is r is equal to requests.get of the URL, to go and get the data. So now if we do print, I think it's r.status_code, we're getting a 200, which means we are connecting fine.

OK, so what we want to do is create our soup variable and pass that information into BeautifulSoup so we can extract the data from that script tag. So we're going to do soup is equal to BeautifulSoup, and then r.content, and we're going to specify the html.parser like this, as I tend to always do. Let's print something out so we know that we are in the right place; I tend to just do the title or something like that. Okay, now we know
that we're on the right page, what we need to do now is just a bit of recon: we need to find out where this information is and how we plan to get it out. So the first thing we need to do is go back to our source code, and we can see here that it's inside a script tag, but there's nothing else that defines what this script tag is. There's no ID, and it's not in a div or anything like that. In these cases the easiest way to do it is to literally count down how many script tags in you are, and then use an index when we do find_all with BeautifulSoup. So I'm going to start up here, and I see this is the first one; one, that's closed, two, that's the third one, that's the fourth one, and this is the fifth one. So this is the one that has our data in it. We need to use BeautifulSoup to find all the script tags, index out the fifth one, which would be number four because it's zero-indexed, and then get the information out. So we're going to do script is equal to soup.find_all, and we're going to look for the script tags, and we're saying that we need number four. If we now go and print that out and see what we get, you can see right away that we are in the right place and we are getting back all of this information.

So that's great, but the problem is that this here has got a lot of extra data around it, which means we can't just dump it straight into the json library in Python and get the information that we want. What I tend to do is copy all of this, so let's copy all of it, all the way down, everything inside the script tags, and put it into an online JSON formatter. This is the one I use, because it will tell you what the problems are. If we paste this in, we can see right away it's saying that there is an error and it is expecting a string or blah blah. This means that if we try to load this in as a JSON object in a Python script, it will just fail. So we need to change this string up a bit before we can load it in. So the first
thing I can see here is there's a lot of whitespace at the beginning. That's fine, we can get rid of that nice and easily with .strip. First of all we need to do .text, sorry, just so we get the text from this, and then we're going to do .strip. If I make this a bit bigger and come up to the top: the .text is going to remove the script tags, and the .strip will remove the leading whitespace. So let's run that again. Okay, so now we are just down to this, so that's good.

So we go back to our formatter, and we've got rid of the leading whitespace, but we're still not quite there yet. What we need to do is basically chop off the beginning, and anything at the end, of this string so that it passes the JSON parser and we can get that information. So what we want to do is count how many characters, including whitespace, we want to get rid of at the front of our string. JSON will always start, if you look here, with something like this, and we can see that the first thing that we match is the bracket there, so we want to get rid of everything before this bracket. I've just counted this out and I think it's about 55, so what I'm going to do is put an index on the text, sorry, a slice on the text, and say remove the first 55 characters from this string. You can see here 55 and a colon, which means start 55 characters in. So let's run this again and see what we get. Okay, so I'm not quite there yet, I've still got a few left. So let's say 55, and a bit of whitespace, 56, 57, 58, so let's go for 58 and then go again. That's great, there's nothing before our leading curly bracket there. So now we know we can get rid of all of this at the start. Now if we try and validate again, we're getting an error at the end, so we can see that it should end with the curly bracket, not this little
semicolon. Well, that's nice and easy, we can apply the same method. We're taking 58 off the front, and if we do minus 1 as the end of the slice, that means we're going to leave one off at the end, so we come in one from the end. If we run this again, you can see now the semicolon's gone and there's nothing at the start. So if we come back here and get rid of this semicolon and validate (sometimes this doesn't work, so we need to copy it, delete it and paste it in again), there we go. So this is telling me that, now that we've cut our string down to this, we can pass it into the json library in Python and then extract information from it as we would normally.

So let's do that now. Let's move this down. Let's call it JSON object, actually no, let's call it data, is equal to json.loads, with an s because we're loading a string into it, and we pass in script like this. Now if we print our data we should get exactly this back again. There we go. So now we basically have a JSON object loaded in and saved into our data variable, and we can go ahead and manipulate it as we would normally.

So we'll just come back here, and we can see that inside our main bracket we've got properties, which is where we want to be, and then listings, then groups, which then becomes a list, and then results, and then property. So we need to go all the way through this first. If I just quickly do that: we want properties, then it was listings groups, nope, can't spell, listings groups, and then zero for the zero index, and then results should give all the results. Then if we pick the first one, that is essentially the first one here, which is this, and that's all the information in it. So you could go even further and get just the addresses out, or you could go and just get, say, the postcode and then the price. You can create your own data set, or you could scrape this
every day and see if any new properties come up, or something like that.

So that's how I would go about approaching this. We can use BeautifulSoup to find the script tag, and we've counted how many script tags down it is because there was no ID; if there's an ID you can find it that way. Then we're basically just removing characters from the end and the beginning of the string to make it into valid JSON, so that we can then load it with json.loads and go through it that way. I always find that the online parsers are really useful, so I'll leave a link to the one that I use, and I'll also leave a link to a couple of my other videos where I explain some of the other concepts that we've used in here that I've probably glossed over really quickly. So hopefully this has been helpful to you guys. Let me know in the comments any questions or queries, give it a like if you liked the video, and consider subscribing to my channel: there's more web scraping content and there is more to come. Cheers guys, bye.
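
The trimming and loading steps described in the transcript can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code shown in the video: `script_text` is a hypothetical stand-in for what `soup.find_all("script")[4].text` would return, and the variable name `window.__EXAMPLE_STATE__`, the exact key nesting, and all values are invented so the example runs without a network request. The second approach (searching for the first `{` and last `}`) is a variation that avoids counting characters by hand; it is not what the video does.

```python
import json

# Hypothetical stand-in for the text of the fifth <script> tag.
# The assignment name, key layout and values are made up for illustration;
# the real page's payload will differ.
script_text = """
        window.__EXAMPLE_STATE__ = {"properties": {"listings": {"groups": [{"results": [
            {"address": "1 Example Street", "postcode": "N1 1AA", "price": 150},
            {"address": "2 Sample Road", "postcode": "E2 2BB", "price": 175}
        ]}]}}};
"""

# Approach from the transcript: strip whitespace, then slice off a counted
# prefix and the trailing semicolon (the video uses 58 and -1 for its page).
stripped = script_text.strip()
prefix_length = len("window.__EXAMPLE_STATE__ = ")  # counted by hand in the video
data = json.loads(stripped[prefix_length:-1])

# Variation (an assumption, not from the video): keep everything between
# the first "{" and the last "}" so no character counting is needed.
start = stripped.index("{")
end = stripped.rindex("}") + 1
data_alt = json.loads(stripped[start:end])

# Walk down the nested structure to the list of results, then pull out
# just the fields of interest to build a small data set.
results = data["properties"]["listings"]["groups"][0]["results"]
rows = [(item["postcode"], item["price"]) for item in results]
print(rows)
```

A hand-counted slice like `[58:-1]` is likely to break if the site changes the text before the opening bracket, which is why searching for the brackets may be the more forgiving choice when scraping the same page repeatedly.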
Info
Channel: John Watson Rooney
Views: 13,656
Rating: 4.9612904 out of 5
Keywords: python web scraping, web scraping with python, json in script tags, web scraping, learn python
Id: QNLBBGWEQ3Q
Length: 10min 13sec (613 seconds)
Published: Thu May 28 2020