Scraping Data from JavaScript rendered tables with Python

Video Statistics and Information

Captions
Quite often you might come across tables that look like this online, and you want this data out for whatever reason. Maybe you want to create your own data set and do some analysis, or something along those lines. The first thing you might do is view the page source, and that's not going to help us at all. There's no actual information in there, because the data is probably being sent to our browser by JavaScript.

The next thing you might do is think, okay, maybe I can render this page out. So I'm going to go to the inspect element tool and have a look at the first one. You can see here there are a lot of divs with class names that are quite convoluted, and they might be dynamic, so they might change a lot. You could possibly do it this way, but it's going to be a lot of work and it's probably not the best way.

There's a reason why I always go on about this method. If you just take a moment to go to the Network tab, click on XHR, and hit reload, we're going to see straight away that there is a lot of useful information coming back directly that we can see and intercept. These are the three that interest me the most. You can see the type is JSON, and you can see the size of them as well, which indicates to me there's a good bit of data in there. I've looked at this already, so I'm going to click on this one and then go to Response, and there we have a load of information for all of the teams that are available, in its entirety. You can see it's 320 bits of team information there, and this list only goes up to 113, so we get everything.

Let's go and find the team that is at the top at the moment and check what information we get. If we go down and find it (where does N come in the alphabet? somewhere around here), there we go. We can see that you do get the rank information: rank, club rank, power rank, and the power points here, 1513, which is what we're seeing right there. You also get a load more information, like where they're from and so on, and even some of the active player information, which is just down here.

What's interesting about this is that if you come across to the Headers, you can see that there is no mention of a cookie at all. That's really uncommon. Normally, when you make this sort of request, your browser cookie is sent along with it, and that's how they verify that you're actually looking at it this way. But we don't have that here, so you can just go ahead and copy this URL and put it into a new browser tab, and we get all of the information that we saw here loaded up just from hitting this API endpoint. And there's the team information we were just looking at. So this is the best and easiest way to get the information out: we basically have access to that API with no authentication needed.

So what can we do to get this into our Python code and extract it that way? All you need to do is copy the URL and come over to your new Python script. I've already pip installed requests, which is what we're going to use. We're going to write r = requests.get(url), and we're going to make a new variable, url, for the URL itself because it's nice and long. Save that so PyCharm doesn't have a fit at me for not formatting my file properly, and then we're going to do print(r.json()). It's that simple: that's three lines of code right there. I'm just going to run this, and we get all of that JSON data that we just looked at in our browser back here. You can see there's a lot of information, and we can see there are some points and rankings and so on.

Now, where would you want to go from here? If you actually wanted this information to be stored in some way or another, you might want to transform it by extracting just the bits that you're after. Maybe you just want the points, the rank, or the team names; you could access those and then put them into a pandas DataFrame. Or, if you wanted to store this information in its entirety, you could use something like MongoDB, which is a NoSQL database that we'll be covering real soon, and just take each and every item, with its id, and store that information. You can see over here we have items (I can't make this any bigger... oh yes, I can), and then the id, so this would be a whole entry. I'm going to cover that in an upcoming video, where I'll be looking a bit more into MongoDB.

If you're interested in scraping like this, you can click on this video here, which goes into more detail. Or if you're looking for more database stuff, this one right here goes into more detail on how to use an SQLite database, which I think is more important at this stage than learning a new thing like MongoDB.
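The steps described in the captions (request the JSON endpoint found in the Network tab, then keep only the fields you are after) might look roughly like the sketch below. The captions never spell out the real URL or the exact key names, so `API_URL`, the `items` key, the field names `name`, `rank`, and `points`, and the sample entries are all placeholders or assumptions to be checked against the actual response. The helper names `fetch_json` and `extract_rows` are hypothetical as well.

```python
import requests

# Placeholder only: paste the URL copied from the browser's Network tab here.
API_URL = "https://example.com/api/teams"


def fetch_json(url, timeout=10):
    """Request the endpoint and return the decoded JSON body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()


def extract_rows(payload, fields, items_key="items"):
    """Keep only the requested fields from each entry under `items_key`.

    The key names are assumptions about the response layout.
    A field missing from an entry becomes None instead of raising KeyError.
    """
    return [
        {field: item.get(field) for field in fields}
        for item in payload.get(items_key, [])
    ]


# Made-up sample shaped like the assumed response, so the transform
# can be tried without touching the network.
sample_payload = {
    "items": [
        {"_id": "a1", "name": "Team A", "rank": 1, "points": 1513},
        {"_id": "b2", "name": "Team B", "rank": 2},
    ]
}

rows = extract_rows(sample_payload, ["name", "rank", "points"])
for row in rows:
    print(row)
```

With the real URL in place, `extract_rows(fetch_json(API_URL), [...])` would run the same transform on live data. Because `rows` is a plain list of dictionaries, it can be passed directly to `pandas.DataFrame(rows)` if a DataFrame is what you want.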
Info
Channel: John Watson Rooney
Views: 3,164
Keywords: python web scraping, scraping javascript tables, web scraping, scraping tables, js tables, web scrapping
Id: qxj7EXYeNls
Length: 5min 3sec (303 seconds)
Published: Tue Sep 07 2021