How to scrape SPORTS STATS websites with Python

Video Statistics and Information

Captions
Hi everyone, welcome, John here. Today's video is going to be a demo of scraping the information from a stats website, and we're also going to talk a little bit about how to work out which method of scraping is best for you. I got sent this website by a viewer, so thank you very much for that. It's a stats website for the AFL.

If we go to View Source, like we might do to start with, we see that there is not much here. None of the actual data is here, just some of the surrounding stuff, so that's no use to us: we couldn't use requests and Beautiful Soup to get this information out. If we scroll down, we can see there are 827 total results over multiple pages, so you might think Selenium would be a good choice. You could load up every page and click on these buttons to get the information. But there is a better way. I've touched on this in a different video, and this is now my go-to first port of call for scraping websites like this, which are clearly not an HTML table.

What you want to do is go to Inspect Element, head over to the Network tab and click on XHR. Now if you refresh the page, it's going to give us all of the requests that the website is making to get the information. If we look down here we can see the type is JSON. Straight away I can see, and I think, that this data is coming from an API, which is probably not accessible by us. Having said that, when we look at these GET requests, there's this one here, which is cds202014. We click on that: it's the player stats from Stats Pro for the season, and this is obviously the current season. We click on the Response and we can see it has been truncated, which is fine, but this seems to be all of the player data. Look, we've got the name, surname, height, weight, kicking foot (that's interesting), the team and the games. This person hasn't played any games, so sucks to be them, so
they've got no stats. But this is really interesting, because this website has called this API to load all this data. What we can do is mimic that GET request and get that information ourselves in VS Code, and download this JSON nice and easily.

To do that we're going to use a program called Postman, which is for testing APIs and replicating API calls. I'm going to go here, click on the request that we wanted, go to Copy and choose Copy as cURL (Windows). That's important. Now we're going to minimize out of that and go to Postman. If you haven't downloaded Postman, you can go to the website, postman.com, and download it, and once you get going it will look like this. Instead of clicking New, let's click Import, go to Raw Text and paste in that cURL request that we got. If we click Continue and Import, it opens in a little tab here, and it shows us the URL, that we're doing a GET request, and the headers that we are sending with that request.

If we look at some of these headers really quick, we can see that we are sending a user agent for Firefox, that we are looking for JSON, the language, the referer, which is the website, the content type, and a media MIS token. This is going to be quite important, because without this token we're going to get rejected when we send this request. So let's hit Send, and it comes back and returns the JSON data which we just saw in the browser. What we've done is basically skip the browser scraping and go directly to the API endpoint where the actual website was getting the information from.

As I said, this token is really important. If we untick this token, which means we're not going to be sending it, and click Send, we get "access to this site is forbidden". So we're going to re-tick that, and we can see that we get all the
information back. What's really cool about Postman is that we can click on this little button here that says Code. It gives us all the options for the code snippet, and the one that I've got highlighted already is Python, Python requests. Any of you that have done any kind of web scraping will know some of this, and it'll look pretty familiar. So I'm going to go ahead and copy this, minimize Postman and paste it right into VS Code.

This looks pretty good straight away. For some reason it gives us an empty payload dictionary, I'm not entirely sure why, but we can see we've got all of the other headers that we wanted, including this token, which we need. I'm not going to run this right now, because if you do, it's just going to print out a load of data at the end. I'm going to remove this print statement and change this request a little bit, just so it's a bit tidier and a bit more how I like it. Instead of response I like to call it r, and we don't need to do requests.request with a GET here; we can just do requests.get and remove that, and we can remove this as well (the payload), because there's nothing in it.

Okay, so to get our response, which we know is going to be a load of JSON data, because we can see it here where we ran it in Postman, we're going to need to import json. So let's do that now, import json, and let's call it playerdata is equal to r.json(). All that's going to do is take the JSON response from our r variable and load it into our player data, so now we have a JSON dictionary within playerdata. I'm going to quickly run that and check that I've got no errors. We don't, so that's good. Let me make this a bit bigger so we can all see nicely, and I'm going to do python3 -i and then the Python file. It's python3 for me; it might just be python for you. So let's check some of our data. Let's go
playerdata.keys() and see what keys are available for us in here. We can see we've got search, total results and players, so that is these ones here: search, total results and players. If I did playerdata and asked for the total results key, we've got the value, which is 827.

To make our lives a bit easier, I'm going to download the response and save it as a file, just so we can open it in VS Code and look at them at the same time. So let's open that up. Okay, so this is the total response, and if we go back to our website and get rid of that, let's search for this guy. Okay, so we've got him there, and we can see that he's played nine games and has lots of stats. I'm not going to confess to know what all of these stats are or what they mean, as I don't know the sport particularly well, although I did watch some videos of it earlier and it looks pretty brutal.

Okay, so now we've got this open here, we can see there's a lot of information, but let's go back to the top for the information that we're probably most interested in. We don't want search, we don't want total results, we do want players, so we want to get all the player information. If we go playerdata and players, and then zero for the first one, we can see we've just got all of the information for the first player on the list, which is Jazz McLennan, this guy who hasn't actually played any games, so there are no stats for him.

What we want to do now is get this data and put it into a format that we can easily analyze or manipulate, and that generally is Excel or CSV. The best way to do that is always to use pandas, in my opinion, but the JSON is all nested. If we look back at it here, it starts off with the list, and then we've got things indented and all sorts of different things. We can't just load this straight into pandas, but what we can do
is use pandas to do json_normalize for us. What that will do is take everything, expand it out and make it into an easier chunk of data to see. So we're going to take our player data. We're going to import pandas as pd, as we always do, and then underneath here we're going to do df, for our data frame, is equal to pd.json_normalize. As I said, that's going to take our JSON, flatten it all out and normalize it, so we can see it all in a data frame. Then we're going to put in playerdata, but as I showed you down here with the keys, we don't want the search or the total results key, we only want the players key, so I'm going to put that in there like this. If we save that and run it, we should get nothing: no output and no errors. Okay, that's great.

Now let's do print df.head() to get the first five results, just to check that it's all working properly. We can see that we have got the top five rows and 134 columns. It looks like that has split out games played, the name and all of the individual stats, which is what we want. The last thing we want to do is df.to_csv, and let's give it the name playerdata.csv, and I'm going to do index equal to False so we don't get the pandas index, zero to however many there are. We're going to run that, and that's worked.

Now we can see we have this playerdata.csv. It's kind of hard to see in VS Code, so I'm going to do Reveal in Explorer and open it in Excel, and make it a bit bigger. We can see that we have got all of this data here: player ID, games played, name, height, everything, so we could easily analyze this data. We've got 134 columns. Let's have a quick look and see who is the tallest player. Let's sort largest to smallest, so these two
guys are both 211 centimeters tall. Crikey, that's really tall. And we can look at who's played the most games: all these guys have played the full 12 games, by the looks of it. So you could do all sorts of data analysis on that.

So that's it, guys, hopefully you found this really useful. I'm just going to recap real quick. Make sure you go to the Network tab in Inspect Element, check XHR, refresh your page and see what you get. You might just get lucky and find the JSON data. Copy that as cURL and put it into Postman, where you can manipulate requests. This one gave us all 827 results in one go. Sometimes they're paginated, but that'll be in the headers somewhere, so go ahead and change it to the one you want: page one, page two, page three, or results 1 to 100, 101 to 200, etc. Then you can copy that code out, put it into VS Code and get it working. The last thing is to make sure that if your JSON data looks like this one, you do pd.json_normalize, and we chose the players key.

So that's it, guys, hopefully you found this useful. I thought this was a really cool website and a great way to get all of the player data out for this specific season. This will work for other websites as well, if they look like this one did. Let me know how you get on. Any more questions, write some comments, hit that like button and subscribe for more web scraping content to come, and also check out my previous videos for extra web scraping content. Cheers, bye.
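
The workflow described in the captions (request the JSON endpoint with the copied headers, take the players key, flatten it with pandas, write a CSV) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code shown in the video: the function names, the sample payload and every field name in it are hypothetical, and the real URL and token header are not given in the captions, so the fetch function is defined but not called. The sample data only stands in for the nested shape the response is described as having.

```python
import pandas as pd


def fetch_player_data(url, headers):
    """GET a JSON endpoint found in the Network tab and return the parsed body.

    `url` and `headers` (including any token header) are whatever the
    browser request copied as cURL contained; none are hard-coded here.
    """
    # requests is third-party; it is imported here so the offline demo
    # below still runs where it is not installed.
    import requests

    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()  # raises on 4xx/5xx, e.g. a forbidden response
    return r.json()       # requests decodes the JSON body itself


def players_to_frame(payload, key="players"):
    """Flatten the nested records under `key` into one row per player."""
    return pd.json_normalize(payload[key])


# Hypothetical stand-in for the real response: three top-level keys,
# with nested dictionaries inside each player record.
sample_payload = {
    "search": {},
    "totalResults": 2,
    "players": [
        {
            "playerId": "P1",
            "gamesPlayed": 0,
            "playerDetails": {"givenName": "Alex", "surname": "Example", "heightCm": 211},
            "totals": {"kicks": 0},
        },
        {
            "playerId": "P2",
            "gamesPlayed": 12,
            "playerDetails": {"givenName": "Sam", "surname": "Sample", "heightCm": 185},
            "totals": {"kicks": 140},
        },
    ],
}

df = players_to_frame(sample_payload)

# Nested keys become dotted column names such as "playerDetails.heightCm".
print(df.head())

# With no path, to_csv returns the CSV as a string; passing a file name
# such as "playerdata.csv" writes a file instead.
csv_text = df.to_csv(index=False)

# The same "who is tallest" check that the video does in Excel.
tallest = df.sort_values("playerDetails.heightCm", ascending=False).iloc[0]
print(tallest["playerId"])
```

With a real endpoint, the result of fetch_player_data(url, headers) would take the place of sample_payload. Note that r.json() comes from requests itself, so a separate import json is not strictly needed for this step.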
Info
Channel: John Watson Rooney
Views: 17,968
Rating: 4.9846153 out of 5
Keywords: How to scrape SPORTS STATS websites with Python, web scraping with python, python web scraping, learn python, python json, python api endpoints, python apis, scraping stats, web scrape sports data
Id: SEQjNEawceo
Length: 12min 53sec (773 seconds)
Published: Thu Aug 20 2020