How to Scrape SofaScore for Free Football Data (Updated Method)

Captions
SofaScore is one of the most popular websites for getting football data, and if you've seen my previous video on scraping SofaScore, you know it used to be a very simple and easy way to gather data via their API. Well, that is no more, because they have changed a couple of things in their API, so I'm making this updated video to show how we can still extract that data, even though we have to use a couple of workarounds and do something a little hacky to get it.

A brief summary of how we get data from SofaScore: they have all of this data on their page, but they load most of it through APIs, by calling and fetching that data from an API. To see which APIs are being called, right-click the page and choose Inspect (if you're in Google Chrome), then switch over to the Network tab. I like to click the Fetch/XHR filter, and then you can scroll through the requests as you click around on the pages; if we go to the statistics page, for example, you can see the request for this shot map.

What SofaScore has been doing, whether intentionally or unintentionally (I don't know if they've seen a lot of people scraping, or have just changed how they use their data), is randomizing the data returned from the API, in an effort to make scraping more difficult. You would probably need some sort of token, or knowledge of some internal tool, to hit the API from straight Python code. So instead we are going to use something called Selenium, which lets us mimic a Chrome browser, and then use that Selenium-driven Chrome to read the network responses for the APIs being loaded in the page. That's how we're going to get around this for now, and it's a good way to learn both how Selenium works and how to solve problems in web scraping.

To do this we need to install a couple of packages. Open a terminal or command prompt, depending on what OS you're running, and install the two packages we need. The first is Selenium: pip install selenium, then two equals signs, and we'll be using 4.2.0. If you already have it, pip will just say "requirement already satisfied"; otherwise it will install it for you. The second is webdriver-manager, which is what lets us actually run the Chrome browser: pip install webdriver-manager==4.0.1. You also need to make sure Chrome itself is installed, otherwise this won't work.
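For reference, the two install commands from the video:

```
pip install selenium==4.2.0
pip install webdriver-manager==4.0.1
```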
The next step isn't strictly necessary, but there are things called ChromeDrivers, which I usually download just in case; chromedriver (Chromium's driver) is what allows Selenium to run Chrome from Python code on your desktop. I'll put the link in: go to chromedriver.chromium.org, follow the JSON endpoints page, and find the version of Chrome you're running, so 124.0, which I think is their most recent one. They list downloads for each system: I'm running a Mac with an ARM chip, so I would download the mac-arm64 build; if I had 64-bit Windows I would download the win64 one, and on Linux the linux64 one. You most likely won't need this, but just in case, it's a good thing to do.

What we're going to scrape is this page right here. It's a game from a couple of weeks ago, the one where Messi had five assists in a single game, and what we're going to extract from it is this shot map.

Let's hop over into a Jupyter notebook (I'll zoom in a little so you can see) and start by importing our packages. The first one is json: import json. Then from selenium import webdriver, then from selenium.webdriver.chrome.service import Service as ChromeService, and finally from webdriver_manager.chrome import ChromeDriverManager. Go ahead and run that; this is why we needed to pip install those packages, because we need selenium and webdriver-manager.

To boot up and start our web driver, we need to set a couple of settings, and then we can create our Chrome instance for testing and web scraping. Say options = webdriver.ChromeOptions(), then options.set_capability(). The first argument has to be typed a little weirdly: it's the string "goog:loggingPrefs", so "goog" and then loggingPrefs with a capital P in the middle. The second argument is a dictionary with the key "performance" mapped to the value "ALL", all capitalized, and a second key "browser" also mapped to "ALL". Close the dictionary off, then close the parentheses.

Now we need to actually create the web driver, and we use a combination of both Selenium and webdriver-manager for this. What webdriver-manager does for us is download the necessary chromedriver, based on what version of Chrome we have. Say driver = webdriver.Chrome(), and inside it, service=ChromeService(ChromeDriverManager().install()), then a comma, then options=options.
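Putting those steps together, the setup cell looks like this (a minimal sketch of exactly what's described above):

```python
import json

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# Turn on Chrome's performance (network) and browser logs so we can
# read the API responses the page loads.
options = webdriver.ChromeOptions()
options.set_capability(
    "goog:loggingPrefs", {"performance": "ALL", "browser": "ALL"}
)

# webdriver-manager downloads a chromedriver matching the installed Chrome.
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=options,
)
```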
Now we just run this. It takes a second to go and install, and then you should have a new Chrome window pop up. This is your automated Chrome; you can tell because it says "Chrome is being controlled by automated test software" when it first loads. That means we can now use Python to manipulate this browser and go to different pages, which is really the power of Selenium for automated web scraping.

Now that we have that set up, we need to tell Selenium which page we actually want to go to. Come back to this Inter Miami vs. New York Red Bulls match and grab the URL. The way SofaScore works, they use the match ID in the URL to tell it which page, and which APIs, should be called to load the data. As you can see, this one is 11911622, and if we look at the shot map request, the Headers tab shows a sofascore.com API URL containing that same 11911622. So the match ID tells it to go get the data for that specific match.

Back in our Jupyter notebook, copy that URL, and then we'll tell the driver which page to go to. The first thing we want to do is tell it not to spend longer than a couple of seconds loading a page, because there are pages that load JavaScript in what amounts to an infinite loop of loading, and we don't want that. Say driver.set_page_load_timeout(10), so 10 seconds. Then say try:, and inside it driver.get() with the URL in quotes; then except:, and just pass. After that, say driver.execute_script(), and in quotes "window.scrollTo(0, document.body.scrollHeight);", with the semicolon, then close the quote and the parenthesis. What this does is go to that URL, then execute a little snippet of JavaScript that scrolls all the way to the bottom of the page. That matters because some of the APIs only get called once you scroll down to them, so scrolling to the end loads all of the APIs on the page. If you run this, you can watch it happen: the page loads, then it scrolls all the way down to the bottom.

Now we'll use some Python code to extract all of these network responses and parse out the shot map one. Say logs_raw = driver.get_log("performance"). If we run that, it gives us all of the network responses, essentially every API that was called.
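As a sketch, the navigation and log-collection step looks like this (the full match URL isn't spelled out in the video, so the placeholder below is mine; copy the real one from your browser):

```python
# Paste the full match URL from your browser; the match ID at the end
# (11911622 in the video) is what ties the page to its API calls.
url = "https://www.sofascore.com/..."  # full URL elided on purpose

# Don't let a page that keeps loading JavaScript hang us forever.
driver.set_page_load_timeout(10)
try:
    driver.get(url)
except Exception:
    pass  # a timeout just cuts the page load short; the browser still works

# Scroll to the bottom so the lazily loaded APIs (like the shot map) fire.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# One entry per network event the browser logged.
logs_raw = driver.get_log("performance")
```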
Then we can parse those: say logs = [json.loads(lr["message"])["message"] for lr in logs_raw]. This starts to parse logs_raw; if you want to get into the nitty-gritty of what's in logs_raw, it's not very beautiful, but you can print(logs_raw) and see it for yourself. So now we have all of our logs, and these are all dictionaries; we can look at the first one with logs[0]. Each one basically describes what is being loaded through those network responses and APIs.

Now we're going to loop over every single one and find our actual shot map API. Say: for x in logs:, then if "shotmap" in x["params"].get("headers", {}).get(":path", ""):, print that same x["params"].get(...) expression, and then break at the end, because we're going to use the x value once it finds the shot map API.

Just so we can tell what's going on here: we are looping through every network response (that's what our logs are) and asking whether "shotmap" is in the params. The .get() is a safe get: we look for "headers" in the dictionary, and if it isn't there, .get() returns an empty dictionary instead of raising an error, so we can keep trying to fetch ":path"; and if ":path" isn't in that dictionary, it returns an empty string, otherwise it returns the value of ":path". That's how .get() works; it's just a safe way for us to reach into the structure.

If we run that, you can see the API come back, and there is our endpoint. We've stopped the loop on that endpoint, so x right here holds a bunch of information about that network response: all the different cookies, different headers, and different parameters that we can use.
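Here's that parsing-and-search step as code:

```python
# Each performance-log entry wraps a JSON string; unwrap it to the inner message.
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]

# Walk the network events until we hit the shot-map API request.
for x in logs:
    # Chained .get() calls are "safe gets": a missing key returns the
    # fallback ({} or "") instead of raising a KeyError.
    if "shotmap" in x["params"].get("headers", {}).get(":path", ""):
        print(x["params"].get("headers", {}).get(":path", ""))
        break  # leave x pointing at the shot-map event
```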
Now we run some code to actually extract that data out of our Selenium browser and into our Python code. Say shotmap = json.loads(driver.execute_cdp_cmd()). Inside execute_cdp_cmd, the first argument is "Network.getResponseBody", with a capital R and a capital B, then a comma, and then a dictionary where "requestId" is the key and x["params"]["requestId"] is the value; then index into ["body"] so it all lines up. That loads all the data in, and to get the shot map itself it's nested one more level, so index into "shotmap" as well.

Now that that has run, let's look at our actual shot map. We now have the real shot map data from the SofaScore API. Just to confirm they are not randomizing the data on this one: the expected goals on this Emil Forsberg shot was 0.78, because it was a penalty, in the 97th minute, and it says his location, the player coordinates, was 11.5, 50. Let's go verify that was actually the case: if we come into our shotmap, we have player coordinates 11.5, 50 and expected goals 0.78. It matches.

So that is how we can extract data from SofaScore and get around the mechanisms they've tried to implement for blocking. This basically allows us to continue to use SofaScore, which is a great website for scraping, and it lets us practice extracting data from APIs and collecting really good data that we can use for our own analysis. That is it for this video. This is a great way to get started with scraping, and if you want to learn how to scrape more metadata, go ahead and watch this video here on scraping Transfermarkt using more of these unconventional methods, to really bolster your web scraping skills.
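To recap, the final extraction cell from the walkthrough (the nested "shotmap" key is what the video shows in the response):

```python
# Ask Chrome, via the DevTools protocol, for the body of that response,
# then parse it; the shots themselves sit under the "shotmap" key.
shotmap = json.loads(
    driver.execute_cdp_cmd(
        "Network.getResponseBody",
        {"requestId": x["params"]["requestId"]},
    )["body"]
)["shotmap"]

# Inspect the result; the video checks Forsberg's 97th-minute penalty
# (expected goals 0.78, player coordinates 11.5, 50) against the page.
shotmap
```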
Info
Channel: McKay Johns
Views: 3,531
Keywords: coding, python, data science, programming, code, data analytics, sports analytics
Id: lvJqz2EZHY8
Length: 17min 15sec (1035 seconds)
Published: Wed May 15 2024