Scrape NBA Advanced Stats with Python! Selenium Tutorial For Beginners!

Captions
Hey everybody, welcome back to the next part of the web scraping tutorial in our Python for daily fantasy sports web series. Today we're going a little more in depth with the Selenium package, using the nba.com stat tables as a real-world example of how to navigate, interact with, and pull data from different web pages. If you haven't watched part one of the Selenium series, you're definitely going to want to check that out; I'll put a teaser card in the top right-hand corner of the video, because there are a few steps that go into getting your machine ready for Selenium that are a little more complex than just installing the Python package. Let's get to coding.

All right, so just like when we went through Beautiful Soup, the very first thing we're going to do is import all the modules that we plan on using. If you watched the previous Selenium video, you may recall that we did from selenium import and then imported a specific module, to make our code look a little cleaner and be more efficient. We're going to do that again: from selenium import webdriver is the first one. Then from selenium.webdriver.support.ui we're going to import Select, with a capital S, and that's going to allow us to select, as the name implies, different options on the page (in practice, choosing entries in a dropdown); that's what Select is going to be. We're going to need BeautifulSoup and we're going to need pandas.

There are a couple of other things I typically do when using Selenium. You can tell the remote driver to wait a certain number of seconds before checking for something, to give the page time to load, so you're not constantly erroring out saying it can't find things. With a wait, you can establish expected conditions, and the code will not progress until those expected conditions are met. That's a little more complex, and for the purpose of this tutorial, while we're going through cell by cell in Jupyter notebooks, it's not that relevant. But again, if that's something you're interested in, let me know down in the comments and I can make a more in-depth video on how to set those types of parameters up. For this video we're going to keep it simple. And of course, BeautifulSoup is one word.

First thing here, we need to establish that we're using Firefox as our webdriver: webdriver.Firefox. I think those are both capitalized... nope, just Firefox; webdriver stays lowercase. So that's establishing that driver is going to be our Firefox remote browser. If we go ahead and run that, we should have a Firefox instance pop up down here. There we go.

So we have our driver; now we need to establish the URL that we want to point it to. We're going to go to nba.com, and we want NBA stats, and then all player stats. You can go through and get all sorts of different stats here, whether you want leaders or the general stats. We have our traditional stats here, we can do our advanced stats, all kinds of things. We're going to use the general advanced stats, because I assume that if you want to continue with this web series, you're interested in maybe some machine learning and using advanced statistics to create predictive algorithms for how different players will perform in different scenarios. So we have our URL. We're just going to copy it, and we're going to make it a raw string, because there are some escape characters in there we don't want to deal with. Then we want driver.get(url). Now let's put this side by side with our Firefox window, and see, we come right to this web page.

Now, a couple of things to notice here, and we're going to go back to the website we had up instead of the Firefox one, because I don't want to mess with that. You've got a bunch of different things up here you can interact with that cannot be handled in Beautiful Soup. You can change the table you're looking at. Also notice there are 524 total rows here, and we're only looking at page 1 of 11, so we're only getting about 50 rows on one page. That's not good enough; we need all of them. But look at our URL up here: if we change the dropdown to All, the URL doesn't change. That means we wouldn't be able to handle this within Beautiful Soup. We have to come up with some way of getting to this web page, clicking on this dropdown, and changing it to All, or, if that's too big of a table, we need to be able to iterate through each one of these pages and pull the information off of it. None of that can be done with Beautiful Soup on its own.

So we've established what we need to do: identify this element and figure out how to change it on our web page. Let's come back over here to our code. We had driver.get(url), and that should be fine. I'm going to establish a select variable, and as you can see, I'm using Select with a capital S. We're using the Select class from the webdriver support ui package, so that's how you can clean that up: I don't have to type in the whole module path ending in ui.Select; I can just type Select with a capital S, and Python knows what I'm trying to do. So select equals Select, and then I need to pick that little dropdown, so inside it goes driver dot, and then, remember in our first video we did find_element_by_name; this time I'm going to use find_element_by_xpath, which is my preferred method. Oh, I didn't mean to hit enter there. We're going to have a raw string here, and I need to find the XPath of this dropdown, so I'm going to simply right-click and Inspect. This brings me right to a select element whose class is the stats table pagination, meaning what page you're on in the stats table, which is what we want. And if we highlight it here, you can see it's making that box turn green on the web page. So we need to select that. We right-click that element, Copy, Copy full XPath, and come back to our code and just paste it in. I'm going to zoom out a little so you can see, because that gets pretty big. We paste, close off the string, and make sure our parentheses are both closed.

So we've established what the select variable is. I'm not doing anything yet; I'm just declaring that when I call select with a lowercase s, that means I'm calling Select, uppercase, on that element on the page. If you look back at our Firefox window, the dropdown hasn't opened; I haven't selected anything. If I were to just run select, it would just give me back the webdriver support Select object. So I need to call it with something. If we look at this dropdown, it gives us a list of the different options we can choose, and notice that All is the first option. If we go back to inspecting the element (let me do it in Chrome so it's a little bigger), our first option is All. And the cat wants to join the video today, because she is the most important thing no matter what's happening in the world. So All is the first option in the HTML over here, and All is the first option in the dropdown when we open it up. And if you remember, from I believe our Beautiful Soup videos, I mentioned that Python indexes start at 0.
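The dropdown steps described so far can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact notebook from the video: the function names, the url and dropdown_xpath parameters, and the explicit wait are all additions made for this sketch. It also uses driver-independent locators (By.XPATH), because recent Selenium 4 releases no longer provide find_element_by_xpath, the call used in the video.

```python
def option_index(labels, wanted):
    """Return the zero-based position of `wanted` in a list of option labels."""
    return labels.index(wanted)


def show_all_rows(url, dropdown_xpath, timeout=10):
    """Open `url` in Firefox and choose "All" in the paging dropdown.

    Hypothetical helper: `url` and `dropdown_xpath` must be supplied by the
    caller (for example, the XPath copied from the browser's inspector).
    Selenium is imported inside the function so that option_index() above can
    be tried on a machine without Selenium or geckodriver installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.Firefox()  # requires geckodriver, as set up in part one
    driver.get(url)

    # Wait until the <select> element exists rather than failing immediately.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, dropdown_xpath))
    )
    dropdown = Select(element)

    # Read the visible option labels and pick "All" by its index.
    labels = [opt.text.strip() for opt in dropdown.options]
    dropdown.select_by_index(option_index(labels, "All"))
    return driver


# Option labels in the order they appear in the dropdown: "All" comes first,
# so its index is 0.
labels = ["All", "1", "2", "3"]
print(option_index(labels, "All"))
```

Looking the index up from the option labels, instead of hard-coding 0, is a small safeguard in case the page ever reorders the dropdown.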
So the first option is going to be zero, the second option is going to be one, and if we want to select All, we need to select the zeroth index in that list. Let me make this a little bigger while I type it out. That's going to be select.select_by_index(0). What that's doing is calling my select, and then, just like we did find_element_by_xpath, we're doing select_by_index. Now let's pull up our Firefox window that's being remote controlled and see if it does what we want. We run it, and sure enough, it's changed to All. Now if we scroll down, we should have all 523, or however many there were... 518, and there's one, two, three, four, five, six more, so yeah, about 523 or 524 total records in this table.

Now that we have all of the records in one table, we can switch over to Beautiful Soup and start bringing that table down into pandas. This should start to look very familiar, because it's basically what we did in the Beautiful Soup videos. If you haven't seen those, or you aren't familiar with Beautiful Soup from another source, definitely go check them out; I'll throw a teaser card up in the top right. It's a quick overview going over the basics, kind of like this one is for Selenium.

Basically, we need to establish the page source for Beautiful Soup to look at. That means taking all of that HTML, with all of those records live in the table, and pulling it down. Then we need to define a parser: BeautifulSoup, looking at the page source we just defined, and we're going to use the lxml parser again. Then we need to define the table, so we're going to parse through that page and find that table. Let's take another look at the table. We inspected the element on our dropdown menu to get the labels we needed; now let's inspect the table. That didn't give us anything, so let's right-click inside the table and Inspect. Okay, there we go. An easy way to do that, if you're not sure where the table begins and ends on the page: click one of the data values inside the table, then start collapsing these little arrows over here until you get back up to the div with class nba-stat-table__overflow.

As you can see, we have nba-stat-table, and within that we have nba-stat-table__overflow. You can tell here that it's not going to be a problem, but I typically go with the most interior, most specific option I can to get started. That way, if there were another class nba-stat-table further down the page, I'd be sure I'm being as specific as possible. Both of these cover the stat table, so I'm going to go with nba-stat-table__overflow. Inside it we have our table, table head, and table body, so it's going to be very similar to what we did with Beautiful Soup before.

So, parser.find: we want to find the div with attributes equal to this. When you pass attributes in, they go in a dictionary: the key is the type of attribute and the value is what we're looking for. If I knew there were multiple nba-stat-table__overflow classes, I would need to specify something in addition, so maybe class nba-stat-table__overflow and then a next key, something like data-fixed with a value of 2, and so on, to be as specific as you can. It just so happens that for this one we don't need to; we can just use nba-stat-table__overflow, and that's two underscores there. Then we close off our dictionary and close off our parentheses, and that's going to find the table. I was going to declare an empty list here for our list of rows, which I actually don't need. Just kidding.

Then I'm going to find my headers: table.find_all('th'). Then header_list equals..., and basically what I'm doing is, I found all the table headers, and I want to pull out the text value and strip away any leading blank space, trailing blank space, newline characters, anything like that that could be hiding in there. I just want the text values, which is going to make some of the ones with percentage signs look a little goofy, but we can handle that; we know what data we're pulling and what it looks like, so it doesn't matter for right now. So we want h.text.strip() for h in headers, and we don't want the first header. Just like in our Basketball Reference example, the first column was the rank. Here it's not showing it to you, because it's just sorting by whatever you choose, but there's a hidden value there, the rank it's sorting on for each of these categories, and we don't want that. We'll see in just a minute what else we need to do, because it's a little more widespread than that, but for right now that's fine.

If we run that, and let me make this bigger for now, let's take a look at that header list and see what we have. Okay, so we start with player, team, and go down through all of our columns. Like I said, some of those look a little different with the slashes, because that's an escape character, so it has to be coded a little differently to come through in the text. But wait a minute: we have all these RANK headers here that we don't want. Those are all hidden fields that define how the table gets sorted based on whichever field you choose to sort it on.
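The table lookup and header extraction described above might look like the sketch below. The HTML snippet is a made-up miniature standing in for driver.page_source, and the parser is Python's built-in html.parser rather than the lxml parser used in the video, so the example runs without extra dependencies beyond Beautiful Soup. The class name follows what is read out in the video and is otherwise an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the stats table; the real page has many more
# columns. In the video this string would come from driver.page_source.
html = """
<div class="nba-stat-table__overflow">
  <table>
    <thead><tr>
      <th></th><th> PLAYER </th><th>TEAM</th><th>GP</th><th>GP RANK</th>
    </tr></thead>
    <tbody><tr>
      <td>1</td><td>Dillon Brooks</td><td>MEM</td><td>68</td>
    </tr></tbody>
  </table>
</div>
"""

# html.parser is swapped in for lxml here purely to keep the sketch portable.
parser = BeautifulSoup(html, "html.parser")

# Attributes go in a dictionary: key = attribute name, value = what to match.
table = parser.find("div", attrs={"class": "nba-stat-table__overflow"})

headers = table.find_all("th")
# Skip the first header (the unlabeled rank column) and strip whitespace.
header_list = [h.text.strip() for h in headers[1:]]
print(header_list)
```

On this sample the hidden "GP RANK" header still comes through, which is the same surprise the video runs into next.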
So we need to update this a little bit: header_list equals a for a in header_list... What we're doing is creating a new list here. Actually, let's call it header_list1 so we can compare them. So header_list1 equals a for a in header_list if not 'RANK' in a. Remember when I talked about list comprehensions before: reading it this way can be a little confusing, so let's start from the back. The last part is if not 'RANK' in a, so if that string RANK, all caps, is not in the header string, then we're going to add it. Then, for a in header_list. So, reading it backwards again: if RANK is not in a, then for each a in header_list, we want to add a to header_list1.

Another way to write that would be: for a in header_list, if not 'RANK' in a, header_list1.append(a). That would be the long way of writing it. We can write for header in header_list if that makes it easier to understand than an abstract a: for header in header_list, if not 'RANK' in header, header_list1.append(header). We're looping with a for loop over every value in header_list. First we look at PLAYER: RANK is not in PLAYER, so we continue on to header_list1.append, and we add PLAYER to header_list1. We do the same thing for TEAM, and we keep adding all of these until we get to GP RANK (games played rank); then we see that RANK is in the header, so we skip the append altogether and move on to the next one. It just so happens that all of these are at the end, so if you were so inclined you could just cut the list off at the end, but that's not always going to be the case. I typically like to find one commonality among all the things I'm removing and cut them all out at once.

Let's do a little comparison here and make sure they're the same. header_list2 equals an empty list, and in the loop we do header_list2.append. So we're doing it twice, once for header_list1 and once for header_list2, and they should be the same. Let's look: the normal header_list has all the ranks. header_list1 does not have the ranks. header_list2 also does not have the ranks. Now a quick check to verify that header_list1 and header_list2 really are the same: header_list_check equals a for a in header_list2 if not a in header_list1. So anything that's in header_list2 and not in header_list1 is going to go into header_list_check. We run that, and it's empty, so there is nothing that is in one and not the other, and we can feel good about that. Let's see if I remember my hotkeys... I don't, so let's just comment that out. header_list1 is what we want. Normally I would just reassign that to header_list, so I don't have multiple similarly named lists floating around, but for this video we'll leave it separate so you can tell it's a separate entity.

Now that we have our headers, you may notice something here, and I'll show you in a minute what happens if we don't catch it, because several times I've gone through this and not caught it. Let's take a look at our stat table and make sure we're seeing everything. We have player, team, age, going all the way across, ending with true shooting percentage, usage percentage, pace, and PIE. But in our headers we go past PIE: field goals made, field goals attempted, field goals made per game, field goals attempted per game, and field goal percentage. What that tells me is that there are some hidden columns here that I'm not seeing right now and don't have data for. That's going to be a problem going forward. For right now, just keep in mind that that happened, and we'll deal with how to address it in just a moment, because I want to make sure we can get all of the data out of the table as well.

Next we need to establish what the rows of the table are. I usually do a quick, easy variable: rows equals table (remember, I've already established the table variable) dot find_all, except now instead of finding the headers I want to find the rows, 'tr'. I don't want the first row, though, because the first row is going to be the header. Let's take a look: if I pull just the first row, that's the actual header, which I don't want. So just like we did in our Beautiful Soup example with Basketball Reference, we're going to leave the zeroth row out and start with row one, because I just want the data underneath; I've already pulled the column names out. So rows equals table.find_all('tr')[1:]. Now let's take a look at that real quick. As you can see, it's a little bit gross, but that is all of the data; if we were to inspect the element and look at the HTML, that's what we'd see. So we need to clean that up.

I'm going to do player_stats equals, and this is going to be another list comprehension: td.getText().strip(). If we come back over to the page and look at our data in the HTML, each of these cells of data is a td. I don't know why it's not tc, because I think of that as a cell. th is table header, tr is table row, and td, table data, would be my best guess. So each of these is an actual cell going across the row. That's why we're finding all td, and I want to get the text from there because, just like with the headers, I don't want any trailing or leading characters or newline symbols. I just want the text from that cell. So: td.getText().strip() for td in rows[i].find_all('td')[1:], close that one and that one, for i in range(len(rows)), and make sure we're all closed out. Now let's run that. It runs, so let's take a look, zoomed out a little so we can see it on one screen.

Okay, player_stats: let's read this again from the back. for i in range(len(rows)): to give you an idea of what that is, rows is every row in the table, and len(rows) is 524, so there are 524 rows in this table. We could have put for i in range(524) and it would have been the same thing, but typically you don't want to spend the time finding exactly how long it is beforehand; you want to automate it. So for each of the 524 rows (indices 0 through 523), we get the table data, every single value going across that row. That is done for every row: row 1, row 2, up to 524.
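A minimal sketch of the row extraction just described, run against a made-up three-row table rather than the live page (the second player row is placeholder data, and html.parser stands in for lxml so nothing beyond Beautiful Soup is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical sample table: one header row followed by two data rows.
html = """<table>
  <tr><th></th><th>PLAYER</th><th>AGE</th><th>GP</th></tr>
  <tr><td>1</td><td> Dillon Brooks </td><td>24</td><td>68</td></tr>
  <tr><td>2</td><td>Example Player</td><td>30</td><td>10</td></tr>
</table>"""
table = BeautifulSoup(html, "html.parser").find("table")

# Drop row 0 (the header row); keep only the data rows.
rows = table.find_all("tr")[1:]

# Index-based form, as in the video: for each row, take every <td> except
# the first (the rank) and keep only its stripped text.
player_stats = [
    [td.get_text().strip() for td in rows[i].find_all("td")[1:]]
    for i in range(len(rows))
]

# Equivalent form that iterates over the rows directly, with no index.
direct = [[td.get_text().strip() for td in row.find_all("td")[1:]] for row in rows]

print(player_stats == direct)
```

Both forms build the same list of lists; iterating over rows directly avoids the range(len(...)) bookkeeping.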
We're finding all of the table data for that row, again skipping the first one because we don't care about the rank, and then for each cell we find, we do a getText and strip. That is what that list is compiling. If we look at it, we have everything going across the table: Dillon Brooks, Memphis, 24 years old, 68 games played, 32, 36... we have all of this data here. Now let's double-check: we should end on 5.9 for his record. We sure do, 5.9. That means we do in fact have hidden values. Let's look at just one record first, Dillon Brooks: the row ends at PIE, 5.9, with 103.39 and 24.4 just before it. You can tell we're not going past PIE. So we have verified that the headers past PIE, the field goals made and so on that we saw in the header list, are hidden columns with no data.

If we didn't catch that and just tried to dump this into a data frame, pd.DataFrame(player_stats, columns=header_list1), we would get an error that says 27 columns passed, passed data had 22 columns. What that's saying is that our columns variable is giving it 27 columns, but the data we're passing only consists of 22.
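The header cleanup and the column-count mismatch can both be reproduced in plain Python before pandas is involved. The sketch below uses a shortened, made-up header list (the abbreviations FGM and FGA are assumptions, not read from the page) and a helper named column_mismatch that exists only for this illustration.

```python
# Shortened, hypothetical header list: visible columns, two hidden columns
# with no data, and two hidden RANK columns.
header_list = ["PLAYER", "TEAM", "PIE", "FGM", "FGA", "GP RANK", "PIE RANK"]

# List comprehension from the video: keep headers that do not contain "RANK".
header_list1 = [a for a in header_list if "RANK" not in a]

# The same filter written as an explicit loop.
header_list2 = []
for header in header_list:
    if "RANK" not in header:
        header_list2.append(header)

# Comparing with == checks both contents and order in one step.
same = header_list1 == header_list2


def column_mismatch(headers, row):
    """Return how many more headers there are than values in one data row."""
    return len(headers) - len(row)


# One sample data row; it stops at PIE, like the rows on the page.
row = ["Dillon Brooks", "MEM", "5.9"]
extra = column_mismatch(header_list1, row)

print(same, extra)
```

A positive result from column_mismatch is the situation pandas reports as "N columns passed, passed data had M columns", so checking it up front makes the cause easier to spot.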
So that means we have five extra columns somewhere. Let's take a look at header_list1: one, two, three, four, five, yep. And we're fortunate here: all five of them are at the end, so we can just chop off the last five values. If they were interspersed throughout, unfortunately you'd kind of have to go through and figure it out. I'm sure there's a better way to do it, but at the level we're looking at here, the easiest way is just to pull them out. If you're pulling straight out of a table, you can be pretty sure that what you see is going to come in in this order. So we're going to do header_list1 equals header_list1, starting at the beginning and ending five values before the end, [:-5]. If we run that and look at header_list1 again, we can see that we now end at PIE, because we chopped those values off the end. Since we kept header_list1 as the variable name and just reassigned it, we should be able to rerun the data frame cell with no error. Now we can come back and run stats again, and we have our data frame of advanced stats. Let's take a look: 524 rows, 22 columns, with our generated index here. So we have all of our data, and we can sort, we can group by, we can do all sorts of stuff.

If you want to take it out to Excel to look at it, that's DataFrame.to_excel; I named it stats, so that's stats.to_excel. And let's just put it... where do I have everything saved on this computer? I'll just put it in FanDuel. I don't need that... I am a mess. There we go. We'll put it in this folder and name it advanced stats example.xlsx. We run that, and now we can go to that file location and open it up in Excel, and there you go. You get all these green markers because the values are coming through as text when they should be numbers, so you can just bulk-update that. Obviously not all of this is ready to go into a model by any means, but this gives us an easy way to extract that data at a big-picture level, and lets us sort, compare, and visualize however we want.

That's going to do it for our Selenium example. Coming up next, we're going to look a little more into pandas and actually handling all of this data, now that we've got a strong framework for data gathering, data collection, and web scraping. I'm pretty confident that, with what I've shown you coupled with the documentation, you'd be able to go to virtually any data repository and get what you want. If you're having trouble with anything specific, be it a certain website or a certain function that's just not working the way it looked like it worked in one of my videos, feel free to let me know down in the comments. I'm more than happy to take a look at it and possibly make a video going over that specific issue, if it's going to be beneficial enough for everybody. So thanks again, everyone. If you want to stay up to date, go ahead and subscribe, hit that notification bell, and I will see you guys in the next one.
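As a recap of the final step, here is a small sketch of trimming the surplus headers and building the data frame. The data is a made-up two-row sample (the second row and the team code XYZ are placeholders), the column abbreviations are assumptions, and the Excel export is left commented out because it needs an extra engine such as openpyxl; the file name in that comment is only an example.

```python
import pandas as pd

# Hypothetical sample: the last two headers are hidden columns with no data.
header_list1 = ["PLAYER", "TEAM", "AGE", "FGM", "FGA"]
player_stats = [
    ["Dillon Brooks", "MEM", "24"],
    ["Example Player", "XYZ", "30"],
]

# Work out how many trailing headers have no data instead of hard-coding 5.
# The guard matters: slicing with [:-0] would return an empty list.
extra = len(header_list1) - len(player_stats[0])
if extra > 0:
    header_list1 = header_list1[:-extra]

stats = pd.DataFrame(player_stats, columns=header_list1)

# Scraped values arrive as text; convert numeric columns so that Excel (and
# any later model) sees numbers rather than strings.
stats["AGE"] = pd.to_numeric(stats["AGE"])

print(stats.shape)

# Optional export, assuming an Excel writer engine such as openpyxl is installed:
# stats.to_excel("advanced_stats_example.xlsx", index=False)
```

Converting the numeric columns before exporting is one way to avoid the stored-as-text markers mentioned in the video.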
Info
Channel: Nick's Niche
Views: 3,110
Keywords: Python, Python Tutorial, Learn Python, Python For Beginners, Daily Fantasy Lineup Optimization, Web Scraping, web scraping for daily fantasy, Python for data science, sentdex, bs4, Web Scraping Tutorial, Data Scraping Tutorial, How to Data Scrape, How to Web Scrape, Request, Selenium, Selenium Tutorial, Selenium Python, How to Selenium, Selenium for Python, NBA, Python Selenium, selenium tutorial for beginners, selenium webdriver tutorial, selenium python, selenium python tutorial
Id: fAW6AxMHego
Length: 32min 59sec (1979 seconds)
Published: Thu Aug 06 2020