Tidy Tuesday live screencast: Analyzing beach volleyball in R

Captions
Hi, I'm Dave Robinson, and welcome to another one of my screencasts, where I'll be using R in RStudio to analyze data I've never seen before. As usual, the dataset comes from the Tidy Tuesday project, which is an amazing weekly project run by the R for Data Science online learning community. Special shout-out to everyone who's coming in live and is going to be watching this analysis as we go. I've been really excited, in this last month or so that I've been doing live screencasts, to get suggestions from people as the screencast proceeds. So if you have an idea for a graph, if you have an area of the data you'd like me to explore, or if you see a bug in my code, I really encourage you to speak up in chat. So yeah, I'm really excited, and let's get started.

First, people can hear me, right? I can never know, so someone tell me if they can hear me. I'll keep an eye on that. All right. Today's data comes from beach volleyball. I don't know anything about beach volleyball, so it's exciting to see. Awesome, people tell me they can hear me. I have done one dataset for Tidy Tuesday before that was on tennis tournaments, which looked at who won in tennis. It's not that similar to beach volleyball, but one thing I really remember was fun about that was using early rounds in a tournament to predict who would win later, so that might be fairly interesting to look at here, even though I don't know much about beach volleyball. Oh, there are things like [unclear].

OK, let's grab the data. I'm going to get it using the tidytuesdayR package in a new Rmd, and save it as beach volleyball. I'll do library(tidyverse), and I'm also going to do theme_set(theme_light()). I often want scales, so I'm going to load that while I'm at it: the scales
package, for things like showing a percentage scale on the y-axis. Oh, it's downloading a lot of data. Let's find out what's in it. It's one really tall, really wide dataset. Well, that's interesting. I'm probably going to want to gather this a little, to pivot this a little bit, the new way to do it. So we have vb_matches. Wow, it's really big, still downloading. That's exciting. Tell me while I'm waiting: is anyone following along? Do people follow along with the code? I'm really interested in how many people do, and how many people just watch and maybe follow along later.

All right, so it looks like we have data on circuits, tournaments, country. It looks like it's one observation per match, and then there's the tournament. OK, so if I count circuit, tournament, and date... OK, some have a lot of matches. I was quickly looking through this. I'm also going to get some quick help on it by showing the readme. I see, I can see the readme at the same time. That's so cool. Nice, thanks to the tidytuesdayR package. Looking through what I want to look at: the circuit, tournament, country, year, date, team. It looks like gender is just male and female. Yes, there's male and female beach volleyball.

I am curious what period of time we are looking at. Because it's just one object, I'm going to say tuesdata$vb_matches; if nothing else, I think that'll make the autocomplete a little better. And here's my vb_matches. Here we go. I want to count by year. It looks like we have 2000 to 2019, so it's only twenty-first-century games. So we have the date, we have the match. Now, how many combinations of tournament, circuit, and date do we have? We have 700 tournaments here, but the total of this is how many
matches. All right, so one thing we notice right away is that there are a lot of columns that start with W and L, winner and loser: winner player one name, winner player one birthdate. OK, so we have a wide dataset. I'm supposed to be using pivot_longer() for this. My instinct is to use gather(), but I'm going to do it the new way, which means I need to remind myself how it works. This is the way tidyr now generally recommends to take data from a wider form to a longer form.

I'm going to say pivot_longer(), and the columns I want to pivot are anything that starts with W or starts with L. I think I do vars(), starts with W or starts with L. Let's find out if I got this structure right. Nope. Hmm, nope. What is the way to do it? Let me take a quick look. cols equals starts_with()... oh, it doesn't like that. What if I do a c()? Nope, doesn't like that. Oh, OK, in this case it's actually showing that it doesn't like the fact that some of these are birthdates and some are character vectors, which actually makes a lot of sense. So I wonder if there's a way to deal with the types. names_ptypes... I'm really curious about this. The trouble is the value column: it's going to sometimes be a date and sometimes other things.

I'm actually going to do something where I say mutate_at(): for all columns that start with W or start with L, I'm going to say as.character. It's going to change all of them, even the ones that are dates, to characters, and then it pivots longer. I also notice that it has a name and a value. I can keep it as name and value. I'm going to keep this as vb_long. It's not tidy yet, because right now, look at these names. These
names are going to be like this, so I actually need to separate them. separate() is a great function in tidyr. I want to separate by underscore, and I want to separate the name out into three columns: winner_loser, player, and then name, which will now just be age, birthdate, country, etc. Oh, my bad, I actually need this to be in a vector. I didn't catch that, and nobody told me about it. So what I'm actually doing is separating these out into three columns, and there are a few that I'm missing. Which ones? With a distinct on name: which ones don't break into three components? This one doesn't, w_player2, and this one doesn't, w_rank. And some of them contain additional underscores. The way I can fix that is I say separate by underscore (this is turning into a bit of a data tidying process), but I also want extra = "merge" if there are additional ones, and fill = "right". What do I mean by fill = "right"? Well, let's count winner_loser, player, name. fill is going to fill in the missing values if there aren't enough pieces. I'm going to pipe this to View(). I don't really need a count; I could do a distinct.

All right, notice what we have now: this is the player, this is the rank. What that tells me is that player1 should just say name in these cases, but if it's rank, it should say rank. Wait, l_rank... oh, loser rank. OK, so it's not a player in this case. So what am I seeing here? The structure of this wide data is that it says l_player1 as opposed to l_p1_name. I really would have liked it to be l_p1_name. In fact, I think it might be a little easier if I rename. I could do this a few other ways, but I'm actually going to do it this way.
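The pivot-then-separate steps described above can be sketched on a toy table. This is a minimal sketch, not the screencast's actual code: the column names only imitate the w_/l_ pattern mentioned in the captions, and the values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide match table; names and values are assumptions.
vb_matches <- tibble(
  match_id = 1:2,
  w_player1 = c("Ann", "Cat"),
  w_p1_birthdate = as.Date(c("1980-01-01", "1985-05-05")),
  w_rank = c("1", "3"),
  l_player1 = c("Bea", "Dee"),
  l_p1_birthdate = as.Date(c("1982-02-02", "1990-09-09")),
  l_rank = c("2", "4")
)

vb_long <- vb_matches %>%
  # Dates and strings cannot share one value column, so cast all of the
  # winner/loser columns to character before pivoting.
  mutate_at(vars(starts_with("w_"), starts_with("l_")), as.character) %>%
  pivot_longer(c(starts_with("w_"), starts_with("l_"))) %>%
  # extra = "merge" keeps any further underscores inside the last piece;
  # fill = "right" pads two-piece names such as "w_rank" with NA on the right.
  separate(name, c("winner_loser", "player", "name"),
           sep = "_", extra = "merge", fill = "right")

print(vb_long)
```

In this sketch, "w_p1_birthdate" splits cleanly into three pieces, while "w_player1" and "w_rank" leave name as NA, which is the gap the renaming discussed next is meant to close.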
I'm going to say: w_player1, rename that to w_p1_name, and w_player2 to w_p2_name, and the same for the losers. I'm doing a few renames. This is definitely not the only way to do it; I could have done it with some if_else() calls down here, but something about this kind of speaks to me. I wanted it to say name there, because that field has a player's name. Here we go. And w_rank. All right, the last thing is that the rank occurs by team, not by player. So for the remaining one I can quickly say mutate, player = coalesce(player, ...). coalesce() will replace any missing value, here with "team", if there's a metric that does not happen by player one or player two.

I'm actually going to try doing all this in one step. Why am I doing this in one? Because this pivot_longer() takes a little bit of time to turn this into a taller dataset. There we go, vb_long: winner_loser, player, and w_p1_name is Kevin Wong. And actually, I do want to pivot that out wider. Yeah, I do, because you're not really going to want name, birthdate, and age all stacked on top of each other. It was useful at that moment, but what I really want is to spread these back. I used to say spread(); now I'll say pivot_wider(). Let me quickly remind myself, because as I said, I'm kind of used to gather() and spread(), but I'm going to push my own comfort zone today and teach myself how the new recommended functions work. id_cols... I don't need that. I want names_from = name and values_from = value. That's pretty easy to remember: names from name, values from value. I called this vb_long. I'm separating these steps because it takes a little
bit, because this process takes a little bit of time. "Failed to create output due to bad names." Some quotes, does that help? No, it doesn't. I imagine one of the issues is that there are names that are missing. I thought I fixed that. Did I not fix it? Oh, I see, look at that. The trouble is that it's w_rank, so I did the coalesce on the wrong thing. It actually should be name = coalesce(name, player), which is a little silly. And then player is if_else(): if player is rank, then use team, just because I don't like it saying rank. Oh, and I could have done this another way: I could have said w_team_rank = w_rank, and the same for l_rank. Oops, I got that backwards. I'm just renaming these, but notice I could have done that a couple of different ways.

Now again, I'm doing a little bit of tidying work with this. Why? Because it was in one row per match, and I might want things by player. OK, so there it is: p1, p2, team, and team is just the rank level. We have winner or loser. I'm going to do one last thing: I don't like winner and loser being in lower case, so winner_loser is str_to_upper() of winner_loser. I like it that way. The next thing I'm going to do is spread the names and the values out, but the other thing I'm going to need to do with that is turn those things back into dates or into other types. Oh, nope. Count name... I wonder what my bad names are. It failed due to bad names.

I have a silly question: if I just did spread(name, value), does this work? There are 14 names and many values. That works, though the rank is going to be a little bit weird, isn't it? I'm still going to keep it like this. I'm going to call this vb_players. Where's rank? Rank ended up in a super weird column all by itself, so I'm actually going to say
vb_players, filter name is not equal to rank. And again, I do not yet know how to get pivot_wider() to work with this, so I'll use spread() instead, which is still supported in tidyr and will be indefinitely, even if it's no longer being actively developed. So here we go. All right, then I sort by match, and you see we now have four rows per match: the winner's player one, the winner's player two, the loser's player one, the loser's player two. We have the various ages, their birthdates, their countries, height, etc. Now we actually have all the observations about each player. We've reshaped the data thanks to a combination of a pivot_longer(), a separate(), and a spread(), so now we have one row per player.

I just instinctively wanted it in a form like this, because I knew I might want to do things with columns like total attacks, and you can't do that very easily otherwise. Or total aces... what's an ace? Anyone know? I don't know, it's a beach volleyball move. An ace is a point-ending serve. That's cool. A serve, I think, starts off a volleyball point, and if you get the point on it, that's an ace. I don't see a lot of non-NA values, huh. I could check that in a second.

All right, I'm going to look at a couple of other columns we might need to tidy. So the circuit, tournament, the date, the gender, which match it is. Now we have, again, one observation per match per player. And we have scores. Why are there two scores for this? Why are there multiple scores within one match? This doesn't have to do with our reshaping. It was only one observation, but it still had 21 to 18, 21 to 12. Oh, I get it: it's because a match has multiple games, so you could say game one, game two, game three. Here's something fun with that. Someone said it's a set, so it looks like it's best out of three.
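The route back to one row per player per match, described earlier in this passage, can be sketched as follows. It is only an illustration: the toy rows are invented, and the exact order of steps in the screencast may differ.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy long-format data as it might look right after separate(): the
# team-level "rank" landed in `player` and left `name` missing.
# All values are made up for illustration.
vb_long <- tribble(
  ~match_id, ~winner_loser, ~player, ~name,  ~value,
  1,         "w",           "p1",    "name", "Ann",
  1,         "w",           "p1",    "age",  "30",
  1,         "w",           "p2",    "name", "Bea",
  1,         "w",           "p2",    "age",  "28",
  1,         "w",           "rank",  NA,     "1",
  1,         "l",           "p1",    "name", "Cat",
  1,         "l",           "p1",    "age",  "25",
  1,         "l",           "p2",    "name", "Dee",
  1,         "l",           "p2",    "age",  "27",
  1,         "l",           "rank",  NA,     "2"
)

vb_players <- vb_long %>%
  mutate(
    # Where name is missing, the metric name is sitting in `player`.
    name = coalesce(name, player),
    # Rank belongs to the team rather than to player one or two.
    player = if_else(name == "rank", "team", player),
    winner_loser = str_to_upper(winner_loser)
  ) %>%
  # Drop the team-level rank so every remaining row describes a player.
  filter(name != "rank") %>%
  # spread() is superseded by pivot_wider() but still works.
  spread(name, value)

print(vb_players)
```

The result has four rows for the single toy match (W/p1, W/p2, L/p1, L/p2), each with its own age and name columns, all still stored as strings.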
Is it always three? I'm not sure, but I'm going to find out a little bit more about that. Now, we have a duration. I wonder, actually, in vb_players, what is the class of duration? It's a difftime. That's great, so the readr CSV parsing actually worked. All right, one thing I learned through commenters: in men's volleyball it's three sets, and in women's it's two sets to win. All right, so you can see here... actually, that doesn't hold here. First the second team won, then the first team won, then the first team won. And this one was two sets. Huh. We'll look at that in a bit.

OK, and then we have the player-level data. So this was vb_players. I have one more thing that I want to look at. I'm actually going to go back to the matches, not the player level. We don't have it by set yet, but check this out. Imagine that as soon as I grabbed this, I mutated it and said match_id = row_number(). You'll never find this match ID in any tournament or in any other dataset; I'm just keeping it so that I have a unique identifier for every match, not for every player, not for every set. Then I can select match_num, and do I want any of these other things? Yeah, why not: I could say circuit through gender if I liked. These are things that are common across a match. Here's match one. And then I include score.

OK, now the question is, what can I do with this? Can I visualize scores? Not much, because it's all in one value. So I've got one more tidying trick I'm going to show. It's called separate_rows(). I'll say I want to separate the score into multiple rows based on a separator of comma-space. That's different from separate(), which divides one variable into multiple; this divides each observation into multiple. Now the first one has been
split into separate scores. And what's really great about that is that all of these, fingers crossed, are going to have two scores: a team one score and a team two score. And I wonder, is it always oriented so that the team who won the match goes first? Let's take a quick look at the documentation on score. Where is the documentation on score? Here it is. It doesn't tell me right there. In this one, so far, the winner has won each of those. I'm going to actually view it for a second. Does the winner always come first? If I look at five, it goes: winner came first, winner came first. OK, I'm going to assume the winner comes first, and then I'll test that.

What I'm going to do is separate the score into w_score and l_score. Oh, and I need to put this in a vector. It automatically separates on non-alphanumeric characters. Oops, but there are additional pieces. Where? What if I actually took this and said slice(), that is, I want particular rows? I'm going to grab two of those examples, 1031 and 1071. "Forfeit or other." OK. What I'm going to do is say count(score). Do you see any of these that don't look like scores? Yeah, not so many. So what I'm actually going to do is say mutate, and I'll say score equals, nice little trick, na_if(): if the score is "Forfeit or other", replace it with an NA, and separate() will just go ahead and ignore that; it'll have NAs in both of those fields. Or so I assumed. But what am I missing? slice(14104), select(score). Hmm. Oh, that's the wrong one, because I need to do it after this step. 14104: "retired". Huh. I'm going to try to just get rid of those, because I don't know what they mean. And if I don't understand a question, I won't respond to it. I'll say
score = str_remove(score, "retired"). It looks like it's pretty rare. And if I keep running into... my bad, look at me, I'm missing a parenthesis. There. OK, at least it separated well this time. But notice they're characters; they're still strings until I do convert = TRUE, at which point it does automatic type conversion using type.convert(), and you have 21, 18, 21, 12. OK, so this is called vb_sets. We now have it at the set level, and now I can test my hypothesis, which is that the winner's score is greater than the loser's score. So I'm going to say mutate, winner_won = w_score > l_score, that is, the winner won, which sounds kind of silly: whether the overall match winner won this set. Then I'll group by match_num and summarize. What am I doing here? I'm trying to find how often the winner won: pct_winner_won = mean(winner_won). Ha, didn't work. Did anyone catch where my bug is? Hmm, no, that's not it. Not that either. Oh, silly me: I didn't mean match_num, I actually meant match_id, and I did not rerun this after adding a match ID. Or I probably didn't include it.

OK, now if I plot how many games the winner won, I would hope that it's always a third or more. Yeah, which it is, right. So that was a quick confirmation that the first one listed in each of these is the match winner. There could be one or two cases where this doesn't hold, but I'm not overly worried about the data being wrong in a tiny fraction of cases.

OK, so now we have sets. That was some cleaning of the data. I have it at the player level, I have it at the set level, and now we've spent half an hour on data cleaning. We learned a couple of tricks like separate(), pivot_longer(), spread(), and separate_rows(). I did separate() twice.
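The score-tidying steps above can be sketched end to end on a toy table. This is a sketch under assumptions: the score strings only imitate the formats read out in the captions ("21-18, 21-12", "Forfeit or other", a trailing "retired"), and the real data may format them differently.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy match-level scores; the formats are assumptions based on the captions.
vb_matches <- tibble(
  match_id = 1:4,
  score = c("21-18, 21-12",
            "19-21, 21-15, 15-10",
            "21-18, 15-9 retired",
            "Forfeit or other")
)

vb_sets <- vb_matches %>%
  # One row per set instead of one comma-separated string per match.
  separate_rows(score, sep = ", ") %>%
  mutate(
    score = str_remove(score, " retired"),
    # Non-scores become NA so separate() yields NA in both new columns.
    score = na_if(score, "Forfeit or other")
  ) %>%
  # The default separator splits on non-alphanumeric characters ("-");
  # convert = TRUE turns the resulting strings into integers.
  separate(score, c("w_score", "l_score"), convert = TRUE) %>%
  mutate(winner_won = w_score > l_score)

# Share of sets taken by the team listed first, per match.
by_match <- vb_sets %>%
  group_by(match_id) %>%
  summarize(pct_winner_won = mean(winner_won))

print(by_match)
```

Grouping by a per-match identifier such as match_id is what makes the check meaningful; grouping by a column that repeats across tournaments would pool unrelated matches together.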
I'm going to look at players. What I'm going to be interested in is: who are good volleyball players, who tends to win? Can I take data before some point and predict points after? So I'm going to look at vb_players. Oh, one quick question: does every volleyball player always have a name? What if there were some matches that didn't have a player two? Can I filter for is.na(name)? No, it doesn't happen. It looks like it's always doubles volleyball, so it's always two against two. That's worth checking, because the spread() and all the other steps would have still run if it was sometimes NA.

All right, so I have my players. I'm going to group by the player name and summarize the number of matches. I'm doing more than just n_matches, otherwise I'd just use count(). That's cool. I'm going to do pct_winner = mean(winner_loser == "W"): how often do they win? Wow, Kerri Walsh Jennings won the most. I'll also group by gender, just so I can see it. Kerri Walsh Jennings is a really incredible player, apparently. Again, I know nothing about beach volleyball, but she has played more games than almost anyone and has won 87.6 percent of them. We'd expect that of these top players: they've all played many games, and they've won a majority. That's why they stuck around so long, and when I say stuck around so long, I mean played in many matches. I can also say first_game is min(date) and last_game is max(date). And look at this: Jake Gibb and Kerri Walsh Jennings have both been active for something like two decades. Yeah, you'd have to play for a lot of time to be at the top here. So this is interesting stuff. I got a question. Let's see, I'm going to jump into that question in a second.
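The per-player summary just described can be sketched like this. The table is a made-up stand-in for the player-level data, and the column names (name, gender, winner_loser, date) are assumptions taken from how they are spoken in the captions.

```r
library(dplyr)

# Toy player-level rows: one row per player per match (values invented).
vb_players <- tibble(
  name = c("Ann", "Ann", "Ann", "Bea", "Bea"),
  gender = c("W", "W", "W", "W", "W"),
  winner_loser = c("W", "W", "L", "L", "W"),
  date = as.Date(c("2001-06-01", "2005-07-01", "2019-08-01",
                   "2010-06-01", "2011-06-01"))
)

by_player <- vb_players %>%
  group_by(name, gender) %>%
  summarize(
    n_matches = n(),
    # The mean of a logical vector is the share of TRUE values.
    pct_winner = mean(winner_loser == "W"),
    first_game = min(date),
    last_game = max(date),
    .groups = "drop"
  ) %>%
  arrange(desc(n_matches))

print(by_player)
```

Using .groups = "drop" returns an ungrouped table, so later filters and arranges do not carry the grouping along.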
But I think there's a really interesting one about: can I see which players tend to play together? That is an interesting question, and I'm going to jump into it in a second. So Kerri Walsh Jennings is clearly a really terrific player. I'm going to visualize who's been playing the most and how often they are a winner, and I'm only going to do it for people who have played in at least 500 matches, as a start. I'm going to use geom_point(). Do I need to put something on the log scale? Maybe I do. Here we go. Generally, the more someone's played, the more likely they are to be a winner. I'm doing a quick scatterplot just to see that it's what I would have expected. Let me put it on a log scale. Yeah, I like that more. I'm also going to set scale_y. If you've watched my screencasts for a while, you know I'm a big believer in log scales and in percentages on the y-axis. I'll throw in labels: number of matches since 2000, and percentage of matches won. That's cool stuff.

Good, I already ungrouped the data when I summarized. Someone asked a great question in the comments: why ungroup? It's because otherwise I'd still be grouped by name, which could slow down even a step like this filter. summarize() only peels off one of the grouping levels. So I've got number of matches and percentage of matches won, and in general you can see they're positively correlated. That is one thing worth knowing. And where's Kerri? Yeah, there's Kerri Walsh Jennings. Just for kicks, it's not the most salient thing about it, but I'm going to throw in color = gender for men's versus women's. All three of the other individual players who have won the most were playing in women's volleyball. All right, and then overall we have
them mixed together a little bit more. So there's one incredible player who has played almost a thousand games, and she's won about ninety percent of them. I can quickly check on that by saying arrange. I could say arrange(desc(pct_winner)), but that's not going to work: it's going to include a few people who have played very few games. I could go back to n_matches greater than 200. And Misty May-Treanor, it looks like, is a player who has won ninety percent of her games, and she played for sixteen years. All right, so I'm guessing people who are fans of volleyball are enjoying this.

I got one interesting question: one thing people have seen folks do is label the axes when the scale is logged. Should they say "log scale"? I think that's a pretty good idea. I think I put it on the wrong axis. Definitely don't do that: don't put it on the y when you've logged the x. I think it's a good idea because it might otherwise not be obvious. Having said that, I feel like half of my graphs use a log scale, so I definitely recommend looking closely at the axes. There's my 300, 500, to 1,000. Yeah. All right, cool. This is still just an early start, but I think it's interesting.

OK, so that was one of the ways we could aggregate by player. Now let's see what else we have at the player level. I'm worried about all these ones: how many NAs are there? I'm going to say summarize_if(is.numeric, ...), because I'm a little less interested in the other ones. Nope, none of them are numeric. Oh, I forgot to do something in my data cleaning. Did anyone catch it? I don't see anyone mention it, but I could easily have missed someone. It's that all of these are still strings, even the ones that really should be numbers. So here in my vb_players, after spread(name, value), what I need to do is mutate_at(): I need to say, for all the columns from age through total
serve errors, all of these, I need to convert the type. Wait, type_convert()... I'm actually curious about something. All right, I'm going to be using type_convert() instead of mutate_at(). Let's see how this works. I'll use type_convert() from readr. I think I've seen it before, but I had forgotten it exists. It's going to take any column that is a string right now, check it, and see... aha, that's great. It now knows that age is a number and that birthdate should be a date. So it put it back into a really handy format. Isn't that great? I think it's great.

Now we can ask some questions about player-level statistics. First, I'm going to say summarize_all(~ mean(!is.na(.))). Oh, I'm looking at by_player; I shouldn't be looking at by_player, I should be looking at vb_players. What fraction of the data is not NA? I like gathering this together. Why? Because then I get it in one tall format. It's the kind of thing I do sometimes: I just ask, for every single column, what fraction is known. Most have a country, a score, an age, a birthdate, a duration. Then we have the player-level ones. Oh man, only 18 percent of our games have attacks, digs, kills, hit percent, errors, and total serve errors. What's a kill? Let me glance at the documentation: attacks are attack swings, kills are point-ending attacks, and then losing player total hit percent. OK, lots of things we've got at the game level, but again, only on 18 percent. Some more have aces and blocks, but I bet they're the same ones that have this.

What if I look at vb_players and filter for !is.na(tot_attacks)? Yeah, total attacks was one of those; we end up with a much smaller set. I'm curious what it is by year. If I said summarize(mean(!is.na(tot_attacks))) by year, then I'll find out the percentage by year that are not missing.
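The type conversion and the per-column missingness check described above can be sketched together. The toy columns are assumptions (tot_attacks in particular is a guess at the column name from the captions), and the values are invented.

```r
library(dplyr)
library(tidyr)
library(readr)

# After spread(), every value column is still a string (toy values).
vb_players <- tibble(
  name = c("Ann", "Bea", "Cat", "Dee"),
  age = c("30", "28", "25", NA),
  birthdate = c("1980-01-01", "1982-02-02", "1985-05-05", "1990-09-09"),
  tot_attacks = c("12", NA, NA, NA)
) %>%
  # readr re-guesses the type of each character column; it prints the
  # column specification it settled on as a message.
  type_convert()

# Fraction of non-missing values per column, gathered into one tall table.
pct_known <- vb_players %>%
  summarize_all(~ mean(!is.na(.))) %>%
  gather(column, pct_not_missing)

print(pct_known)
```

summarize_all() and gather() are superseded in current dplyr and tidyr but still work; across() and pivot_longer() would be the newer equivalents.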
So all of these first few years are missing it, and then it's a fraction from there on. OK, worth knowing. I'm going to call this pct_has_attacks, and I'm also going to add n, the number. Why am I doing that? Because I don't just want the mean, I want the count as well. What if I said tournament, and arrange descending by n? I'm trying to understand why some have it and some don't. OK, and what if I said circuit? There are two circuits. FIVB mostly doesn't have it, and AVP does have it, sort of. That wasn't too helpful. Year and date... OK, it looks to me like most tournaments never have this information, and some of them have it half the time or two-thirds of the time. No tournament always has it. So this is something that I'm seeing. I'm going to look for a second at just the ones that have it and try working from there. It would be interesting to see things like how total attacks, total kills, etc. affect winning. I could do a regression. Should I? Here's what I'm going to do: I'm going to take on a problem and try analyzing a question. The question that I'm going to analyze... oh, wow, wait, hold on, I got some suggestions and questions. Let me walk through these first, then I'll talk about what I might do with the remaining 20 minutes to explore this data.

All right, so one question is: "David, is there a workflow of some sort that goes on in your mind when it comes to cleaning data? For example, view dataset, check data types, check for nulls, etc." The truth is, things like checking for nulls I sort of just do as I think of them and as I browse through. The most important thing I do when I open a new dataset, like when we looked at the original one, vb_matches, the first thing I do when I look at it is figure out what
each row represents, what each observation represents. In this case it's one per match, which we also could have guessed from the name. Then I noticed, oh man, this data is wide: you've got players in two columns, ages in two columns, heights in two columns. That told me that I probably wanted to reorganize it. I also knew that if I wanted to do anything with score, I'd have to reshape the data there as well. So one of the first things I did was just get the data into a couple of shapes that I'd want to work with. Now, I did end up spending half my time on those shapes, but if I were looking at spending more than one hour on this dataset, this effort would pay off tremendously. And I'm still not done cleaning the data; there are all kinds of things I could probably do to work with this. This is really not players, this is player matches. Yeah, I'm actually going to call it that, because I don't like the name vb_players: it's vb_player_matches. So that was one answer: I like to think about how I want my data shaped.

All right, and I have a couple of other questions. One is: are there any zeros in the high-NA columns? Maybe NA means zero. Hmm, I think that's interesting, and that was a question from John. Except that if that were the case, I wouldn't have expected it to depend so sharply on tournament. The fact that the most common tournament, Gstaad, and the fifth most common tournament had no data at all makes you think they're probably not all zeros. But it is a worthwhile question. Filter for total attacks or aces equal to zero: yes, there are zeros for attacks and aces, so that's not the issue. Thanks, John. Great. Oh, and John points out that he asked before I did the aggregation by tournament. Great, so at least we don't
have to worry about zeros. I think that was a good hypothesis; I love getting these. Someone suggests making it a Shiny metric dashboard. I love doing that stuff, but I've made a few of those recently and I want to try something different today. Here's what I'm going to do: I'm gonna ask a question. My question is, how would we judge a rookie player? I'm gonna say rookie even though this data only goes back to 2000. How can we judge a player from their first year? It's a question that I'm curious about; I don't know if this is the question that a real volleyball expert diving into this would ask. Imagine I only took the players and saw how they performed in their first year. In particular, I could group by the name and say first year is min of year, and now I could say ungroup. Actually, I'll just do it in separate steps; I was trying to be too clever. What I'm going to do is a grouped filter where I say year equals min of year, and call it player first year. So for every one of these players, I'm now looking only at how they did in their first year. What if I summarize the performance based on that? So what if I said group by player (I actually should have said name, because player here is whether they're player one or two, so it's a little unclear), and I'm gonna include those things that I've done before: number of matches, percent winner, first game. I'm gonna summarize by all those things, and now I know them for each of those players. First game and last game are less interesting here, because within their first year it's always gonna be within one year. Some players played 50 or 60 times in their first year, but not too many.
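The grouped filter and first-year summary described above can be sketched roughly as follows. This is a minimal illustration, assuming a reshaped table with one row per player per match; the toy data and the column names (`name`, `year`, `winner_loser`) are stand-ins chosen for the example, not necessarily the ones used in the screencast.

```r
library(dplyr)

# Toy stand-in for the reshaped player-match table (one row per player per match).
# Column names are assumptions for illustration.
player_matches <- tibble(
  name         = c("A", "A", "A", "B", "B", "B"),
  year         = c(2001, 2001, 2002, 2005, 2006, 2006),
  winner_loser = c("W", "L", "W", "L", "W", "W")
)

# Grouped filter: keep only the rows from each player's earliest year in the data.
player_first_year <- player_matches %>%
  group_by(name) %>%
  filter(year == min(year)) %>%
  ungroup()

# Summarize first-year performance, one row per player.
first_year_summary <- player_first_year %>%
  group_by(name) %>%
  summarize(
    first_year = min(year),
    n_matches  = n(),
    pct_winner = mean(winner_loser == "W"),
    .groups    = "drop"
  )

print(first_year_summary)
```

Doing the filter and the summary as two separate steps mirrors the "I was trying to be too clever" remark: the intermediate table stays available for other first-year questions. Note that "first year" here only means the first year that appears in the data.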
Now I'm going to quickly add something. I'm curious: I used height; are the heights always one-to-one with name? Almost. I think I gained one observation here. Oops, I'm still grouped by name after the summarize. So what if I said n greater than one? There was one player who, quote, changed height. I'm nervous that could happen in other cases, so I'm actually going to say mean height is mean of height with na.rm equals TRUE, basically for when they give multiple heights. Oh wow, this could be two players with the same name. Is it? I don't know. Just in case, I'm actually going to separate them. I'm gonna try this: I'm gonna say name, height (they could get taller over the course of it, but I'm feeling they don't), and I'm gonna throw these in as group-bys instead of aggregating them: birthdate, so that's name, height, birthdate, and let me see, their gender, and what else is at their level? Their country, and finally year, which is always gonna be one value for each player here. So there's one player with two heights, that was the one that popped up; other than that I didn't get any data duplication from that. No one changed country within a year, for example. All right, so we see the name, the gender, the height, etc. What if I arranged by descending n matches? Somebody told me, thank you, that Leonardo, the one with two heights, is two different players. So Leonardo Gomez is two different players; thank you for that, Massimiliano. Okay, these are the players who played the most. Those are kind of the boring things; I don't want height anymore. What I also want to know is the errors: total errors is mean of total errors, and I'll remove the missing data. Actually, I'll call it average errors. Some of them are going to have no data; a lot of them, in fact, might have no
data. Especially look at the first one; that's a little frustrating. Percent winner, and if I do average errors, I could also do, what else was I interested in, average attacks, from total attacks. Yes, some people have NaN, but that means that these people have no data in there. That makes a little sense, because we saw that there was no data for the first couple of years in this data set. I'm curious, within this player first year summarized: what if I take the ones that have an average attacks? All right, it's a start. So average attacks, average errors; what else would I want from the data I can show for this player, like their attacks? This is good. And what if I didn't just say did they win or did they lose? Let's see, I'm just thinking: this is how many matches, this is how many matches they won. What if I actually want to know things about the gap in their score? Actually, no. Let me get a few more things. Kills: average kills is mean (this could kind of be done differently, but it's not incredibly easy to), so I'm going to say mean of total kills, na.rm equals TRUE. A kill is a point-ending attack, and an ace is a point-ending serve: average aces. All right, and oh, someone had a great idea, which is their age in that year. I could actually get the exact age, or very close to it, but I'm not going to put too much work into this. Technically they could be born late in the year and play early in the year, but I'm actually gonna say the age is year minus year of their birthdate. To get that I need library lubridate; year is a function from lubridate. How did that work? Because I did a summarize; I don't even understand how that worked. That should have had multiple observations in it. Oh, birthdate was a group; that's
why it didn't hurt. That's interesting; I did not know that I did this based on the birthdate. Huh, okay. All right, and that was their first year's performance. And imagine I threw in a filter on not is.na of average attacks. I'm gonna add one more thing, which is n with data: the sum of not is.na of total attacks, because one of my fears is that maybe we're basing these players on only, like, one game with data. Something went wrong here. Why do I have the same player coming up over and over again? Did I cause this? Did this cause a problem? I don't know what happened right there. Really interesting, um, scary. So here most players have at least a few matches where they have information on their attacks, their errors, their kills, their aces. In fact they're the exact same: I believe if you have data on one, you have data on all of them. Okay, so then I can ask questions about these. Someone has a really good point, which is that someone could have started their career before 2000, which is the first year I have data for. Absolutely true; I've just been working with what I have here. You know, instead of looking at first year, what if I said I wanted to predict 2019? That gives me more data to use. I'm gonna try it: players before 2019. I guess I'm changing my question, which is great, because I didn't have to go in with a question. So if I say year is less than 2019, here's players summarized before 2019, and then I'm gonna compare to 2019. One quick thing: I'm actually gonna filter and say n with data must be greater than 10. Why do only 30 players have more than 10 games with observations? That's a little frustrating. I'll turn down the number there, but that's a little bit frustrating. All right, then someone has a question, which
is: can we graph ace percentage versus serve error percentage? Serve errors: I actually didn't save serve errors, but I will. Average serve errors is mean of total serve errors. And now I can graph average serve errors versus average ace percent. Well, it's not ace percent, it's the average aces, yeah. And this compares serve errors to aces, with a theory that maybe some players are more aggressive. So this is average aces within a match, and this is average serve errors within a match. I can kind of see there could be a trend. I'm also going to throw in an aes of size equals n with data; this is the number of games. And is it positive? It might be, though. I'm curious about something else: what if I then looked at their 2019 performance? I'm actually gonna grab all this and turn it into a function called summarize_players, with dot pipe. If you haven't seen that before, you can create a function with dot pipe. So I say summarize_players and run this through players 2019; it's gonna be the same thing, but with year equals 2019. All right, so now, not is.na: very few players have data on average attacks, so I'm gonna do that filter here. And then here I'm just interested in players 2019. Oh, I don't want player first year! Oh, this is what went wrong. Look at what went wrong: I've been doing this whole thing on player first year. Sorry about that, folks. I want vb_player_matches; yep, player matches is right, not player first year. Whew. All right, that's a relief. Now I have all the players for 2019 that have at least some data. It's gonna be way better; it's probably quite a bit bigger. Ah, yes. All right, now that I'm no longer looking at just the players' first year, there's great news: the average aces and the average serve errors are correlated,
generally something like two serve errors for every ace. That's really cool; that's like aggressiveness. But it also could have to do with whether they're serving or not. This is actually one issue: do we have a total number of serves column? Let's see. Where's total serves? Total aces, total serve errors, total hit percent... ah, nuts, I actually don't know how many serves they did. So this is not a good metric. Don't use this, please, because this doesn't include the total number of serves, only the total serve errors. That's actually frustrating to me. I'm putting this in here: don't trust this. This is probably mostly correlated with the total number of serves, not the rate at which you ace or miss your serves. Even though it's an average, it's like the total serves per game. All right, but what I'm finally going to do is take players before 2019 and join it to the 2019 performance. So I'm actually going to do an inner join on players 2019, by name, and I'm gonna do this only on select name, n matches, and (what was it called? percent winner) pct_winner, by name, with suffix; I'm adding underscore 2019: n_matches_2019, pct_winner_2019. Performance joined. So what I'm now looking at is all my 1,200 players, and I'm taking their performance before 2019 and using it to predict their 2019 performance. If you want to see that, stick around for a few more minutes, even though it's six o'clock Eastern Time. What I'm gonna do is ask how good their performance before is as a predictor, but filter for ones for whom we have, let's say, n matches greater than 10. How does pct_winner compare to pct_winner in 2019? I'd expect a positive correlation. Not as much as I thought. What if I say you must have at least
10 in 2019? Nope, no one has at least ten in 2019. Oops, aha: n matches 2019 at least, I don't know, five. Okay. So one thing I learned from this is that it's actually harder than it looks to predict them. geom_abline, color equals red; and here's the overall best fit line. There's a lot of regression to the mean: even exceptional players, even ones who have won every game before 2019 and have played at least 10 games, could be average players in 2019. I think that's pretty interesting. I could have done that as a logistic regression. I've gotta throw in a quick thing: n wins 2019 is n matches 2019 times pct_winner_2019. Why am I doing this? I want to do a logistic regression, and I want to say I'm explaining the number of wins and the number of losses: cbind is how I'll do a matrix of wins and losses for logistic regression. We're a little short on time, so I'm not going into details there. Data is the dot here. And what am I going to explain this by? Let me explain it by percent winner, and I'll pipe this to summary. I could do it with broom's tidy function too. So overall there's a positive association, and a significant one: if you've won before, you're likely to win again. This is using a generalized linear model. But I'm actually curious about something: what else could I use to explain wins versus losses? One question I have is, could I use their age? Oh my goodness, I grouped by age, didn't I? Did anybody else catch that bug? Look at me, grouping by age. I can't do that at all: that changes every year. And I grouped by year. Nobody caught that; I don't mean to blame you folks. I'm a little confused as to how this worked at all, because I would have thought I'd have ended up with way more players in all this. Okay, that's a way cleaner data set in terms
of how much you won before 2019 and how much you win now. Ouch, I had ages, right; I had duplicated data. And now, if I say percent winner... yeah, I had to fix a few things. An even closer correlation. Oops, so, my bad, folks: I was joining by age. I'm not gonna look at age right now. Instead I could add a few other things: I could lump country to the top three and throw in plus country, and now I have Poland, Russia, United States, and other. Maybe Russia has a positive impact, maybe not. And then one more thing: I don't love country, because that's probably already mostly included in their pre-2019 performance. What if I said average aces, average errors: do their previous errors predict it? The answer is mostly no, I think. My guess here is that most of what you would want to know (average errors, average serve errors) has probably already been incorporated into that percent winner. The more I think about it, the more sense that makes. It's hard to predict: if you already know someone wins seventy percent of the time, it doesn't help you to know what their averages are. It might help to know other things, like age, but I'm not going to dive into that. I'm gonna call this how to predict whether a player will win in 2019. All right, people asked a lot of questions. I don't run over time a lot, but I did run over time on this one. I'd really want to go a lot deeper in terms of whether I can predict if a player will win or lose. What might be interesting is to combine the two players' performance and see whether you can predict based on the best player, the worst player, or the average of the players. Those would be some of the really interesting things that I would dig into. But yeah, that's gonna be it for this one. Okay, so just to finally go through the things that we looked at: we did some data reshaping; in particular we turned this matches data
set into a data set of player matches and one of sets. We didn't use the sets at all, but that one had the winner's score and the loser's score within each set, which we could have started joining back in and built some models on. We also did some aggregation by player: I got a sense of which players had won the most matches and which had played the most matches, and saw that you really have to go down fairly far before you get to a player that is only winning 50/50. And I did a little bit on dividing it between pre-2019 and 2019, and mostly saw that, yes, how much you've won before is a predictor of whether you win again. Looking through the other questions, one quick one was: why did I want the total number of serves? The answer is because this is a bad graph. Number of serve errors and number of aces: both of those should probably have a denominator of the total number of serves within a game. Now, it is the average across games (pardon me, I should say matches), but maybe some players have more serves than others. So I really want this to be a percent, and I don't believe we had that. Okay. There are a few people with questions; what I'm gonna do is stick around in the chat after I stop the video and answer a couple of questions or hear from you. But overall, thank you so much for joining me for another live Tidy Tuesday screencast on beach volleyball. It'll be up on GitHub soon. I hope you had fun; I certainly did. I'll see you next week.
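The pre-2019 versus 2019 comparison from the screencast can be sketched along these lines. It is a rough illustration on simulated data, not the actual beach volleyball data set: the players, the `won` column, and the object names are all assumptions made up for the example. It shows the three pieces mentioned in the talk: a reusable summary built with `. %>%`, an inner join with a `_2019` suffix, and a logistic regression on a `cbind()` matrix of wins and losses.

```r
library(dplyr)

set.seed(2020)

# Simulated player-match rows (purely illustrative): each hypothetical player
# has a fixed win probability and plays 15 + 15 + 10 matches in 2017-2019.
n_players <- 60
skill <- runif(n_players, 0.2, 0.8)
player_matches <- tibble(
  name = rep(paste0("player_", seq_len(n_players)), each = 40),
  year = rep(rep(c(2017, 2018, 2019), times = c(15, 15, 10)), times = n_players),
  won  = rbinom(n_players * 40, size = 1, prob = rep(skill, each = 40))
)

# Reusable summary step, written as a magrittr functional sequence (`. %>% ...`).
summarize_players <- . %>%
  group_by(name) %>%
  summarize(n_matches = n(), pct_winner = mean(won), .groups = "drop")

players_before_2019 <- player_matches %>% filter(year < 2019) %>% summarize_players()
players_2019        <- player_matches %>% filter(year == 2019) %>% summarize_players()

# Join earlier performance to 2019 performance; 2019 columns get a "_2019" suffix.
performance_joined <- players_before_2019 %>%
  inner_join(players_2019, by = "name", suffix = c("", "_2019")) %>%
  mutate(
    n_wins_2019   = round(n_matches_2019 * pct_winner_2019),
    n_losses_2019 = n_matches_2019 - n_wins_2019
  )

# Logistic regression: 2019 wins and losses explained by the earlier win rate.
model <- glm(
  cbind(n_wins_2019, n_losses_2019) ~ pct_winner,
  data   = performance_joined,
  family = "binomial"
)

summary(model)
```

Grouping only by `name` in the summary step sidesteps the bug caught late in the screencast, where a column that changes every year (age) had ended up among the grouping variables and duplicated players. With real data, a stable player identifier would be safer than a name, since two different players can share one.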
Info
Channel: David Robinson
Views: 2,816
Rating: 5 out of 5
Id: MfDdmsW3OMo
Length: 66min 38sec (3998 seconds)
Published: Wed May 20 2020