Tidy Tuesday live screencast: Analyzing Tour de France data in R

Captions
Hi everyone, I'm Dave Robinson. I'm glad the comments are open on this livecast; this is the first time I'm doing a live Tidy Tuesday screencast. Can someone tell me if they can hear me? People can hear me, I guess. All right, I think there's a slight delay, a tiny delay, but welcome everyone. Thank you. As I was saying, this is the first time we're doing a live Tidy Tuesday screencast, where I open a dataset in R that I've never seen before and analyze it using R, RStudio, and the tidyverse. As usual, the dataset comes from the Tidy Tuesday project, an amazing data project in R from the R for Data Science online learning community.

I've got the chat open in a separate window, but I'll probably be looking at it intermittently. In particular, I think I'm going to go through this data a little bit and ask at a couple of points what people want to see next, and wait for ideas. Also, if I get stuck or run into a bug, that'll be a great time for people to jump in and try giving suggestions. I'm really excited to try this out. It's an experiment; we're going to see how it goes.

Okay, so this week, what I know is that it's the Tour de France data. I actually know very little about bicycling, and that's another reason it's exciting to go live: maybe people who know more about it can comment as we go. I'm going to try the tidytuesdayR package to bring in the data. Here we go: I'm going to open this up, save it as a Tour de France file, and load library tidytuesdayR. Nope. Oh really, look at me not even having the package installed yet. I'll install tidytuesdayR; I don't have that one either, do I? An auspicious beginning. tidyverse, oops, the updates. Load up the tidyverse. A lot of Tidy Tuesdays, they
are fantastic. Let's see, we have the data from today, Tidy Tuesday April 7th. All right, let's take a look. Ooh, that's really cool, it actually gives you the documentation. I hadn't seen that from the Tidy Tuesday data yet. Let's see what datasets we have. The first one is tdf_winners; these are the winners, it looks like, of the Tour de France. Then there's stage_data. Let's see, it's a multiple-stage race: 21 day-long stages over the course of 23 days. Oh wait, there's tdf_stages (Tour de France stages) and stage_data. Let's look at all three of these.

First I'm going to look at tdf_winners. It looks like there's one winner for each year: winner name, winner team, what looks like the distance of the race, the overall time, and the margin, which sounds like the margin by which they won. Here we go: the time in hours taken to complete the race, and the difference in finishing time between the race winner and the runner-up. So I can ask questions like: is it getting tighter over time? Then there's the number of stage wins. Then there's one on tdf_stages; interesting, it doesn't have a year in here. I'm just looking through this: the number of stages they won, the stages spent as the race leader, and the winner's height, weight, and age. So we can find out a lot of things about Tour de France winners, even though there are only going to be about a hundred and six of them; the Tour de France looks like it's been going on since 1903. And we have both birth country and nationality, things like that.

If you've seen these screencasts before, you know that one of the first things I like to do is take some categorical data and make bar plots out of it. So let's start by looking at tdf_winners. I'm going to save this as an object, kind of like this, and I'll do it for the other ones too. And
what I often do is say: let's count the birth country, and I could make a bar plot of this. Should I? Yes, for this one I am. I'm also going to test something out. Do I have the latest version of ggplot2? I don't even know if I do; let's find out. Whoops, let's restart R. I recently learned from someone on Twitter that with the newest version of ggplot2, as I understand it, I don't need a coord_flip to get a horizontal bar plot. Let's see if that's true. What I'm going to try is birth country by n with geom_col, but I'm going to flip the birth country and the n. Does this work? This is the first time I'm trying it. Yes! Look at that, can you believe it? It doesn't look amazing yet, because birth country goes alphabetically, so I'll reorder birth country by n. Look at that, that's fantastic. It's already coord-flipped. I could drop the y-axis label and say the title is "What countries were the most Tour de France winners born in?" That's an easy one: it looks like France, Belgium, Spain, Italy, and the US makes it with ten. I'm guessing someone named Lance Armstrong is probably big within that.

All right, and who are the most common winners? I'm not going to make a graph of this one, I think, but I am going to count the winner name. Lance Armstrong won seven. I don't know much about cycling, but I do know that Lance Armstrong pretty famously cheated in at least some of those Tours de France. And if I quickly throw in the birth country, it looks like ten U.S.
winners, but those are really ten U.S. wins and, sorry, really only two winners: Lance Armstrong at the top, and Greg LeMond, from the USA. All right, those are some things we learned about the categorical variables.

Something else that might be interesting is what has changed over time. Let's say the age: how has the age distribution changed over time? Any good ideas of how to do that? One way is a box plot, but I'm not going to do a box plot. What I'm going to do instead is group by the decade. There's a start date; man, I really wish there was a year variable, I feel like I'm going to want it more and more. So I'm adding a year, the year of start_date, up here, and throwing library lubridate up at the top. Now tdf_winners has a year column, the year of the race, and I can say year, and if you've never seen this before, truncated division by 10. What does that do? It makes each of these just be divided into decades. It looks like the race was shut down for World War II, which makes a lot of sense. I wonder if it's being held this year. It would depend on when it's held, but I would be surprised if it wasn't.

What I'm doing here is summarize: winner_age is the mean of age, so that's the average age of the winner. Just to confirm, age is the age of the winner? Yes. And heck, I can even do winner_height as the mean of height. You know what's interesting? Ooh, I'm going to try something that I haven't tried yet. Do we have across? Nope, I don't have the latest version of dplyr. Tell me in the comments: has anyone here used across before? dplyr has a new
functionality, across, that is useful for this exact thing. I thought I had the latest version. Let's see, here we go. All right, check this out. Right now we could take the average age and the average height, and I need na.rm = TRUE, since somebody didn't have a value, fair enough. What we can actually do in summarize now is across; I think it goes like this. Let's check the help for across. What's happening? Did I not reload the newest version of dplyr? Let's see: dplyr 0.8.5, and I'm pretty confident that's the newest version. Why am I not seeing it? I really am a little bit puzzled. Is it not in dplyr? I don't understand what's going on. Okay, my sister Emily suggests that it could still be in development. I'm not going to use it this time if it's still in development. Oh yeah, okay, it looks like some commenters are noting it's in 1.0.0, which is not on CRAN yet. I'm not going to solve it right now; too much could go wrong. I'm just going to do it the simple way. What across does is make this a lot more compact when doing age and height and weight. So that's on me for trying something new. Runners, which I know more about than bicycling, have a saying: never do anything new on race day. But I tried to do something new on a live screencast.

All right, so what am I going to do? ggplot: let's look at the winner age by year. That's good for geom_line, and I'm going to add expand_limits with y = 0. Oh, I want decade, not year. Has the average age been changing over time? It doesn't really look like it has. Let's see, it's kind of like maybe it used to be 25 and then it was 30, but we only have a handful, up to ten, within each decade, so it probably is not changing. So let's try winner height; we
didn't have heights in the early data. Not really changing; that makes some sense. What if I look at winner weight? Not all of this necessarily needs to be on a zero axis. This is weight in kilograms. Yeah, we're not really seeing any trends, but it was worth checking out. I'm going to leave in winner age, which was the one with the most data. We could have used a box plot for this as well; we could have used a lot of things.

All right, that was taking a quick look at some data on the winners. Was there anything else in the winners I wanted to look at? The birth country; I could look at the nationality, but I don't think I'm going to. Born is really only interesting to me with regard to the age, which I already have. Oh, time margin. Okay, yeah, let's look at this. Oh man, in the first two races they won by hours. Let's try this out. I'm going to say winner_margin is the mean of, what's the name of the column, time_margin. Someone is suggesting we can actually do summarize_at; I'm not going to, I'm going to keep doing it this way. Oh, look at that, that looks pretty meaningful to me: the races are basically getting closer. But notice the data is not meaningful in the first data point; it's in the 1910s that it starts being relevant. Wow, look at that: Tour de France races have been getting closer.

I'm going to save this and call it by_decade; that's what we're creating. Then x is decade, y is the average time margin, the average margin by which the winner wins, in hours. I'm going to add a filter there: decade must be at least 1910, because I think that first one is only two data points; that was too
little data. This kind of zooms in and says, okay, in hours, they got a lot closer. Heck, at that point I could multiply it by 60 and do minutes: "Average margin of winner". Make this a little shorter and say "Tour de France races have been getting closer." That's a cool observation. I've waited altogether too long, did anyone catch that? I didn't set my theme. So, Tour de France races have been getting closer: in the early 1900s the winner would win by thirty-five to fifty-five minutes; ever since about the 1930s they win by less than twenty, then less than ten, and in the most recent decade they're winning by a couple of minutes on average. That's pretty cool.

Let's look at what other people have suggested. I got a question from Thomas Mock, who actually curates the Tidy Tuesday datasets, which is fantastic. Thomas's suggestion is to look at the time of the winner. So winner_time is the mean of time_overall. Now I'm going to take a look, by decade, at the winner's time. Okay, I don't want to use time, I want to use the overall speed, because it looks like the distances have been changing. Let me take a quick look to remind myself what's going on. Distance, I'm guessing, is in kilometers, and time is in hours. Do we have a speed? We don't. I'm going to throw that in at the data cleaning step. Going back up, I'll say speed is distance divided by time_overall, so that's speed in kilometers per hour. Now, instead of winner time, I'll say winner speed. There it is. And if I take a look by decade: ooh, they've been getting faster. Oh wow, they really have been getting faster. Let's make a graph of that: "Winners have been getting faster." Notice I copy-pasted that last graph and I
said "Average speed of winner (kilometers per hour)" by decade. I don't think I need the filter anymore. I need to change this to winner speed. Look at that, that's a great-looking graph. We can see that the average speed early in the century was about 25 kilometers an hour, and it's been going up and up, and it looks like it kind of flattened out in the last three decades. I know there have been things like doping and such, and they're really getting all the speed someone can out of this; it's hovered around 40 kilometers an hour. I do some biking, but what is 40 kilometers? Here's the way I would think of it: a marathon is 42 kilometers, so they are biking almost a marathon every hour, and they're doing that for days. That's a lot of distance. Wow.

I've seen a couple of theories, glancing at the comments. One is that courses could be getting easier. I think that's true. I wonder if there's anything about elevation. There's origin, destination, stage type. No, I don't see anything we can use to control for whether the courses are getting easier; I would have been looking for something like the average elevation. I'm also seeing other suggestions: the bikes are getting better, the training is getting better. It's cool; we've seen that they've been getting faster and closer. Those are the two things that we learned just looking at this winners dataset.

Okay, I think I'm going to move on. When I'm looking at stages, I want to look at the stage data and not the overall winner data. There's an edition column, which I could use to join between these. Is there anything else that people want to
look at just from within the winner data? I'm looking at comments for a second to see if anyone else has suggestions. This is fun; I haven't done this live before and it's really cool, because I'm usually doing this by myself and then I have to wait a day or two. There is a delay, though. Oh, I didn't see how much of a delay there is. All right, we have some ideas. Someone suggests the age of the winner; we did take a look at age of winner and it looked like it was pretty much flat.

Here's a good question from Nick [surname unclear], and sorry if I butchered that: can we see the average life expectancy of Tour de France winners? That's a cool question. One thing I just thought about is that we'll need to do a distinct by person. I think it's a great question because we have the birth and the death dates. So if I look at tdf_winners... it just hit me: the problem with this is that we have censored data. How many have even died? Let's find out: filter for not is.na of died, and before that I need a distinct on winner name with .keep_all = TRUE. There are 38 who have died out of a total of 63. So if we wanted to see the average life expectancy of a winner, we would have to do survival analysis. I might do that, but I might want to wait on it. I wonder, do I have that for all the competitors? I do not have it for all the competitors.

All right, on to survival. What is the problem with just asking about life expectancy? The problem is that there are Tour de France winners who are still alive. In fact, for the ones who have died, because the race has only been running for about a century, they are
disproportionately likely to be the ones who died a little bit younger. So to say what the average life expectancy is, I'd really need to treat this as right-censored data: we don't know how long all of them will live, only the ones who have already died. How can I look at that? Why don't I show a survival analysis really quickly; I think it's a cool idea. Survival analysis is really important. It's widely used in medical applications, where a lot of studies look at time to event, something like death, or cancer remission, or cancer returning.

So what I'm going to do here is a distinct by winner name first. birth_year is the year of born, death_year is the year of died, and I'm going to transmute this. Why transmute? I want to show each thing I'm creating. I'm also going to add dead. Let me remind myself how survival works: the second argument is the event, where 0 is alive and 1 is dead. So I add one called dead, which is not is.na of death_year. The people back here are dead, but some of the people later will not be. I don't know if this matters, but I'm making it an integer.

All right, this is the data we use for survival analysis. We have when someone was born and when they died. Notice this person was born in 1928 and we don't have their death date; that means they are at least, let me do the math, 92 years old. Something I need to do here is make death_year the coalesce of death_year with 2020, which means if it's missing, replace it with the year 2020. That's not saying they died in 2020; it's saying they're not dead, and the last data we have is that they were alive in 2020. That's for the sixty-three distinct winners. So that's death_year, and
now I can apply survfit from the survival package. Actually, first I want age_at_death, which is death_year minus birth_year. Wow, this person died at 35, that one at 28; they could easily have died in World War I, this is 1915, 1917, 1917. Or for that matter in the flu pandemic, no, that was in 1918. All right, so I have coalesced; this is the age. What I do then is say survfit, Surv of the age at death and then the status, are they dead or not, explained by, right now, nothing (I'm not doing any cohorts here), and then data equals this. That was a survival analysis, and I can plot a survival curve. I can do it using broom, a couple of ways. This is what the lifespan of a Tour de France winner looks like. Among the data we have, they're all dead by their late 90s, and the median would be right here. I'm trying to remember how to actually get the median out of this. I'm going to save this quickly as a survival model; this is the model. Oh, there it is, great. If I load up broom, I can do this in ggplot2. I'm not going to today, because survival analysis isn't what we were going for, but I want to show you that we can do glance on this and grab the median right out.

So the median life expectancy of a Tour de France winner is 77, and that, by the way, is pretty damn similar to the overall life expectancy. So it looks like it's not unusually high and not unusually low. It is worth noting that this was over the course of a century; we could cohort this by whether they won in the first half or the second half, but I'm not going to go that far. Okay, so that was looking at
the question of life expectancy, and I got a little bit of survival analysis in. Cool. I'm looking through the questions we have, and I like one idea from Kanishka: maybe you can do a gganimate where you have a race between the countries of origin over time. I like that a lot. Maybe I'll bring it in later; I'll put in gganimate late in the game, as a bonus at the end. I'm going to start by looking at the stage data.

Okay, stage_data. Two hundred fifty-five thousand data points, wow. Look at that: we've got a distribution of ages, so much data in here. You know what's bothering me here? I have a time; do we have a distance? Oh, there it is, in tdf_stages. So we have stage_data and we have tdf_stages. I'm also going to use janitor to clean these: the janitor package's clean_names. I just want to make all the column titles lowercase. I have stage, date, and so on.

Now I want to join these two together: stage_data and tdf_stages. In stage_data there's stage_results_id, with stage 1, 2, 3, 4, 5. I'm going to need to pull out the stage to join these two tables; they're not quite set up to join yet. So I'll take tdf_stages and add a year, up at the top: year is the year of date. That's pretty good, there's my year. The other thing, in stage_data, is that I need to combine the year with the stage_results_id. I'm going to use separate. Who's seen separate before? That's a rhetorical question. What I'll do is I'll say
separate... no, I said separate, but what I actually need is extract, because I'm going to say: take stage_results_id and pull out the stage based on a regular expression, "stage-" followed by some number of digits. I just pull out the digits, with convert = TRUE. What's great about that: look, convert = TRUE even turned it right into an integer. So just like that I can now join this with tdf_stages, let's find out if I did this right, by two things: the year and the stage. Nuts. Hmm, in tdf_stages, stage is a character vector. I'm not crazy about that. I want to say stage is as.integer of stage. Really? Let's find out what went wrong. Ooh, 1a, 1b: it's not always an integer. Okay, now I know. So I'm not going to convert; I'm going to grab the stage as anything. Filter, let's see, did I end up with any is.na of stage? It looks like some stages are partial. Okay, great, so this matches all of them, and here are my b's. So it's not always an integer; important to know. I don't want as.integer anymore. A bit of data cleaning, here we go, and now I've joined it. I know I lost a little bit of data; I'm not even going to pay attention to what I lost. It was about 255,000 and now it's a little bit lower.

So I'll call this stages_joined, and now we actually know the distance and the time. Oh, well, this is going to be frustrating: 13 seconds? That doesn't make any sense. It says time is a double; that doesn't look like a double to me. Let's look at stage_data. Something's up here: it says time is a double, but it's actually not; it looks like it has an "S" in it. Time is zero seconds? That's a little weird. Two seconds. I'm a
little puzzled by this. I was really hoping to use the time. I think this is a parsing issue. I'm going to guess it should have had things like minutes before. Did I lose anything in here? Was this different when I first looked at it? Anyone have a suggestion? I think that when this got parsed, the time got pulled out incorrectly and only ended up with the seconds, not the minutes, hours, et cetera, which means we can't look at it. This one, elapsed, is stored as a lubridate period, but it looks like it's also a character. Okay, so I'm running into a data cleaning issue; totally happens. I'm not going to use time in this one.

All right, what if I based it on the rank instead? I have way more data than I used to have. I can't use time, but I can use their rank, like did they win or not. What could I do with that? One of these things is a little bit silly. I'm going to take stages_joined, that's what I'm calling it, and group by, do we have a nationality or country? Yes, I do. It's probably not their birth country; it's probably the country they represent. Then summarize the number of stages, ooh look at that, that's fun, but also average rank, or I'll call it median rank. Oh, what's with my rank? "DNF", did not finish, and everything else looks numeric. I'm going to leave those as NA. So I'll mutate rank to as.integer of rank, which leaves them as NAs from coercion, and add na.rm = TRUE. So look at the median, and arrange descending by the number of stages we have data points on. So what am I looking at here? The other thing I want to do
actually is, I don't know how many are competing in each stage. How can I find that out? I can count the year and the stage; this is the number of racers in a stage. A lot of stages have a hundred, and none of them has much more than 200. It probably has something to do with change over time: it looks like in some of the early ones there were 30-some, and in this one there were 88. Okay, so some get eliminated as they go; does that sound right? It's a multi-stage race. Here we go, by year, edition, stage. Yeah, I bet there are people who drop out. You can see it was like thirty-seven, thirty... look at it, it's mostly going down within the course of a year, not perfectly, but yeah.

So why am I checking that? Because what I want to do is a mutate. I want to add_count on year and stage, with the name "competitors", so competitors will have the number of people in that stage in that year. It won't have the number that finish, which is actually what I want, now that I think about it. So I'm going to group by these and mutate finishers to be the sum of not is.na of rank. I did that because I want to say percent rank is rank divided by the number of finishers. Why do that? Because now we have, for the person who won, that they were at the top, and I'm doing one minus that so that a better finish gives a higher number. I'm going to call this percentile. Beautiful. Now I can say this is the 97th percentile, this is the 59th percentile, and it's going to be uniform within each of these stages. The reason I did that is so I can say median percentile, and now it's meaningful where median rank was not. And I'm going to filter just for stage 1, because I
think the later ones might not be meaningful. One thing that's telling me is that no country does much better than average within the first stage; they're all at a median percentile of around fifty percent. So it's not like there's one country that is dominant within the first stage. Huh, I wouldn't necessarily have expected that. Okay, so that's a no.

I see a really good suggestion from Eric [surname unclear]: does rank in the first stage tell us anything about final ranking? That's a great question. How could I find the final ranking? I can find their ranking in the last stage, but how can I get the final ranking, especially because I don't have a time? Is there something cumulative here? Points. Presumably it's the total points. So I'm going to say total_points is stages_joined, grouped by the year, the stage, and the rider, summarize points as the sum of points, and my guess is that whoever has the most points wins. Then, watch this, I'm not going to ungroup first. After the summarize it's still grouped by year and by stage. Oops, I probably need na.rm = TRUE, to say that if they don't have any points, this person had zero points. Then I can ask what their final ranking was. It's going to be a percentile, I guess, to account for the people who drop out. So I'll say final rank is rank of points, and notice it's still grouped. Oh, I'll turn that into percent_rank; percent_rank is a function that does that in one step, so it will be between zero and one. So this person had a high final rank, this person was in the seventh percentile, then the
91st percentile, and so on. So now we have a final rank. Oops, this is by year and stage; I did not want stage, I just wanted it by year. Much better. So I'm going to guess the winner this year was this Hippolyte. Oh, it's really helpful to have people who know about this. Oh my god, I'm going to keep saying names and keep botching them; please, everyone, forgive me. I'm just going to use these people's first names, because I have a little bit less chance of embarrassing myself.

Okay: the total points do not count for the final ranking, it's for another jersey. I'm still going to use the total points, because I don't know how else to tell who's winning. Can anyone tell me how the final winner is determined? I'm going to just use the total points as a metric of their performance and try predicting that. I don't know how to say what someone's final rank was. Does their rank on the final stage mean...? No, I don't think it does, because I don't think it's cumulative. All right, then I'm going to just go with this. It's going great, it's going fantastic.

So the question I'm currently asking is: does the performance in the first stage predict the final points ranking? Here are the total points within each year and rider. I can take stages_joined, still filter for the first stage, and inner join with total_points by year and rider. Oh, I see: there's already a points column. I should call this one total_points and call this one points_rank. That's right, look at me go. All right, then join by year and rider, and I have year, rider,
points rank. Cool. And year, rider, points rank, total points, stage. So we had points before; that's actually the points in the first stage. So I'm actually going to use the rank in the first one: rank first stage equals rank. Oh, that's kind of cool. I notice that in this case, in 1903, the person who won the first stage also had the most points by the end of the race.

All right, so I'm actually going to take a look at this and say, overall, of the 13,000 people, how does their rank in the first stage predict their overall points rank? Ha, this looks terrible. What a lousy graph. The problem here is that there are way too many points; that's the first issue. What I'm going to do here is add a little bit of transparency. And rank first stage... oh, I see. One of the problems here is I used rank; I didn't use the percentile rank in the first stage. It was actually reversed, and it's kind of like it's up to a hundred. Remember, it was kind of a weird measure. Percentile, where was my one minus... the finishers, stages joined, percentile. That was that percentile. Percentile first stage. [unclear] Oh, here we go. And now what I can say is geom_smooth, method equals lm. Sure, lm, linear model.

All right, yeah. So one thing is, it doesn't look like it from the scatter plot, but there are so many points, which means a scatter plot is probably not the way to look at this. Ready for a better graph? Here's what I'm going to do. I'm going to say percentile first stage bin: I'm going to use cut. That's a useful base R function. I think it's sort of similar to ntile in dplyr, but cut gives them useful names. Percentile first stage, cut them into seq 0 to 1 by 0.1, and make it a box plot. Let's see if I did this right. All right, that looks pretty good. I need to add one extra thing. I'm going to add theme, axis text x is element
text. I need to... I'm trying to actually rotate the axis labels. Did this work? Great. And that NA is very annoying, so I'm going to quickly throw in include.lowest equals TRUE. Did that get rid of it? No. I'm going to say not is.na of percentile. Hmm, sure, I just want to get rid of those NAs.

All right, so the performance in the first stage does predict the overall performance, [unclear] if they're really anywhere in the bottom percentile of performance. I'm going to say labs: decile performance in the first stage, overall points percentile. And throw in one more thing, a scale_y_continuous, labels equals scales percent. Look at that.

All right, so the thing that I did there was, I saw that there was a relationship, but the scatter plot was way too messy. Here what I'm seeing is, overall, if you ended up in the bottom half in the first stage, your median was that you'd end up with actually zero points, at the absolute bottom of the overall points percentile. You basically have to be in the top half of the first stage to have a solid chance of... there are people who finish near the top even if they were near the bottom at the start. It looks like it happens. But mostly there absolutely is a relationship. If you were in the top decile in the first stage, most of the time you'll be in the top quartile over here. So we see the median person who's in the top decile ends up at about the 87th percentile of points. So that's certainly not surprising. As far as that goes, you can say you can predict someone's overall performance from their first performance.

All right, I've got one last thing I'm going to do, and I think I'm going to do a gganimate. So library gganimate. We have a couple of choices of what we can go for here. What do I have? Yeah, I have a thought. My thought is, this is literally a race, so I actually want to
show one of these races. I'm going to show the most recent race, and I want to animate how the riders compared. So what I'm going to do is filter for year is... no, of course it's not 2020 in here, 2020 hasn't happened. What's the last year we have here? Max year. All right, the last we have is 2017. I'm going to show the 2017 Tour de France, but only for some of the top players in that year.

So what I'm going to do is actually look at the total points, filter for year is max year, top_n by total points, the top ten. Here are the top ten 2017 players. Okay, and what I'm going to do is actually say, how did they do over the course of that race? So what I do is I take this and then I semi_join on top ten by the rider. Fantastic, by the rider. So now I have all the data across all the stages for those top ten.

I also want to say stage is a factor of stage... I wonder how these end up ordered. [unclear] No, I'm just going to do as.integer, because it's fine for 2017.

Now I'm going to show how this occurred across stages. What I could do is a graph where I said stage, rider... I haven't actually... there are no points here. What I'm going to do is group_by rider, cumulative points equals cumsum of the points, na.rm equals TRUE. Nope, doesn't work. So I need to say mutate points equals coalesce points and zero. So now I say, here's the cumulative points so far, as the cumulative sum of points within each rider group.

Now I say geom_col, and [unclear] if I facet by stage that's going to be too big a graph. I just want to quickly give a sense of what this does. Points, rider, geom_col. Yeah, what this is doing is... oh, I don't want points, I want cumulative points. So what this is doing, actually, is it's showing... here we go. Yeah, it will be
showing, like, here's how the cumulative points go within each... why does Marcel go to zero at some point? Oh, I see why: because I didn't have it arranged correctly before I did this. I have to arrange by stage, and stage is now an integer, so I think that'll be better. Nope. What I'm looking for is... I'm just double-checking the... oh, it looks like this person might have dropped out. Kittel Marcel was actually in the lead, and then the data just disappears for 18, 19, 20, 21. Maybe they dropped out. Hmm, okay, I'll skip that one.

But yeah, that's looking at it by stage, but I'm going to change it to be an animation. So what I'm going to do is... here we go, what did I do with gganimate? I'm going to say transition_time by stage. I'm just thinking for a second... What I'm going to then do is say labs, title equals... I think it's like this. I want to try and include the time. I remember it's like frame time, something like that. Time frame? [unclear] Here we go, I create an animation of the 20... look at that, look at it go! Here they go. Kittel Marcel is winning. No, who's winning? [unclear] is winning. And there they go. Left the race. They left the race even though they were in the lead. That's pretty interesting.

So then we can see, did this frame time work? Yep, that's the one. 2017 Tour de France. And then I'll throw in a couple of axes: cumulative points, the stage. I don't need to say rider. It's like, that's stage seven, eight, nine. Watch them, look at them catch up.

The one thing I'm going to... I recognize people really want this race to be like, the order changing, that kind of thing, and I'd have to set that up. I actually have made a graph like that, and... oh wow, this is so helpful. It's like having a research army. Someone points out that Marcel Kittel abandoned the 2017 Tour de France after a stage 17 crash. Oh, what a bummer, he was winning. But they
were going 48... they're going 40 kilometers an hour, as we said. That's fast. Yeah.

All right, so the last thing we're going to do is allow them to reorder as we go. I think this is going to work. Let's find out. What I'm going to do is say library tidytext, and what I'm going to use is reorder_within. It's kind of like faceting. I've done this before with faceting, but I don't think with an animation. So reorder the rider by cumulative points within stage. And now the rider, cumulative points, stage... and oh yeah, they had a scale. I think it's scale_y_reordered. Yes, it'll be scale_y_reordered in this one. Let's find out how this looks. Let's also add fill equals... nope, that didn't work. Did it do reorder_within? Oh, because I didn't save that. Let's see how this looks. Interesting that it takes longer now. I wonder why it takes longer. [unclear] I'm almost done here. And fill is cumulative points. I wonder if it's because I added the fill. Look at that person, [unclear] to drop out.

So what I'm trying to do is, I want these people to be reordering, and they're not going to be smoothly reordering. I'm sure there's a way to do that, but I actually don't know what it is. They're not going to be shifting upward and downward, but they are going to be... you can hear my computer is kind of going full speed. Here it goes, here it goes, here it goes. Oh no. Oh no! How did that happen? That looks terrible. Okay, the problem is, even though I did scale_x_reordered, scale_y_reordered, it removed these but it left them all being redundant. That did not work. I wonder if I'm going to remove... oh, that explains why it took so long. I'm going to remove this line.

Okay, so the answer is, [unclear] I do not know how to make the race, and that's our show. Because the... the crash. RStudio. Of
course, it wouldn't be a live session if I didn't... yes, close this... it would not be a live session if I didn't crash RStudio. There's a question of what kind of laptop this is. It's a MacBook Pro 2015, the greatest laptop of all time. I got it in 2017, so it's been going for a while. That's probably why it's taking a while. [unclear] I changed this, and you rerun the total points, and remove the reorder_within. Big fan of my MacBook Pro 2015. [unclear] the crash right near the end. All right, this is the best version I'm going to get of this graph. I would love to hear from people, not now when we're finishing up, but later, how I would get these to reorder as the graph changes. I do not know how to do this.

All right, so in conclusion, that was our screencast. It's been really fun looking at the comments as I'm going, and looking at them on a separate computer. Hopefully it's still going to be viewable for people that weren't watching along.

So what we learned: we did a couple of bar plots. We learned about the most common winners. We looked at some changes over time. We saw the ones we were really excited about. Something like age didn't really show a trend, but for some of them, they're getting closer and they're getting faster almost linearly, but kind of flattening off. They look like they flatten off in the sixties to... someone told me that the 70s were the age when steroids started. For the life expectancy, we used a little bit of survival analysis to find the life expectancy was about seventy-seven, using right-censored data. And then we joined together stage data. I didn't really find a trend from the countries. And we saw that the winner of the first stage did, well, generally win. And then we made an animation.

All right, so that was an absolute blast. I'd love to do... I'm probably not
going to do a live screencast every week, but I'd love to have a chance to in the future. Thank you so much for tuning in. I had a great time, hope you did too. See you next time.
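
A minimal sketch of the cumulative-points step from the animation segment, written out in R. The toy data, the rider names, and the column names (stage_points, cumulative_points, position) are assumptions made up for illustration; they are not the actual Tidy Tuesday data or the exact code typed in the screencast. It shows the fix described in the captions: cumsum() has no na.rm argument, so missing points are filled with coalesce() first, and rows are arranged by stage before the running total is taken. The per-stage position column is only a guess at a starting point for the reordering question left open at the end.

```r
library(dplyr)

# Hypothetical toy data standing in for the stage results
# (rider names and point values are invented for illustration).
stage_points <- tibble(
  rider  = rep(c("Rider A", "Rider B", "Rider C"), each = 4),
  stage  = rep(c("1", "2", "3", "4"), times = 3),
  points = c(20, NA, 15, 10,
             10, 30, NA, NA,
             NA,  5, 25, 20)
)

cumulative <- stage_points %>%
  mutate(
    stage  = as.integer(stage),    # fine when every stage label is a plain number
    points = coalesce(points, 0)   # cumsum() has no na.rm, so fill NAs with 0 first
  ) %>%
  arrange(stage) %>%               # order rows before taking the running total
  group_by(rider) %>%
  mutate(cumulative_points = cumsum(points)) %>%
  group_by(stage) %>%
  # Position within each stage: 1 = most cumulative points so far.
  mutate(position = min_rank(desc(cumulative_points))) %>%
  ungroup()

cumulative %>%
  filter(stage == 4) %>%
  arrange(position)
```

A data frame like this could then be passed to ggplot2 with geom_col() and gganimate's transition_time(stage), as in the screencast. Whether plotting against position gives a smoothly reordering race is left untested here.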
Info
Channel: David Robinson
Views: 3,417
Rating: 5 out of 5
Id: vT-DElIaKtE
Length: 63min 53sec (3833 seconds)
Published: Tue Apr 07 2020