Data Analyst Portfolio Project (Exploratory Data Analysis With Python Pandas)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello and welcome to our second data analytics portfolio project here on the channel if you're a data analyst or want to get into the field of data analytics it's often pretty good to have a portfolio of different projects now last month I did a project that was SQL heavy we've pulled some data online from baseball stats of raise pitchers and threw it into the Excel made a few updates on the sheet and then put it into a database through a few different tables and ran a bunch of different queries on that side of things well I wanted to go to the opposite Spectrum today and we're going to be doing a python Project based off the exploratory data analysis now where we're going to be grabbing this data from is kaggle and it deals with actually one of my favorite hobbies out there which is ultra marathons which sounds crazy and it is so for some background information an ultra marathon is anything that is farther than a marathon now the smallest distance is a 50k but it can go all the way up to 200 mile now the first I ran was a 100K and I did that around ucf's campus kind of as like a funny graduation uh because my graduation got canceled I ran a 50 miler before in the Everglades in a few other 50k races including one earlier this year but I have a 50 miler coming up within a week and I thought it'd be really fun to build out a project uh based around some data on it so I was able to Source a data set with years of ultra marathon races so I wanted to run some Eda analysis in there and then also I wanted to see if I was actually in the data set so let's go through this whole process and before we jump into this portfolio project if you guys can please subscribe to the channel since it does take quite a while to build out these projects and subscribing is 100 for free it helps grow the YouTube channel and allows other people to access the videos that we make here there's also tons of resources here on the channel I make Python and SQL videos weekly so that way you guys can grasp the concepts and then Implement them into these projects all right let's go on my computer and let's first grab this data set okay so I have a jupyter notebook loaded over here uh first I have this link to the data set I will put it down below in the description so that way you guys can access it but just to show you what you need to do and the data set itself so I loaded this up already in Google Chrome so the big data set of ultra marathon running collection over 7 million race records registered between 1798 and 2022. uh definitely interesting the 1798 side of things that's pretty cool but regardless to show you guys how you can do this just go over here and click download and then you're going to have this ZIP file so once this loads I'll show you what to do next so now that I have this over here all you have to do is right click it and extract here and then we have this file it says two centuries of um races so what I'm going to do is rename this file this is a CSV but when you import this in it's probably easier just to label that so now you can see on at least my computer it turns this into something could be opened up in Microsoft Excel so we're going to go back over here to our Jupiter notebook I'm just going to open in a new tab just go to overhome and then just click upload and then make sure to upload that specific file now I've already uploaded that file in here so you can go over here to last modified right and you can see this over here so two days ago when I first started this project uh two centuries of ultramarathonraces.csv so make sure that is in here just again upload it and then you are ready to go and with that imported let's start this video so I'm just gonna build out a bunch of cells and all I'm doing is pressing this plus icon because we're going to be coding quite a lot in this video and I'm marking up what we're working on specifically so first thing that we always want to do when you're going to be working in a Jupiter notebook is import libraries that you need so just put import libraries like this now I'm going to be using only two in this video so the first one we're gonna be importing is pandas so import pandas as PD shift enter and the other one that I use in this video is going to be import Seaborn as S and S and I have a full video on Seaborn as well it allows me to show different graphs and charts pretty easily on here uh but if you don't have Seaborn installed it on your machine just pip install Seaborn very very simple okay so now we want to create our data frame based off of the data on the ultra marathons so to do that we're going to put DF which is going to be our data frame equals PD dot read underscore CSV and then we're going to put in that file name which is two centuries of um erases and I have already done this but I'm just going to copy and paste this over here so I don't make any mistakes typing this and then our data frame once this loads up over here it should be this CSV and I'm just going to make a note on here um see the data that's been given it's been imported right and we already get an error over here the type warning columns 11 have mixed types specify D type so there's already going to be a lot of issues uh with this data set and I already know there is um it's it's what seven million records so nothing is going to be perfect with that size um but just to show you what this specifically looks like we can start off with a head so DF dot head 10 and that will show us the first 10 rows associated with it we can shift and enter and this is the breakdown right so you have year of event and these are all going to be 2018 over here we have event dates now this is European style so if you're not familiar with European style uh the months and days are switched so technically this is the sixth day of the month this is the month and then this is the year which gets interesting when you're going to perform some analysis and we're going to do this a little bit later in the video kind of how we transform this event date to find some pretty cool data on seasons for Ultra marathons if you don't know running in summer is a lot tougher than running in the winter and I say that with them up around 50 miles and 95 degree weather okay we have event distance and length so we have 50 km over here but the thing that's interesting about Ultras is you have a lot of different types of races you have like 24 hour reasons you have 12 hour races you have 50 mile events uh so this is going to be all over the place it's gonna be tough to do some calculations with this uh so we're gonna have to edit this down quite a bit number of finishers for the race Athlete Performance so you can see like we have an H at the end and don't like that specifically um but that's been in there at the club at the country uh at the year of birth at the gender category um average speed and ID also um this is interesting age category we have M and probably f as well on here and it's all over the place this will need to be cleaned up if we're going to perform any analysis as well um this is kind of nice right the average speed which I assume is going to be in like kilometers because if you look at this over here 10.286 right uh four hours and 51 for a 50k so I assume like pretty much more than 10 kilometers per hour across the board um so not super descriptive on that side of things and then athlete ID I assume this is a unique identifier for each athlete it's going to be within this data set um so there's probably hundreds of thousands of different athlete IDs associated with it but this is kind of how it's built out at a higher level now we can also just do a shape so the F dot shape and this will tell us you know we have seven million records and then we have 13 columns associated with it and then we can take a look at the d-types so we got an error earlier right that there's the types that are just weird in this so we can do D types and then you can see how everything's associated with it um so like your events int event dates object object int object object object float object object object and then we even enter over here so I think the first thing that we should do since this is seven million records is cleaning up the data so we're going to say we're going to clean up data over here and what I want to do first is I just want to show a USA events 50k and 50 miles within 2020. so I had a 50k race and a 50 miler in 2020 that were official races beyond my 100K so I want to see my own data within this resource and I think the first way we should do that is just kind of you know go from seven million records to whatever it ends up being so we'll just put over here only once USA races 50k or 50 miles then 2020. all right so I think the first thing that we should do is make sure what 50k and 50 miles shows up as and that's under this event uh distance length so step one show 50 mile or 50. okay so it's kind of interesting on here so we already know what 50k is right it's gonna be this so I'll just put over here it's gonna be 50 kilometers uh but we don't know what 50 miles is so I'm just gonna grab this and we're gonna start testing right so what I'm gonna do is EDF dot query and what I want to see well actually we're not going to do a query yet um we'll do that a little bit later once we fix up some of these because there's spaces in here and it's kind of annoying um so we're just gonna go like this data frame put brackets in here everyone can go inside and data frame and then we're going to grab that column which actually we already have it over here event distance length and the first thing that we're going to try on here is equals and then we're going to put over here 50 km just to make sure that this works properly and you can see 50 kilometers is working so that's good right we have a bunch of different stuff over here uh next thing I'm going to check is just if something's been inputted as 50k because that's a very familiar um what it's like most of the time at races but you can see there's nothing for 50k so it's always going to be 50 km within this data set uh next thing is going to be miles so usually it's going to be labeled as 50 m but it could also be 50 Mi so let's try 50m there is nothing in there so my assumption is going to be 50 Mi and then you can see 50 Mi is showing so we'll put that as a note right we need to put over here 50 Mi so that way it is filtered out like that so what I'm gonna do is now combine both of these and I'm going to do that with is in so combine 50k and 50 miles which is in and this is just a show that I know quite a few different things with python I could just do an or statement in here but is in is way more efficient so again portfolio project show off your different skills so what we're going to do is go back over here I'm gonna honestly just copy a lot of this because I don't want to type it out again put that over here right the F event distance is that and then what we're going to do instead is put dot is in and then we'll have this in here we'll have that in there and then copy that and then we're also going to put 50 km comma and this should work right let's move all of that so bracket bracket bracket there I think you should be good with brackets right and that alone has dropped us down to under 2 million records so it's a start but we're not in the best area yet and let's now move on and start adding in 2020 only so what we're going to do is continue with this now I'm going to put an and in here and that way we can start looking at 2020 but one thing I should mention is when you work in data frames like this and you start with ands you need to wrap these in parentheses like this so wrap those together right and then we have and and this time we're just going to do DF I'm going to copy that and say equals 2020. now we have to change specifically what that is because we're going to grab a year of events which I'm just going to grab this like that so make sure I don't have any issues and put that in here and like on this side that wrap that inside over here it's annoying bit of pandas this is very ugly I agree but now this will just show 2020 a bit so boom right we can see over here your event 2020 and look at that we are now down to 63 000 of 489 which is a lot nicer to work with uh than that initial seven mil and we still have to combine it with USA events right so we need to look at specific USA ones and one thing I noticed on here is we have the countries that are associated with event name so you can see like tpe tpe tpe then we have po which I assume is Poland I actually don't know what tpe is so I'm gonna look that up right Taiwan actually I should have known that one but but regardless each of these races has something here at the end so just to confirm this Theory I'm gonna look up a race that I ran and that is the Everglades Ultra and that's space out of Florida so I'm hoping that doesn't have FL in here either this would just have USA so let's take a look at that so DF right DF and then inside over here we're gonna have to put a vent name like that so hopefully no issues we have event name and we're going to say that's equal to and then Everglades 50 mile Ultra run USA like this okay and see if it shows up like that and the race is here right and obviously we have other years because I didn't assign this back to the data frame yet the 2020 um but this is working we see USA here at the end good okay so now we need to figure out how we can pull this USA section and then filter by it because essentially we just want those results so what I'm going to do is something like this I'm going to put our event name once again I'm just going to put a dot Str dot split and I'm going to put in here and what I'm hoping this does let's split this up so we have Everglades here and then we have USA but I think that's nice is all the usas are going to show like that then if we just add in one extra line and Dot Str dot get one it'll just show USA and we can add that into what we have over here essentially and start filtering out things so but I'm gonna go one step farther I'm gonna do this all over again and watch this so I'm not going to put that get yet but I'm also going to split at the very end so now we just have USA here and then since it's in this first spot over here I'm just gonna put this in the first spot is going to be a zero this is just how python works and you can see now we just have USA again super super ugly but it works right which is good now we don't need this Everglades anymore but I'll show you what we can do so we have this and here I'm gonna remove a lot of this but we can have all that in here we can put this equal to USA and essentially if I coded this correctly this will show all the different events that took place in the United States and about 30 seconds later we have all these events and again tons and tons of Records but you can see like Yankee Springs is now over here at West Virginia trilogy and it's good right USA so now let's add in this additional filter so we'll say combine all the filters together okay so let's go back over here and copy this ugly code and add in one more over here so this time we're just going to copy it inside of here we're going to put this that and we have all this so we have event length distance 50k over 50 miles we have a year event is 2020 and then we have the string split to get USA so this will give us our results for USA events 2020 that are 50k or 50 miles shift and enter assuming that nothing is wrong coding it'll take a second and there we go so our total now is 26 000 rows and again way way nicer so I want to just re-label this now I'm just going to copy this over here and I'm going to say that this is going to be data frame 2. and is it ideal to name it df2 no absolutely not but in this case I'm just going to say df2 equals that and then if we just do a df2 . head let's say like 10 in here this has to run still um before this can but you'll see once this loads then you can see this is working correctly so perfect and just to run the shape on here df2 dot shape 26 and 13. so now let's continue to work on this and clean up data so next thing I want to do is remove this USA from the event name in my opinion it's kind of ugly since all these records are going to be in here anyways so I'm just going to say remove USA from event name that okay so all we have to do on this one so we'll put df2 put event name and we're going to rename all these columns too so we can start running some queries on it don't like spaces on columns really annoying but regardless and we'll do Str splits and then do that in there we got it through this also which I guess I can just put that here don't get what's that Str and get TR I get and then I think we're gonna throw a zero in there and that just has this side of things so again you want to have this correct just say it's equal to it and that'll override it so probably an issue on here but this does overwrite it and if you just want to see the F2 the head uh USA is now gone so great so the next thing we're going to do is clean up the athlete age so I'm just making up at the age so we already have a column for male and female but we have like M23 or mu-23 M35 m40. pretty ugly so what I'm gonna do and this isn't a hundred percent the correct assumption but I'm gonna take 2020 minus athlete birth year uh to get the age on here and again there's going to be exceptions on this but that's what I'm going to do on this example uh because this is a category and essentially what happens in some races is you'll have like a men let's say for example under 35 category and you're going to be automatically assigned that I like this one M under 23 so this person was born in 1999 this race took place in 2020 and since I don't have the exact birthday of the individual Runner the athlete ID I'm sure there's another table associated with Somewhere Out There it just very tough to do but all these races are 20 20. so I'm going to take 2020 minus this athlete year of birth and that's going to be uh the new uh athlete like kind of like age category on here so what I'm going to do is have this we'll say df2 and let's start with the correct proper naming and we're going to see athletes age and that's going to be equal to 2020 minus the F2 and we're just going to copy this but it's now going to be at the year of birth and because I want this to show up as an inch I want to say as type int on here and I believe this should work and it does not work cannot convert non-finite values Nar INF to integer so that did not work assume then we're just gonna have to think if you just put this on here for now that'll work and it did work which is nice okay and then the next thing I want to do is remove this each um from Athlete Performance so remove d and the way that you do it is kind of like as above where we did this string split so I'm just going to copy this over here and then I'm going to grab is at the performance this should be it um I'm just gonna die here for now well the reason why that's not working is we have that um but that does work here so just throw that over and boom that has been put in there and then just put the F2 then just put the F2 dot head and just to show you both those are working now hour is gone from healthy performance and um we have out the H on this side of things so now we can start dropping some columns so drop columns so what I want to drop first I don't care about this athlete Club so we'll say athletes Club I think country I don't care about as well athletes country at the year of birth is we already have essentially what we need on that the year of foreign and at the age category all those can go so let's drop those specific columns so all you have to do is DF to equals df2 dot drop and then let's put all those in here and this is going to be a little tedious so sorry but that is oh sometimes goes so just grab all that so athlete Club at the country and again it could just type all these out but then you make a mistake so I'd rather just grab this then we have to label access equals one so it lets it know it's the columns that we're dropping and then I think that worked so df2.head and again way way nicer already you can see how this has been changed so now we should probably clean up any null values that we have and I got an error earlier so let's clean up no values so the first thing that you should do and I'm gonna just add in a lot more cells so there's so much more work to do on here all right so we can do so df2 dot is an A and then dot sum like this and this will essentially tell us um where these nulls are and we have 233 without the age which is why earlier we couldn't do um that as type for an INT so just to see essentially how these are showing up we can just do df2 throw in here again df2 single quotes I'm gonna put athlete age in here input dot is an a we're going to say that's equal to one and they'll just give us examples so it looks like a few different events just you know did not have an athlete age North Carolina fat a Big Bend quite a few for Big Ben over here man against horse so that's a shame um but we can just drop these pretty easy the F2 equals the F2 dot drop n a like that's that will drop those 233 rows now if we just do over here DF two dot shape uh two five eight five seven which is uh 233 less than what we had earlier now we should also check for duplicate values check for dupes and pretty easy too right so all we have to do on this one is df2 and we'll just put over here df2 dot duplicated this that's true shift and enter and I do have an error so what is my error data frame object has no thing for duplicates meant to put duplicated and you can see that there's nothing in here that's been duplicated which is good so now let's reset our index so I'm just going to say reset index and pretty easy you see me do quite a lot on the channel df2 got resets index then we'll put drop equals true so that way you don't get those weird index columns and then now we have zero all the way through two five eight five six we're looking pretty good right um next we're just going to fix a few types now there's a lot of issues in here I'm not gonna lie and this video could be way longer if we really analyze everything in particular with this data set um being truthful to you guys take quite a while to do everything um but we can fix a few of these so the first thing I would like to fix is that athlete age from earlier so easy way to do that is df2 come over here athletes age it's going to be equal to and then just copy that again cut as type and just throw ins in here and that worked awesome and then I want to do the average I think average speed I want to make that into a float so I'm just gonna copy that over here that there and these are just some of the calculations I'm going to do a little bit later in the video so just make sure that you prep that loots in here in network two and then if we just put pf2.d types great it's the end and also float again feel free to clean up any of these as you go um but I won't be using some of these other columns or series a bit later in this video and just to show you guys also this didn't impact um we can just do df2 dot Ed and we have this all over here which is good so now we want to do in and just to show you how this has worked at the age right 29 3921 in comparison the past 8.0 just cleaner in my own personal opinion so now some I've wanted to do for quite a while because couldn't really use Query in here is I want to rename columns and just to grab the call names you can just copy these and put them down below over here and I'm just going to do that real quick through that here and um now what I am renaming The Columns I don't like these spaces in here and I just want everything lowercase just so much easier to code out and you don't make mistakes on that side of things I mean you will but like a lot less mistakes um with accidental capitalization either like the first part or the last part and also I don't like the names of all these um because it's like event event I like what race at least that's what I want to do but df2 and we're going to say this is equal to df2 dot rename then just put over here columns equal and then you're going to be building out this uh this bundle of joy so what I recommend just to do really quick on here is just start copying the easel this we're going to put a colon and then we'll say whatever it is at the end right and then comma so just go through here first and just copy all these or you can just copy my code I'm sure I'm going to put a link to it on GitHub so if you want to save yourself a few minutes you can skip ahead and copy the code um just to show you how this works though let me do these first three over here so for example like your event I'm just gonna literally just put year and essentially what this code will do is it'll go over here from your of events and it'll change that column into a year next we have event dates I'm just gonna name this as race day so race then underscore day so that way we can through that query function and then event name I'm literally just going to put race an underscore name and increase name like that comma and I'm going to rename all these so I'm just going to copy over what I've already done and you guys can see how that is again feel free to pause the video and type all this out or just copy my code shift and enter and all these columns have now been renamed uh so which is really nice again you can just go df2 dot head and boom we have all this over here cool uh the next thing I'm going to do though is rearrange these because hey it looks good for this project but also B I don't like this order so pre-order columns and I'm just going to enable this as df3 so if the F3 equals ef2 and essentially what you're going to do is put the or the desired order in here so let's say for example I wanted to have race day as the first thing on here you could just put race day like that and then boom that'll be the first thing that shows up and again like the one up here I'm not going to waste your time and just put it all here of the code Down Below in GitHub um so I'm just going to copy what I've already done so now this is in here I'll shift enter and just do a df3. head you can see how this is built out now the one thing I did end up deciding is not to put this year column all the race is already in 2020 and we also have this race day in here although it is not formatted properly um and we'll go over that a little bit later because there's a reason why I kept it like this but regardless um we can see everything here versus we had to technically scroll over a little bit on here not a big deal um but again just the constant's choice just to remove that year keep it if you want but I decided to remove it so now here it's gonna be really fun part I'm gonna locate the two races that I did in 2020. and also do you want to caveat there's a lot of other ways that you could clean this up still and this is nowhere near like a final version of what this should be if you're going to really dive deep into the data um so just as you know with that and again there was over seven million records so there's quite a few different directions that you can do you could do different comparisons on types of races like 50k 50 mile 100 mile 200 mile right average speed you can compare races in the US compared to other countries like again so much you can do but one video right so I just want to show you guys what I want to do for a fun project so now that we have the basics of the data cleaned up let's now look into uh the two races that are in in 2020 so Sarasota and also the Everglades so all we have to do first is df3 and inside df3 we're just gonna put over here a race name Rob that's and this needs to be equal to the Everglades one that I had earlier um where is that at here we go it's 50 mile there and I'm just going to put that in here shift enter nothing's showing up so let's see if we have a space here then and we do so here's that now to find exactly where I'm at in here we have a gun time of 919 so if we go down over here right that is me this athlete ID now just as a caveat there's two Ryan Nolans that run Ultras about the same age I'm 22 here there's another one that's 23 as weird as it sounds but he's from Missouri I am from Florida now my results in 2020 show up here but I think it's like in 2018 uh he had some Ultra marathons and for some reason I don't know if it's runner sign up or just another running website they put both of us under the same athlete ID so that should get fixed eventually in the future I have to reach out to them still but regardless if you want to see my other race that I did the ever not the Everglades but the Sarasota one we have 222509 and then what I'm gonna do I can just copy this over here and maybe we'll grab athlete ID we'll say athlete ID equals here and yeah Sarasota Arena under five hours and Everglades under 10 hours but you can see how big of a speed difference now Everglades closer to Summer tougher race and longer distance in comparison to the Sarasota one which is a warm-up for me but lots and lots of sand although I did enjoy that race so those are the two finding them and again athlete age is wrong in this example should be 22 but not in the world okay now let's take a look at some charts and graphs for this data so this is what we're going to be using a Seaborn uh that was a little bit earlier so the first thing I want to see is a histogram plot uh based off of the different race lengths uh between the 50k and 50 miler so it's my assumption that there's gonna be less 50 miles because it's a lot more challenging so we can just put sns.hist plots and just we'll just put in here df3 race length that's and you can see there's quite a lot more 50ks than there are 50 milers now we can also overlay males versus females on this side of things so if we just copy this over here and uh we're just going to do a few different things so first we're going to say our data equals df3 data is the first uh thing on here for the histplot and most of the things for Seaborn two and then we can just put x equals race length like that and then you put a hue which is the overlay and that's categorical so versus numerical and then you can put athletes gender in here and you can see for like 50ks there's a lot more even on here between males and females in comparison to this 50 miler uh which is predominantly male so now I want to take a look at a this plot just to see the distribution of the 50 milers especially when it comes with the average speed uh so we're going to sns.thisplot and normally you could just throw in a specific order and see the distribution is but we're kind of going very specific um with average speed for 50 milers so the way that you can do that as df3 and grab the inside of this so df3 we're going to say race length and we're going to say that's equal to 50 miles then on the outside over here we're gonna put athletes average speed like this and you can see the distribution so it's pretty common to be around a six or a seven speed and just to show you guys when I ran this in shape I was at 8.6 so if we see that better than average but thing is this is just general for courses and every 50 mile is different you're gonna have some very very technical ones like they go up Steve mountains uh but you're also going to have Road ones so unless there is like something in here that showed the difficulty it's tough to kind of view the speed of 50 on just like an equal level again for this data set it's kind of cool to see this but just as a caveat for ultra running not all 50 milers are the same there's quite a lot of differences on those uh the next thing I want to show you guys is a violin plot I think these are actually pretty fun to do so sns.violin plots I'm going to say the F3 or you can also just put just as a heads up you can say data equals df3 either or I'll just put that for this example just to change up the code a little bit and then we're going to put over here race length again and then on the Y side we'll put athlete average speed and then Hue let's say that's equal to the athlete gender on here and this is just a basic one on this side of things you can actually add in a lot more additional information just to kind of spruce this up a bit uh so you can say split it goes true on here then you can say enter equals quartz that then line with one and check this out I think it looks a little bit nicer in here and uh you can see male versus females Mills are a little bit faster on the 50k and they're actually a lot closer on the 50 mile side of things uh one thing that's kind of interesting for Ultra marathons is for some of the races like 100 or 200 miles you do have females winning in comparison to the males so like in a lot shorter distance all males uh tend to win in running races across the board but when you get in that farther distance there's a lot more elements to it um a lot of strategy involved with ultra marathons and mindset and it's gets a lot closer I bet if we went to 100 miles or 200 miles these would be even closer together which it's quite a shock to some people but you gotta remember when you're getting to that distance this is the best of the best Runners and you have to dedicate so much time to even be able to complete one of these type of races another one that's kind of fun to do is a LM plot so SNS all right om plots and then we can just put in here is data goes df3 x equals and we're going to do a lot of the same things right so we can put at the age I guess leads eat which this needs to be in single quotes I don't know why I didn't put that in there and then y equals athletes average speed and then we can do we'll show this first and I'll show you also what a hue looks like after um so this is very tough to see specifically what's working um Bo this pretty much shows you all the placements between athlete age and then the speed so it makes sense as someone gets older they're going to run a bit slower um so you can see over here some very fast some very slow and the crazy thing is there are Ultra runners in their 70s or 80s imagine running that much that age um crazy right um but we can also put over here Hue equals athlete gender I like to use earlier and you can break this down even more so let this load and then you can see mail uh versus female so there is a bit of a gap on here between the 50k and 50 mile or combined especially uh with this data showing that there's significantly more 50k uh data within here um but pretty cool type of a graph so now these are just like questions I want to find out from the data set so questions I want to find out how from the data and just to make things easier too I'm just going to copy over this which is some of the different things that we'll be using specifically so here's the first one right so I want to find the difference in speed for the 50k and 50 miler male to female so we're gonna be using a group by statement and I just want to show a few different ways again showing that you know how to use Python and again this stuff's been pretty intricate I've already 71 lines through it and like we've done a ton of work so far and also just a quick caveat before we move on to these questions if you're still watching make sure to subscribe to the YouTube channel again this takes quite a lot of time to plan out these type of videos build out the code and then I have to rewrite it for the videos as well and then even edit the videos so if you can spend 10 seconds subscribing to the channel it does really help these take hours upon hours and if you do want to also practice some python questions like these over here but for more interview based I have a link to strata scratch Down Below in the description notes 100 for free if you do get a paid plan it does help uh with the channel through an affiliate sale I do want to hire an editor in the future so I can push out more videos and that fund will go towards hiring an editor all right let's continue going forward so what I want to do for 50k 50 mile male to female is we're going to do a group by so we're gonna do df3 a DOT Group by then on the inside over here uh first thing we're going to do is race length so race length and then we're also going to put uh play gender it's gender on here and then on the outside things I'm gonna put athletes average speed and then we're going to put dot mean and then you can see over here male versus female so again female we have 7.08 male we have 7.7 so like about 0.7 of a difference uh but 50 miles female 6.83 uh male 7.2 so 0.4 so again very close on here and this is just the mean which isn't the best example to use but just to show you it's closer on that side of things um so next thing I want to see that's what age groups are the best in the 50 mile race and I want to see also 20 plus races minimum so Ultra marathons is predominantly people like later 20s 30s 40s things like that especially more time and really you can develop that skill set versus younger age tend to be shorter distances uh so having some teenagers in here or early 20s might skew the data a little bit so that's why I'm adding in this 20 plus race minimum also it allows me to build out a little bit more of a complicated uh code over here too so the first thing I want to do is query this so we're going to put query and then we can just do race length and we're going to say that's equal to 50 miles so now we can do our group by so you can just say group buy and in here you can do athletes age and then next thing we're going to do is athlete average speed then we're going to put a dot Ag and then in here we're going to do mean and also count like that and then we're going to sort our values so sort values and we're going to sort by our mean and we're going to do ascending equals false if you thought we're done we're not there's more so you can do is do a query and inside over here we're going to say count is greater than 19. and then we'll just put like head and we can just do something like 10 or 15. we'll just do 15. and here we go so the fastest age is 29 which makes sense this is kind of like your Prime for ultra running I would argue so 29 7.9 23 is next at 7.77 which is really good um so just to see where I was at curious obviously again I was 22. when I ran this race uh but I was at 8.6 so again so above average from age happy four back then although this next 50 miler way way tougher 95 degree weather I'm gonna 20 pounds heavier too a little bit of a chub down there but anyways 28 7.5 30 7.5 25 7.5 and it goes down to 34 7.3 and if we do a sending equals true just to see who the slowest right this doesn't make sense so 60 62 61 63 58 64. right older ages uh it makes sense on here and then we could just say an age of the best they're false we'll say what do we have 15 on here show 15 and I'm going to copy this paste this over here and I'm going to say what age groups are the worst the 50 mile race and just copy this code here and then we can just put ascending equals true and changes 15 over here to 20. you can see that and if we wanted the count to change to let's say 10 plus so we'll just change it on over here uh that way we can see some of the 70 year olds and stuff like that so 70 they're running a 50 incredible and uh 65 68 67 66. so pretty cool on that side of things if that inspires you to go exercise I don't know what else will okay lastly and this is going to be using a Lambda function one of my favorites I'm going to copy over what I want to go and uh uh seasons for the data so our Runners slower in the summer than winter should be right that's way tougher to run 95. um so what we're going to do is go back over here and there's a reason why I uh kept this junk over here uh this race day we're gonna be using that and we're gonna specifically look at some spring which I wrote specifically as Sprint which is nice right summer fall and winter also nice and uh it's not let me copy that or edit that so I'm just gonna remove that over here we're gonna put spring and also winter and then my hints on here was split between the two decimals you guys want to try doing that problem to do so it is kind of difficult and if you haven't watched my Lambda video uh yeah this might be tough but if you haven't watched my Lambda video watch it after this so what I would recommend on this one we start with the F3 but over here race month like this and then we're gonna say that's equal to df3 we're gonna say Reese day the Str dot splits and this is quite a lot in this video period in here dot Str dot gets and now we're going to want that middle section because it's European time so we'll put one or european dates and then as type int so essentially what I'm doing is looking at this race day here and I'm going to grab this middle section and again just as a caveat there's probably some bad dates in here getting some million records so this won't work for everything um again just a shorter video we're going to assume that the dates are all going to be like this so go down over here and that works so just to do the F3 dot head right go all the way over here two is being grabbed on there too which I guess you could say why is it grabbing this these are both 2 2 20 20. just do like a head of 25 or something like that and you can see one two but this still grabs two so promise you it is working right now I'm just gonna remove this head because it is annoying so now let's have something called Race season and see it based off of what I had above and we're going to say that's equal to and we're going to grab this race month so put that over here race month like that and we're going to do dot apply and our world favorite Lambda X and then we're going to say winter that if and we have all these over here so first let's take a look at 12. so if x is greater than 11 else then we're going to say all if x is greater than eight in this case it out else over here summer if X is greater than 5. else spring if X is greater than two else winter like that and it's closed out over here and I think that should work no errors which is good and then we'll put over here df3.head 25 and scroll over here these are all winter cool so I know this logic works because I tested it earlier but essentially how this Lambda Works in general grab it winter if x is greater than 11 8 fall summer five spring two and then everything else is going to be winter which will be one end two now if you want to get familiar with Lambda functions Again full video here on the channel and now we can do a full Group by on this side of things so we can do a df3 DOT Group by and then in here we can just put race season great and then over here we can just put athlete average speed like that dot a g g then you can put over here mean counts and then we'll just do a short values doesn't really matter in this instance because there's only four but that's okay and I'll just put over here mean ascending equals false and then we run this code over here so you can see is the easiest to run a race in Spring but summer 6.8 so uh compared to Spring you almost run one kilometer slower per hour and obviously the count is way different across the board of a summer being the least amount but also makes sense why summer is the least amount because that's the hottest and also toughest season so there's a lot less races there and just to show you like 50 miler only it would be my assumption that summer is tougher because you're in more miles in the Heat and fun times for my race next week but we can just go over here and put query this time and then we're gonna put on this side single quotes race length equals and then in double quotes 50 Mi and about a period here so that it can work and now you can see the big difference as 7.5 versus 6.5 full one point in comparison over here which wasn't so yeah way tougher with 50. so I'm sure with this side of things like if we went to 100 or 200 miles summer is going to be even bigger of a difference between that and fall especially as the distance grows but sometimes that makes Ultras difficult too is the world's toughest summaries bad water 135 is yeah 135 miles so it won't show up for 100 Miler it wouldn't show up for 200 and that's why some of these calculations I can be tough hey you're still here well I appreciate you continuing to watch this coding video I want to make a lot more data projects like this in the future and the best way to support the channel what's 100 for free is by subscribing to it now if you want to learn even more about python having full playlist over here I go over some of the different concepts that I taught in this video I'm more in depth with other examples too and some of them can be a little bit more difficult than what I showed over here so make sure to watch those videos
Info
Channel: Ryan Nolan Data
Views: 36,156
Rating: undefined out of 5
Keywords: Data Analyst, Data Scientist, data analyst job, how to become a data analyst, data analyst roadmap, python, portfolio project, pandas, seaborn, python eda, pandas eda, Exploratory Data Analysis, beginner portfolio project
Id: 4sZFkPw87ng
Channel Id: undefined
Length: 57min 55sec (3475 seconds)
Published: Mon Jul 10 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.