LIVE CODING: Flight Data Exploration with Pandas & Python

Captions
Okay, it is Thursday, October 20th, 2022. Let's get ready for a great night. I hope you all had a great week so far, and I hope you're excited to do some live coding, because that's what we're gonna do tonight here on stream. We're gonna stay focused, we're not gonna get too crazy, and we're gonna have fun together. If you're there in the chat, let me know, that would be great. Okay, now you can hear me; I was having a little audio issue before, but we'll figure it out and go with it. So here's the deal, everyone: we are going to dive into some data exploration. We created a dataset on a stream not too long ago where we pulled down all the flight data from a public database. Hey, good to see you, Yusuf. Hey databasics, what's up. "Time to cause some problems on purpose," that's for sure. So I'm going to activate my kaggle2 conda environment, start up a JupyterLab instance, and we're going to start coding in Jupyter. I think I actually finished creating this dataset while I was offline, and it consists of all the flight data from, I want to say, the federal transit agency. Let's go over to my datasets, over to my work, and I'll show it to you all here: this Flight Status Prediction dataset. I'll paste the link into the chat, so if you're watching this afterwards you should be able to see it there, or I'll link it down below. Hello everyone in the chat, hope you're doing well; check out that dataset. I did not include column descriptions, but I did start a very early EDA notebook, and we're going to dive deeper into this data outside of a Kaggle kernel, locally on my machine, so we don't have to deal with any slowness. Not that Kaggle is slow, actually; I heard today, or maybe it was yesterday, that they upgraded their online instances to be faster, but that's news for another day that we can test out tonight. We are just going to dive into this data. So let's go into our flight data folder; yeah, this is the flight data that we have, and we're going to start a flight data EDA session. Let me know if you have any suggestions for what you want to see us look at specifically. Let's remember where this dataset came from; it should be in the data description that I wrote. That's right, this website, transtats.bts.gov. I don't know what the BTS stands for, but they have datasets with the information for all flights in the United States going back to 2018. That's what we're going to look at: flight data exploration. We're going to use pandas and Python, of course, and I'm coding here in a Jupyter notebook in JupyterLab. "Thank you for your lives, I learn a lot from them." Oh, glad you learned a lot, I appreciate it. I've been told that I need to focus better, so I'm trying to take that constructive criticism and focus more today. A question in the chat about whether I went to school for data science: I did, I got a master's degree in data science. All right, so these are just the standard imports that I typically use when we're doing data exploration. There can be more, but these are the main ones for now: we're going to import pandas and numpy, import matplotlib.pyplot as plt, also import seaborn as sns, and pick a matplotlib style sheet.
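For reference, a sketch of what that setup cell might look like; the style-name fallback handles the matplotlib rename I ran into below, and the exact names depend on your matplotlib version:

```python
# Standard EDA imports for this session
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# matplotlib >= 3.6 renamed the seaborn styles (e.g. "seaborn-v0_8-darkgrid");
# fall back to an older name if the new one isn't available on this install.
preferred = "seaborn-v0_8-darkgrid"
plt.style.use(preferred if preferred in plt.style.available else "seaborn-dark")

# %load_ext lab_black  # optional: auto-format cells with the black formatter
```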
Style sheets are nice with matplotlib because you can keep a standard style across all the plots in your report, and you can do something a little more than just the plain-Jane default. Bureau of Transportation Statistics, that's correct, Michael Mallinger. But we can do something cool; what do you guys think about these seaborn v0.8 ones? Oh, these are new, I haven't used them before, but I'm thinking this could be a good color palette: "seaborn-v0_8-darkgrid". Let's see if this works... of course that doesn't work. Let's see what styles are available: these are the ones I actually have on this machine. It might be that a newer version of matplotlib has these newer names; let's see what this error message actually said... "not found in the library"... so maybe if I upgraded matplotlib those would be in there, but let's just use "seaborn-dark" and see what that looks like. Dark palette, let's try that. Oops, I didn't copy correctly; we're going to paste this down here. Then we'll load an extension, lab_black; what lab_black does is auto-format our code for us using the black formatter. We might want to import some other things later, but this is probably good enough to get started. I like to ls our flight data, or whatever data we're working with, so let's ls the flights folder; I think this is where I put all my data. Let's do ls -l... actually I like to do ls -GFlash, which lists everything out with the file sizes in a readable format. I made these parquet versions, and that's what we're going to use, so let's grep on "parquet" to see only the parquet files. You can see they are about 200 megabytes each. We could of course use the CSV versions, and I think we will want to load in this airlines.csv to map codes to the actual airline names. Yusuf says, "I've been waiting for your live to start; in France it's 3 A.M." Oh, you're staying up late for me, thanks for hanging out late tonight. This is fun; we're gonna look at some flight data. All right, so we have the parquet files and the CSV files, and there's no real reason to load the CSVs. But take note of this: the CSV versions are almost two gigabytes each, while the parquet versions are around 200 megabytes. Think about the compression in that difference; a lot of it is probably due to the fact that you have a lot of sparse, repetitive data. Maybe we can get things loaded a little faster too by doing some tricky things. Greetings from Peru, Jaime, welcome. "Chess player lol, streams in chess category, but I ain't complaining." Oh, my stream is in the chess category? I need to switch that, that is important to know. Just a little side story: I had set my stream category to chess because on Monday I played a grandmaster, and guess what the outcome was. I'm not gonna go too far into this, but I played this grandmaster in a simul, so he was playing 20 games at once, and my game got a draw. I feel amazing that I drew against a grandmaster, but I think he was going easy on me, if I'm completely honest. I don't think he was legitimately trying against me; I think he was trying to draw as many games as he could early. I mean, I was losing; I looked it up afterwards, and I was probably behind when he offered it.
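Getting back to the data: the file-listing step from a moment ago, roughly as it ran in the notebook (the flights/ folder name is just where I happened to put the files):

```python
# Jupyter shell escape: -h gives human-readable sizes, -l the long listing;
# grep keeps only the matching files so the two formats are easy to compare.
!ls -GFlash flights | grep parquet   # ~200 MB each
!ls -GFlash flights | grep csv       # ~2 GB each
```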
We got kaleidoscope guy in chat, welcome, how's it going. Okay, so we're looking at this data: combined flights, one file per year, that's how I set it up, and there's a lot of interesting things we can do with it. I read in the parquet version, and if we just call .info() on it we can see all the columns we have and the dtypes they are. I was looking at this before, and I'll pull my notes over to the other screen; I had noted some columns in this data that I think might be interesting. Obviously we have the flight date, when it occurred, and the airline. I think the origin and destination will be interesting, and then these are the really interesting ones: Cancelled and Diverted. There are a lot of columns here that we don't necessarily care about, like DOT_ID_Operating_Airline or IATA_Code_Operating_Airline; there's just a lot of granular detail that we might not need. Hey, we have someone in chat from Colombia, welcome; welcome to all the people hanging out here, and thanks for the cheers and stuff. So let's filter this down; there's something pretty cool we can do if we only care about a subset of the data for now. Another thing we might want to do is look at the memory usage this is taking up: it is 2.6-plus gigabytes of memory. That's even more than the CSV version on disk; in memory it's much closer to the CSV size than to the compressed parquet size. On this machine I have plenty of memory to work with, but if you're working in a cloud environment we might not want to read in all of these columns. "What does 20 at once mean, just side by side, or with a timeline?" That's what a simul is: he played everyone at once, walking in a circle, making one move per person. The guy I played chess against has the world record for games at once; Timur, if you want to look him up. This is the guy I drew in a chess game, but not really, I'm telling you, he just let me. All right, so the nice thing about parquet files is that when we read one in, we can provide just the columns we care about. With read_csv I don't believe we have a columns option like this, but with parquet we do: with columns= set to this column subset, we now have our data frame from 2018 with only the columns we really care about. I filtered down to things like: where is it from, where is it to, was it cancelled, was it diverted. And here's the main thing we're going to calculate off this data: we have CRSDepTime, the scheduled time the flight was expected to depart, and then we have the actual departure time. The difference between these two should, in some sense, be the delay in minutes, unless the flight actually departs early, which it looks like it did in this row; then it shows a delay of zero. But if the departure time is later than expected, this delay gets longer. Someone says they upgraded the Kaggle GPUs; that's awesome, I need to give it a try. Did I see this link? Yeah, I saw them post about this. Wow, you can get two T4s, so dual GPUs. Do you get charged double for it, like does it count double towards your quota time? I'm guessing not.
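A sketch of the column-pruned read; the column names follow this Kaggle dataset's naming, so treat them as assumptions if your files differ:

```python
cols = [
    "FlightDate", "Airline", "Origin", "Dest",
    "Cancelled", "Diverted",
    "CRSDepTime", "DepTime", "DepDelayMinutes",
]

# read_parquet can prune columns at read time because parquet is columnar;
# the unread columns never get pulled off disk.
df = pd.read_parquet("flights/Combined_Flights_2018.parquet", columns=cols)
df.info(memory_usage="deep")
```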
And the RAM going from 16 to 30 gigabytes will be pretty awesome too. Okay, so origin airport, and delay in minutes; we can look at that a little closer next. We already pulled in that first parquet file; actually, let's `from glob import glob`, restart this kernel, and get a list of the parquet files in this flights folder. I'll glob "*.parquet" to list all our parquet files, and then we can load the first one and call it our data frame. Now, a few things: can we make this more efficient? Let's look at .info() on this, and we see that some of these are probably not the most efficient dtypes they could be, and we're going to fix that next. Let me just check something here... why does it say... all right, this is just messed up, okay. So we have the Airline column. The thing is, with something like Airline, these are each string values, but if we do a value_counts on it, there's actually just a limited number of airlines. So what we can do is astype it; but let's look at df.info() first. This is a smaller version, only about 500 megabytes, but we can change this column to a categorical type and I think it'll save us some space. What else can we change? Really, if we just make our cat_cols be Airline, Origin, Dest... how many unique origins are there? .nunique() will tell us: there are only 397 unique origins out of over 500,000 rows, so that would definitely benefit from being a categorical variable. Destination would too, and origin state, destination state. It's a little weird that I'm pulling in the destination city and not the origin city, so let's also pull in the origin city. OriginCityName may not be in there... did I spell that right? Right, okay. City names aren't a big deal because we have the state. Hello; "is the dataset coming from Kaggle?" It's one that we created and put on Kaggle, so you can check the link I put out earlier. Yeah, there you go, someone else in chat helped with the link, thanks for the help. All right, so let's take each of these columns and cast them as categorical. We had 500-plus megabytes before; now 311 megabytes, so it's going to be a little faster when we're messing around with this. Also, let's look at FlightDate; I think it's good to keep the flight date as it is. All right, now that it's not that big in memory, here's what we want to do: `for f in parquet_files`, read each file and append it to a list. So all of our parquet files, 2018 to 2022, which we found with glob, will be read one at a time and appended to this list, and then we'll pd.concat the data frames, reset the index with drop=True, and we probably could delete the dfs list afterwards, but let's leave it. This section is "read in and format data". Hey man, how's it going, welcome to the chat. Now if we do df.info() on this, we have the subset of columns we pulled in; we can always go back and get more if we want later. It's 1.5 gigabytes in memory, so we're saving a lot of space and we'll be able to move around fairly quickly. Hmm, I might have missed a comma around origin city name
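Roughly the load-and-downcast pattern described above, in one sketch; the state/city column names are assumptions. One gotcha worth noting: concatenating frames whose categorical columns have different category sets silently falls back to object dtype, so it's safer to cast after the concat:

```python
from glob import glob

parquet_files = sorted(glob("flights/*.parquet"))  # 2018..2022

dfs = [pd.read_parquet(f, columns=cols) for f in parquet_files]
df = pd.concat(dfs).reset_index(drop=True)

# Low-cardinality string columns shrink a lot as categoricals.
for c in ["Airline", "Origin", "Dest", "OriginStateName", "DestStateName", "OriginCityName"]:
    if c in df.columns:
        df[c] = df[c].astype("category")

df.info(memory_usage="deep")  # ~1.5 GB on stream vs ~2.6 GB untyped
```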
in the import columns list. Oh, that was it: a missing comma. All right. Actually, now that the data is small enough, I kind of want to do another pass through all the column names and see if there are any other columns I might want to include. Here in my data description on Kaggle I actually put each of the columns and its description; it's a little funky in this format, but we can decipher what they mean. So, flight number: maybe flight number would be interesting, because I think the flight number is the same identifier used to describe the flight each day, or each week, over and over, so we could find which flights are often delayed. Let's look at this first; make a temp data frame. Oh yeah, the other thing I want to do here is pd.set_option with display.max_columns set to something like 500, so it's big; you guys have seen me do this before. Now we can see more of these columns when we're displaying them. So if I look at this Flight_Number_Marketing_Airline column (I think it has underscores... there we go) and do a value_counts on it: flight 1095 is the most common flight. Let's also query where Flight_Number_Marketing_Airline (which is a really long name) equals that and see what the actual airline name is. Frontier. Let's Google this: Frontier Airlines 1095... it must be this flight from Islip to Orlando. What's ISP, anyone, any ideas? Hey, thanks for the bits, by the way. Islip? Am I crazy here? We'll figure it out; someone in chat's gotta know. So this is apparently the most frequent flight in our dataset. "You think it's not New York?" Oh, Long Island, that makes sense. So, flying from New York to Orlando: these are all the people trying to get out of the cold in the winter time; it's the most popular, or most frequent, flight. I don't know much about flight numbers, but is a flight number unique to the route, the to-and-from location, for Frontier? They might have 20 of those flights per day, but I don't think so, because you wouldn't want people to get confused by the same flight number in the same day; it has to be unique per day, maybe per week. Anyway, all that's to say I'm convinced the flight number is interesting enough to include in the list of columns we import, so I'm adding it to my list, adding a comma too.
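The wide-display option plus the most-frequent-flight lookup, sketched (df_tmp standing in for the throwaway frame loaded with all columns):

```python
pd.set_option("display.max_columns", 500)  # show wide frames in full

# Most common flight number in the data
df_tmp["Flight_Number_Marketing_Airline"].value_counts().head()

# ...and which airline actually flies it
df_tmp.query("Flight_Number_Marketing_Airline == 1095")["Airline"].value_counts()
```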
What else is here? Originally scheduled airline, so that means if it changed airlines; that seems like another flight number. Operating airline: there's operating and marketing, and I think what might happen is they'll market a flight, sell you a ticket under a certain airline, and then shuffle you off to a different airline. That's why we have a lot of the same flight-number columns in both operating and marketing flavors; we're just gonna ignore the operating ones. All right, so here's all the origin information. We have the city name, so let's pull that in; it'll be a lot of repetitive values, but look, Chicago's the most common city, so we're gonna add this to the list. This is the origin code, and there's a world area code, that's okay. Destination city name: we could do something like "the worst city to fly out of in America" based on delays; that would be kind of fun to do, right? Chat points out the flight number is not unique across airlines: 1095 exists for many airlines. "What do you think, is Dallas the most?" You think Dallas is on there? When we did origin... just to double-check, let's do a value_counts on origin state: California is the state with the most (let's just assume origin and destination have similar numbers of flights in and out; origin city name has nothing yet, because we haven't loaded it in). But Dallas is third on the list. Newark... I think Newark would have to share a lot of its traffic with New York. I don't know though. Okay, now we have this departure-delay indicator, 15 minutes or more equals yes, and departure delay groups; I want to calculate these on our own. Wheels-off time, that might be interesting, because then we would know when the flight actually got airborne; I've had this happen before where you get on the flight, it departs the gate, and then you just sit on the runway. How would we calculate that? It would be wheels-off time, I guess, which would be the taxi-out time. Taxi-out time and taxi-in time, yeah, let's pull those in; taxi, I believe, is the time spent between the gate and when you're in the air. Taxi-in we already pulled in... oh, this one is the arrival time, okay. So we can do something with this: not only was the flight delayed from when it was supposed to take off, but was it delayed in the air? Chat: "the delays are very weather dependent; some airlines are much better at handling weather conditions than others." That makes sense; it's kind of like cities: we're not used to a lot of snow where I'm from, so if someone from Michigan saw how we respond to a snowstorm, they would laugh at us. Arrival delay in minutes, that's interesting; it's basically the difference between these again, but I want to see if the time of day has an effect. Now, this whole diverted business: there's a lot of data in here about flights being diverted, but what I noticed is that it's so rare. If we take the mean of the Diverted column in our temp frame, that's the fraction of flights actually diverted, and it's about 0.1 percent of the time. So all these diverted columns, diverted airtime, diverted wheels-on, the diverted tail number, the fact that a flight could be diverted to five different airports... we are not gonna look at those, because it just happens so rarely; let's say it's outside the scope of this analysis. This could be interesting, though: carrier delay versus weather delay.
But I don't have those in here; what I did was drop some of the columns that were almost exclusively null values, and I think those were some of them, so let's keep it with this. You guys feel good? Someone asks how long we've been streaming: not too long, like half an hour, and we're just at the point of loading in this data and picking the columns to read from our airline flight data. Let me delete some of these cells; just so you know, if you escape out of your cell when working in JupyterLab (hit the Escape key), you can press j and k to jump around, and it's like Vim keybindings, so if I want to delete a cell I type dd on it and it's gone, poof. All right, so now we have our data frame, and if we do .info() on it, we probably could make some more of these categorical, like destination state, or city name, but I think we're good. We've jumped up to three gigs, though, so it's a decent-size dataset. The first thing I want to do here is put these delay times into groupings, because this is the big one: when the flight is delayed, that's what people care about. Let me mark this off as its own section. Let's take the data frame's departure-delay-in-minutes column and do our handy, trusty histogram with 50 bins. Okay, what do we see? We could title this "Distribution of Flight Delays". This is a very long tail: most of the flights are not delayed, or their delays are very, very low. This probably would be more of a normal distribution if we calculated the raw difference between departure time and scheduled departure time, but because anything that actually took off early is considered not a delay, it's truncated at zero. Since it's so hard to see what's going on, let's query where the departure delay minutes is less than 30, maybe; that seems like a good place to start. Still, most of them are zero, with this long tail of delays, so as much as we complain about flights being delayed, it seems like there aren't that many delays. Okay, stream's over, we're done! No, just kidding. I think I might have too many bins here; let's do like 20 bins, because what's happening, if we do a value_counts on the delay minutes... yeah, it's only recorded to the nearest minute, nothing sub-minute, so if we're doing less than or equal to 30 minutes we probably should use 30 bins; that makes it one bin per minute of delay. Okay, so there is a small tail going out here, but most of them are not delays; this is the under-30-minutes view. Now the next thing: I guess we could plot this same thing but where the delay is not equal to zero, so we exclude all the on-time ones. Oh, it's still really low. I'm surprised by that; the biggest spike must be just a one-minute delay. Where does this start to... if we say greater than five minutes... I want to get to a point where we can start seeing what's going on here... greater than 20, still really hard to see... greater than 30, it just becomes less and less likely the further over you go. Coding with Strangers, how's it going man, good to see you, hope you're doing well. Someone asks: should it be greater than 30?
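The capped histogram, sketched; since delays are recorded in whole minutes, 30 bins over the 0-30 range gives one bin per minute:

```python
ax = (
    df.query("DepDelayMinutes <= 30")["DepDelayMinutes"]
    .plot(kind="hist", bins=30, title="Distribution of Flight Delays")
)
ax.set_xlabel("Departure delay (minutes)")
plt.show()
```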
Yeah, so I'm trying to cap off the stuff to the left and look at everything to the right. What we probably need to do here is a range: between one minute and, let's say, 60 minutes. There we go, that's what I've been looking for; where have you been all my life? So: greater than one minute and less than 60, and make the bins match, like 60 bins. This is a very clean distribution: when you remove all the flights that are on time, there's a very clear pattern to the amount of delay time you get. This is when people start getting pissed, like when it's 30 minutes delayed; I guess it depends who you are. But it is less and less likely the further out your delay goes. Now, another way to look at it, and this might just be a big waste of time: let's go to the CRS departure time. One thing is this might be a little weird for flights that take off around midnight, because then I think the clock wraps to zero, or starts over at one. So we have this CRS departure time and then the departure time. Notice how the departure time column is a float but CRS departure time is an integer: that's because there's always an expected departure time, but the actual departure time might be missing if the flight was cancelled. If we just naively took departure time minus CRS departure time, in theory, for this first row it should show a delay... oh no, this one left 47 minutes early? Is that right? Let's compare this to our departure delay in minutes, and double-check these times. Yeah, it's local time, in hours and minutes. Wow, so these flights left early and these were really late. So why is this one 378? Oh, okay: we can't just do a simple subtraction. I'm trying to think how we would handle this; if a flight is around midnight we might accidentally say the delay is way longer than it is. All right, let's try to parse it: take the first two characters... whoa, it really did not like that. No pressure. Can I do departure time round(-2)? No, because that's going to round it. All right, let's Google "python convert clock time"... I don't know why this is so zoomed in... we have nothing to split on... wow, this seems like a very complicated way to solve it. This is another way to do it. Question: "what is astype, just convert the type, right?" Yeah. What I was trying to do, if we take the head of this: if we make this column astype(str), it changes it from a float to a string, and then we could take the string's first two characters, which gives us these 18s, and then characters two to four, which gets us the minutes. But it was acting really slow. ("Ask GitHub Copilot," that's funny.) What I think is happening is it doesn't like the fact that these are floating-point values with missing entries: where departure time is NA, it's gonna freak out, because it converts them to the string "nan". So I could do dropna with subset departure time, then run the astype(str) slicing on that, and that would give us our departure hour.
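Since HHMM clock integers trip up naive subtraction (1805 minus 1755 is 50, not 10 minutes), here's an integer-arithmetic version of the parsing idea that sidesteps the slow string slicing; it still doesn't handle the midnight rollover:

```python
# Drop cancelled flights (NaN DepTime), then split HHMM into hours and minutes.
dep = df["DepTime"].dropna().astype(int)
crs = df.loc[dep.index, "CRSDepTime"].astype(int)

dep_minutes = (dep // 100) * 60 + (dep % 100)   # minutes after midnight
crs_minutes = (crs // 100) * 60 + (crs % 100)

# Negative => left early; flights wrapping past midnight come out wildly wrong.
rough_delay = dep_minutes - crs_minutes
rough_delay.plot(kind="hist", bins=50, title="Naive Departure minus Scheduled (minutes)")
```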
Because what we want to do is take the hour, multiply it by 60, and then add the minutes. Hey, what am I doing? We're looking at this flight dataset; it's pulled from a government list of all the flights in the US, their takeoffs and times and so on. You know what, I think this is a rabbit hole I don't want to go down. The main thing I wanted to show here, and this doesn't accurately show it since we didn't properly account for the hours and minutes, is that if we plotted this difference with more bins, we would see more of a normal distribution around zero. There are a lot of weird things going on, like I don't know how a flight could take off that much early, but we're good. All right, the next thing we're going to look at is a grouping of delays. I did look this up, and there's a Wikipedia page on flight delays and cancellations; of course there is, there's a Wikipedia page for everything. Hey David Jay; what software am I using? Go to my YouTube channel and check out my introduction to pandas in Python, and also my introduction to JupyterLab. I'm using the pandas package, and I'm coding in Python. Okay, so flight delays: people have done this analysis, we're not the only ones, and maybe we can recreate some of it, like average type of delay. The main thing I want to focus on is that they have these categories; I found this earlier. Here's the sentence: delays are divided into three categories, namely on time or small delay, that's up to 15 minutes (because, like, who cares; if you're gonna freak out about a 15-minute delay, come on); medium delay, that's when you're kind of mad but don't want to say anything (let's be honest, I would never be mad or yell at someone about a flight delay); and large delay, which is 45 minutes or more. Keeping this in mind, per Wikipedia, which kind of means it's facts because we trust Wiki, let's create these delay groups. Yusuf says he redid one of my videos with a different dataset. Oh, nice; that's what it's intended for. The tagline, the motto of my YouTube channel, is to inspire creativity in data science. I want you to be inspired to do something interesting with your own data, because you can watch hundreds of YouTube tutorials and never do anything yourself; the goal is to get you to a place where you feel comfortable doing these things on your own. Someone's greeting from Mexico; welcome Victor, how's it going. All right, so we have departure delay in minutes; most of them are zero, one, two, three, four, which we already saw in our previous analysis. So: let's add a new column called delay group, and locate where the delay is zero; sorry, I'm doing this wrong: when the delay is zero, the delay group will equal "on time or early". We're also going to make our other three groups: small delay (let's not add a slash there), medium delay, and large delay. These are the official categories, people. Small delay is going to be when our delay meets a certain condition, which we'll add here; hold on one second... why did this add a parenthesis in here? I want it outside of this. So this is when our delay is greater than zero and
less than or equal to 15: that's our small delay. Then greater than 15 and less than or equal to 45: medium delay. Greater than 45: boom, we don't even need the upper condition, because that's a large delay. Let's let this run. We got a question in chat: "can I assign values using loc?" Yes, you can assign values using .loc. What this is doing is locating rows and assigning this column's values only where these conditions are true. It's a much faster way of doing it than writing a function and applying it row by row; there might be an even faster way than this, but this is nice. So now let's check the delay group; first, are any of them null? A quick .isna().sum(). So when are these NA? These are NA when there is no departure time; yeah, the departure time is null, and that's probably because the flight is cancelled. So let's do another one: if cancelled... well, just the column by itself, that's the pythonic way of saying "equals true". If it's cancelled, then the delay group is "Cancelled". Now let's look at isna().sum() again; we're trying to get this down to nothing. Out of 29 million flights, only about a thousand have an NA value now, but I want to figure out why: these were not cancelled, not diverted, and they all look like they departed on time or close to it. This is really weird. Can we do a .diff(axis=1) on this? We're kind of dealing with the final few; it looks like a bunch of these should just be zeros, and then we have a few other funky ones. Let's just not deal with them, since it's weird that they're not in there. "What kind of flight data are we collecting? Where have we pulled this flight data from?" I'll paste the link here; in the future, if someone asks, you can tell them that's where it is, or point to the dataset. Someone else says: "hello, I'm loving the stream so far. I know this is a little off topic, but how far are you with plotly? I'm starting recently and I've loved it so far." We're gonna do some Plotly Express stuff for sure; we did Plotly Express last stream too. "Plotly." I just said that weird. All right, so let's do a value_counts on the delay group and plot it; we'll pull in our color palette, there we go, and give it a title: 2018 to 2022, these are your flight results. Should we do these as a percentage? It's on time, small delay, medium, large, and then cancelled, which is actually pretty rare. You're saying floats? I'm not sure what that means. "My flights get cancelled all the time." Maybe you're flying out of the wrong airports. Okay, so this looks like the main result; let's express it as a percentage, because the raw counts on the x-axis are kind of hard to do much with. We can take this value_counts and divide by the total number of flights... need some good background music... so if we divide by df.shape[0], it gives us a rough percentage, and this says about two percent of flights are cancelled. Let's Google "what percent of US flights are cancelled" (did it just autocomplete that?). Okay, they have numbers per year: 3.2 percent, and it looks like 2020 was crazy, but what we're seeing is more like 1.5. Okay, so maybe what we could do here is break this out by year.
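Before moving on, here's the delay-group assignment pulled together into one sketch of the .loc pattern (the labels are my shorthand for the Wikipedia categories):

```python
df["DelayGroup"] = None
dly = df["DepDelayMinutes"]

df.loc[dly == 0, "DelayGroup"] = "OnTime_Early"
df.loc[(dly > 0) & (dly <= 15), "DelayGroup"] = "Small_Delay"
df.loc[(dly > 15) & (dly <= 45), "DelayGroup"] = "Medium_Delay"
df.loc[dly > 45, "DelayGroup"] = "Large_Delay"
df.loc[df["Cancelled"], "DelayGroup"] = "Cancelled"  # NaN delays come from cancellations

df["DelayGroup"].isna().sum()  # should now be near zero
```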
To group by year we have to make a Year column; we can pull the year out of the flight date. The question we're trying to answer is: what is the percent of flight results by year? The published numbers show a big shift when COVID hit, with a lot of flights cancelled, and then in 2021 it looks like fewer were cancelled, maybe because people were traveling less. This will be interesting to see. So let's group by year, take our delay group, do a value_counts, and make an aggregation data frame. I also like to unstack these; what unstacking does is put it in a format where it's easier to see visually how the years compare. Let's call it df_agg; it's just a temporary name, and it's a really small data frame, so we can mess around with some styling. Background gradients can be really helpful, and I think it depends a little on my theme here. Let's look at some pandas style examples: it's this "table visualization" documentation where you can really get in there with your styling and colors; I don't know why it doesn't render well in these examples, but it shows things like background gradients. So I think we can pass cmap as "Greens" and it would be green, or we could do "Reds"; I kind of like greens right now. What this shows, though, is the raw numbers, not the percentages. "Is unstacking like unpivoting?" Yes, it's very similar to pivoting, because what we have after the groupby-year-then-value_counts on delay group is a multi-index: if we look at the index, it's a multi-index with the year and then each of these subgroups. By unstackity-unstacking, by unstacking, we basically flip this multi-index so that one level becomes the columns and the other stays the rows. Hope that makes sense; really cool stuff we're doing here, guys. But I'm not too happy with this, because the aggregation is not in percentages, and we want percentages. Let's do a sum with axis=1; now we have the total number of flights per year. And we could have done this a lot easier: for how many flights per year, we can just do a value_counts of Year, sort the index, and plot kind="bar" with a figsize of like (10, 5) and the title "Scheduled Flights per Year". We're gonna see some weird stuff with COVID, right? These numbers are so big it makes it look weird; it would be nice to add the numbers onto the bars, but that might take too long. The big thing to note: do we have all the months for 2018? Something's weird between 2018 and 2019. I just wanna see... it does start on January 1st and it ends December 31st, so I'm not quite sure what happened from 2018 to 2019. Maybe this doesn't include all airlines every year, because do flights really go up that much in one year? It almost went down to the 2020 level, and no one was traveling then. Something funky is going on here. But the main thing I wanted to do was this sum; so take our df aggregation, and can we just divide to get some percentages? Hmm; we could take the cancelled column and divide it by the sum, and that
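The year roll-up as it stood before the normalize tip came in, sketched:

```python
df["Year"] = df["FlightDate"].dt.year

# Rows = years, columns = delay groups, raw counts in the cells.
df_agg = df.groupby("Year")["DelayGroup"].value_counts().unstack()
df_agg.style.background_gradient(cmap="Greens")
```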
gives us a percentage. This might be a weird way to do it, but we can loop over each column and divide it by the totals; let's do that up here. So what am I doing? I'm just looping through each column and dividing it by the total for that year to get percentages. I could also take this df_agg and multiply it by 100 to get round percentages; I don't even need to do it in the data frame, I can just do it right here. Now we have the percentages... wait, 99 percent have a small delay? Something's weird there. Oh, I see what I'm doing: I'm re-aggregating each time I run the cell, and I don't want to do that; it messes up the total calculation. This should be better, guys. "Is this table visualization style feature new with pandas 1.5? Embarrassed to say I haven't updated since 1.3." Funky Monkey, you should watch my video on newbie pandas mistakes. I think this styling has been around a while, people just don't know about it; it changed in version 1.4, so if you're using 1.3, maybe that's why. Oh my gosh: "you can use value_counts(normalize=True)." Okay, my mind would be blown if that's what we can do. You're telling me everything we did here could just be done by passing normalize=True? Mattius... oh, I put it in the wrong place, not in value_counts, sorry... how would that work... oh man, that's awesome, thank you for telling me that. Thank you! Look at this: all this dumb code I wrote up here is not necessary, and I'm going to use that all the time. See, you can learn stuff all the time; that's what's so great about working in a field like this, and what's so great about streaming to you all, where you teach me things. Mattius, I'll buy you a beer next time we hang out. "Can we do this with countplot? It would be easier." Oops, that's not what I meant to do... seaborn countplot, yes, we could. I don't know why I tend to use the built-in pandas plots more often, just because it doesn't involve a different package, but countplots are nice; when I want to see the standard deviation of something, for instance. So let's do this exact same plot with a countplot; how would we do it? We can just give it x equal to the Year column. Yeah, so it adds a little more color, more flair, more pizzazz. "How do you look for hidden patterns in the data, what model of neural networks?" Ah, you're getting ahead of yourself; we're not doing neural network stuff right now. Can I explain what normalize does again? Okay, so we wrote a bunch of code; let's make sure normalize does what I think it does. Yeah, it does. When we do a value_counts of, say, Year, it gives us the raw numbers: 2019 had this many flights, 2021 had this many, blah blah blah. What normalize does is give us the proportion of the total, so now we see 2019 is actually 27 percent of all the flights in our dataset. Memo says something about sklearn pipelines; yes, I do find them clunky too, and I don't use them as much as I should. I'd also like to order these columns in a way that makes sense to me: on-time/early and small delays should be first, then medium delay. There may be a way I could have set this up using an ordered categorical type, but this order makes more sense: first on time, then a small delay.
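For the record, the one-liner the chat tip enables, replacing the divide-by-totals loop:

```python
df_agg = (
    df.groupby("Year")["DelayGroup"]
    .value_counts(normalize=True)  # proportions within each year
    .unstack()
)
(df_agg * 100).round(2)  # readable percentages (multiply once, not twice!)
```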
Hey, what do we got here on Twitch? stain_train subscribed with Prime; stain_train, thank you so much, I appreciate the subscription. I gotta spin the wheel, there's no getting around it; I spin this wheel any time we get a new subscriber on Twitch. You can use Amazon Prime to subscribe on Twitch if you have an Amazon account; it's free and doesn't auto-renew. Oh, I wish it didn't... look, I just want to re-spin it; let's not do a bass lick. My thoughts on Java? I'm not a big fan, but I also just don't know enough about it. Ten push-ups, okay. Thanks again for the Prime sub, stain_train, and thanks to all the new followers for hanging out tonight. Okay, so yeah, this column order makes more sense to me. Look at this big flip in 2020. But these are not normalized values, so let's rerun these cells... why isn't this working? First of all, I think I see the issue... hey, clipped, thanks for the subscription, six months! Clipped, appreciate it. Yes, you're right, guys: I'm multiplying by 100 twice, which makes zero sense. Let me spin that wheel; some of these options are kind of crazy, I'd have to go get a whole banana. Push-ups... next month I don't have to spin for you? I can't get out of it that easily. All right, so here we go: craziness happening in 2020, where you actually had more on-time flights than in previous years, but many more that were cancelled. Now let's see how this lines up with what we Googled. Why is their figure only January through May, is that peak travel time? What's up mgx, welcome to the chat. Swing Lord, thanks for the subscription, let's spin the wheel. So I think we can't compare this apples to apples, but we are seeing this big jump in cancellations; not as high as their 11 percent, but maybe that's because we're not looking at just January to May, which makes sense. Ten push-ups again; I'm gonna have to lower the odds on that one. If you're watching this later on YouTube, you can fast-forward. Hello, awesome screen name, thank you vwishy. The percentages don't line up; should we see what the source of this is? "More than one in five domestic flights was delayed in early 2022." This is why they board your flight so early, because they really don't want to be delayed; that's probably a number where even a one-minute delay would count. But why do they say January through May, can anyone explain that to me? All right, let's just keep going with our analysis; other people can critique it later, we're gonna do our stuff now. I'm gonna get another crack comment, I know, I know. So let's look at it by month, the same thing by month; this is the nice thing about doing this in code, we can just reuse it, swapping in Month, and let's make it a different color so it stands out. There we go. Okay, so what do we see with months? This might be why they were looking at January through May: typically all the cancellations happen early in the year. Could that be because everyone's traveling? Well, people travel during the holidays, so I would think it would be high there too... oh, it's probably weather related, I don't know; I need someone here to tell me. Okay, let's keep on trucking. "You think this is due to winter storms?" Could be, but that's getting into April, and April's not that bad. Chat asks on which days of the week and in which months there are more aircraft accidents or flight cancellations, and whether summer hot-weather days more often see cancellations due to weather.
Okay, this could be COVID cancellations, you know, that could be it. This could all be because of COVID hitting here: all these flights were still on the books and then they had to cancel them. Let's test that hypothesis: why the high cancellations in, let's just say, March and April? What does it look like by year for those months? Let's go back to doing this by year, but query where the month is March or April, and maybe make this "Oranges"; that's my favorite. Someone says: now every time it's a business question, the answer is "oh, it's COVID." Yeah, but you have to prove it. All right, looky here, looky here: our hypothesis is justified, or shown to be correct. When we group everything together and look at the highest-cancellation months over almost five years of data, we might be led to believe, if we just aggregate by year, that this is the high season for cancellations; but it's not necessarily, it was just really messed up because of COVID. So let's do this same plot as above but query year not equal to 2020 (even though 2021 was kind of messed up by it too) and see what it looks like. When are the cancellations? They're all coming in January and February; that's when the cancellations are, due to weather, I would think. It looks like we kind of have two peaks in cancellations: it gets higher in the winter for winter storms, and then it also gets higher in the summer, probably for thunderstorms. Chat suggests an extra groupby of month and year. Yeah, so how could we see month and year? I know what we could do; this is pretty cool, I've used a few packages that do this, and calmap is one of them. Let's see if I have it installed... plot using calmap... it's not installed, so I'll go to my terminal, conda activate kaggle2, pip install calmap; typo'd it the first time, there we go. What this will let us do, if it installs correctly, which it did: let's plot their example. Okay, so this is the example plot; that looks a little bit prettier. What I'm plotting here is just fake data, different every time I run it, but let's try making one of these calendars for each year. Would that be cool? Let me get my OBS back up. Chat: "in the groupby, pass month and year." Yeah, but then we're doing this unstacking... okay, true, I could do it this way: group by month and year, query the cancelled ones, do a size on it, unstack, and then style it with a background gradient. You can see where our data ends in 2022. "astype(int) might make this look a little prettier." No, it's these NaN values that it doesn't like. "Make it for next year, I wish." "Put it in a banana and call it banana." "Are you sure the color gradient is correct? The large values in the early/on-time columns seem a similar color to the smaller values in the rest of the heat map." I see what you're saying; oh, that's a good point. I didn't mention this, but the background gradient can be applied along an axis, and I probably should do this in a different cell so it doesn't have to recompute. If we do axis=1, it shows the gradient within each year's row, so on-time/early is always going to be the highest; when we do axis
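A sketch of that month-by-year cancellation pivot, with Month derived from FlightDate the same way Year was:

```python
df["Month"] = df["FlightDate"].dt.month

(
    df.query("Cancelled")                 # cancelled flights only
    .groupby(["Month", "Year"])
    .size()                               # count per month/year cell
    .unstack()                            # years across the columns
    .style.background_gradient(cmap="Greens")
)
```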
equals 1. I believe by default that axis is 0, so it shows the darkest value down each column, across years; it's coloring by column, like if you were in Excel and colored column by column. I don't know if it's a linear scale, but it is per column. We can also do axis=None: that basically treats all the cells together and makes one gradient scale based on all the values. Hopefully this makes sense. The main point is that I think axis=0 makes the most sense for this use case, because we're trying to see, out of all the years, which was the worst year, or the year with the highest percentage on time or the highest percentage cancelled; otherwise these small numbers get lost in the fray. But to each their own. I want to do this calmap thing; I know you guys aren't super excited about it, but I don't care. It looks like we just need a series with a datetime index, so that's going to be easy: we'll group by flight date, take the cancelled column, and take the mean of that, which gives us the percent of flights cancelled on each day. We could do this by airport too; we've gotta look at airports, and we've gotta look at airlines, there's so much more to do here. Okay, if you're enjoying this, just give me an F or something in chat, bump it with an F if you think this is interesting. Okay, so this gives us our events series. Thank you all; Twitch chat is greater than other chat, just kidding; okay, here comes YouTube. Okay, KeyError: that's because I wrote year equals 2015, and I don't have 2015; start at 2018. Oh look, chat is dominated, people are loving the stream, I love it, thanks guys. "Saw my TikTok today?" Nice, trying to get on TikTok with all you little kiddos. So, we don't want this to be a linear gradient; let's look a little more at the things we can pass this. Can we do it on a logarithm? Yeah, we can use matplotlib's normalize, that'll normalize the values. Monthly border, I think that kind of looks cool. "I think the darkest heat should be on cancellation as the most severe." "Found your channel recently, really helpful and very inspiring." Thank you; say hi to Socrates for me, hi. yearplot, what? So basically what we have here: if we plot this series as a histogram, we have these big outliers, so if we take a log transform, it becomes more of a normal distribution. This is one of the tricks that us data scientists use from time to time, because you have these way-out outliers that otherwise make the plot not look so great. Let's try it; there we go. But now the downside is that they all look so similar. Let's also make this multiple axes, five rows by one column, and then "for i, year in"... we have 2018 through 2022; I know I shouldn't be typing these out, but I just want to make sure I get the range right... "in enumerate", which will let us run the plot for each year. There we go, whoa. And let's set_title with the year on each axis, and then we need a fig.suptitle: "US Flight Cancellations". The y of the suptitle always ends up too high, so something like 0.9, and then fontsize maybe 20.
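The calendar grid, roughly as assembled on stream; calmap wants a Series indexed by date, and the log1p here is my stand-in for the log normalization discussed above (treat the exact calmap arguments as assumptions):

```python
import calmap

# Fraction of flights cancelled on each calendar day
events = df.groupby("FlightDate")["Cancelled"].mean()

fig, axs = plt.subplots(5, 1, figsize=(12, 14))
for i, year in enumerate(range(2018, 2023)):
    # log1p compresses the COVID spike so ordinary days stay visible
    calmap.yearplot(np.log1p(events), year=year, ax=axs[i], cmap="YlOrRd")
    axs[i].set_title(year)
fig.suptitle("US Flight Cancellations", fontsize=20, y=0.92)
```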
All right, so what do you guys think about this? I think it's pretty cool, huh. "What's the difference between something like Jupyter Notebook and an IDE like Spyder or PyCharm?" Check out my channel, I have a whole video on that. "axs[i] for the title." Oh yeah, good catch; I didn't index the axis, and that's probably why I had to make that suptitle so low. Maybe let's pick a different colormap, something like this. I think this is pretty cool, guys; it looks like a sunny day. Do I like Jupyter more than VS Code? I've tried to do Jupyter in VS Code and it just doesn't feel right. I do code in VS Code a good bit, though; this is some of the stuff we were doing on our last stream. Usually it's exploratory work in a notebook, and then if you have any code you want to factor out, you write it in an actual IDE environment and import it into your notebook next time. "Can we make that interactive?" We probably could; speaking of Plotly Express, someone mentioned this earlier: a heatmap calendar. They might have this... oh, plotly_calplot, huh. All right, let's try to install that: pip install plotly_calplot; please don't have a lot of dependencies... oh, you're totally changing my numpy version, okay. So we take our events series, reset the index, and pass it in with x as the flight date and y as cancelled. There we go, folks; I probably should do the log scale again. Modern technology: look at this, now we can actually hover over the days. Someone just mentioned the package, and in ten seconds we had it up; that's pretty impressive. "How long does it take Kaggle to transfer money to my local bank account?" I think it depends, probably on which country you're in too. "Sometimes pip is awesome, other times not." Yeah, mostly awesome. So it is cool that you can see... okay, let's keep in mind we just don't have data out here, so don't let that bias our view, but it's like, wow, look at all the cancellations when COVID hit, the big wave. And then these 2021 cancellations... I think this is when there were strikes with pilots. If we look at the values, I don't think they're normalized by year; this really dark one here is like 238 and this not-so-dark one is... oh, these are negative values, yeah, that's the weird log scaling. But this is cool. "Maybe the second COVID wave in 2021?" Yeah, I think that's what it was. Okay, so my viewpoint on interactive plots, sometimes, is that they're more trouble than they're worth. "You can even annotate a plotly figure layout with certain positions, explaining things, if you want." Yeah, there's a lot you can do with plotly, and it's getting better each release. Rocco, welcome to the chat. "Is there a reason to seasonally adjust the data?" I don't know; I just don't love the seasonally adjusted stuff at first. What was I saying before? Hey, Rye, thanks for subscribing with Prime, eight months! We gotta spin the wheel for Rye. What I was saying is, I think interactive plots are needed if it's on a website, or if people want to dive in and see the actual numbers, but the static one can be put in a PowerPoint pretty easily, even though I think the plotly one's actually pretty beautiful.
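The interactive version via plotly_calplot, as best I recall the API (treat the argument names as assumptions):

```python
from plotly_calplot import calplot

events_df = events.reset_index()  # columns: FlightDate, Cancelled

fig = calplot(events_df, x="FlightDate", y="Cancelled")
fig.show()  # hover shows the exact cancellation rate for each day
```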
Let's go see what the spin landed on: push-ups incoming, he says; okay, we got a gambler. A ten-second hamstring stretch, ah, just what I needed. Okay, hopefully this stream is fun for you all. Now I want to start looking at airlines. Chat: "that's the thing about delivering these kinds of messages to managers, they want the information clearly; although the results might not be the most interesting, only the biggest results are the most impactful." True, that's true. Tech Guy Dylan, you're mentioning that seasons are weird depending on where you are; true, but keep in mind this is only US flights we're looking at. So let's compare airlines: who has the most delays, who has the most cancellations. First, let's get an idea of the airlines we have. We could do an sns.countplot of the data frame with x as Airline... hmm, this is kind of hard to see, so this might want to be our y. I like using horizontal bar plots when the names are long, because otherwise they're harder to read; you have to turn your head. One thing I hate about seaborn's default plots is the weird multicolored palette it defaults to, so let's just set the palette to "Blues" or something; at least it's consistent now. And how do we sort this? I wish I could just say ascending, but what the order parameter wants is a list, and this is kind of why I don't like using countplot: I have to do a value_counts on Airline, take the index of that, make it a list, airlines_ordered, and pass that as order. Yeah, now it's sorted, but now the color palette is weird; let's just use the second color from my color palette. Now we're getting somewhere interesting: who's the most reliable, i.e., on time? That's a good one: on time, meaning neither cancelled nor delayed, or even early. Southwest, Delta, SkyWest: we have some big players, and I kind of feel like we should focus on the big players. I pulled out the heat-map shirt for the stream; yeah, it's the perfect shirt for this, the Chicago one they gave me. "Use a colorblind palette to be inclusive." We could do that, we could switch to seaborn's colorblind palette; I actually like those colors, and I work with someone who's colorblind. I think the shades are nice. Now I need to recreate my color pal... there we go; and where did I create it later in this notebook? Oh, I just called it pal before. "What is the horizontal axis here?" The horizontal axis is the count of flights; let's add a title, "Number of Flights in Dataset". Oh, that's the other thing: I don't think countplot will let me pass a title directly, so I need ax.set_title like this. "Don't the cancellations get skewed, or am I wrong?" What do you mean? These cancellations... because we have a bunch of nulls? "Wait, so Southwest has five flights in the dataset?" No, the axis label is 1e6, so that's times ten to the sixth power. Yeah, so how do we make this look better? We could do ax.set_xlabel("Flights (millions)"), but then how do we fix the tick values on this countplot? See, this is where I get messed up using seaborn: I know how I could get this 1e6 removed if I just did the value_counts myself. This is why I find myself not using countplot very much; what exactly does seaborn give me here that I couldn't have made on my own? So let's just take this countplot out.
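For reference, a sketch of the seaborn attempt we're about to swap out, with the ordering workaround (`pal` assumed to be the palette list created from sns.color_palette earlier):

```python
order = df["Airline"].value_counts().index.to_list()

ax = sns.countplot(data=df, y="Airline", order=order, color=pal[1])
ax.set_title("Number of Flights in Dataset")
plt.show()
```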
Then we'll do color=pal[2], maybe. Ascending defaults to true. And I also like doing this with width=1 and edgecolor="black". Yeah, that looks kind of sweet, right? Then I could take these value counts and divide by one million... let's do a hundred thousand... put this in here, wrap this in this. Now this makes a little more sense: it's the 50, so that's 5 million flights for Southwest Airlines. That looks good, you like it? We could even take it up a notch and add the numbers in here, but I don't think we need to do that yet.

Let's get to answering some of our questions. What I want to do is set a minimum threshold of at least 2 million flights... sorry, if you don't have 2 million flights... no, let's say 1 million. If you don't have 1 million flights, you're not going to be looked at in our analysis. Let's see, where Airline equals... how does this handle categoricals? So this is top_airlines: these are all of our airlines with at least 1 million flights, and we're going to look at only the rows where these airlines appear. "It was a simple answer: you need to learn Python." You're asking which language to learn: Python. "Can I use Google Colab?" Yes. So, to be honest, we're going to get to that; we're going to do things in percentages. The first thing we can do: we're going to call this df_top, we're going to reset the index and drop it, and we're going to copy it, so we're actually creating a copy of this data frame, and this is going to be our subset of only the big airlines.

Then we're going to group: groupby Airline, delay group, value counts. Now, we know about normalize=True, and then we're going to unstack this data. Dude, we're looking at... someone sent the link... whoo, something went wrong. I think it's because we have some NA-valued delay groups. I may need a reset_index after the groupby. My inkling was it was this... oh, but it isn't that. So what I do need to do here is group by both of these; I think that's where I was going wrong. It's kind of fun to work with a data set this size, where it's middle of the road... oh my goodness, as I say that, my machine is at 50 gigs of... jeez, 50 gigs of memory we're using right now, and some of that's OBS. I'm just trying to get the counts of what the delay group is by airline. Maybe I need to stop this and see how much memory we're using. Now I'm going to start from scratch just to make sure I free up this memory. Okay, we're down to 10 gigs now, much better. "I think the size function will work." Yeah, you're right. All right, so starting over, there's a lot of this we don't need to do again. I think we're making a lot of slices and copies of the data frame, which is where we're getting into trouble; this top_airlines thing probably could have been done a better way. But yeah, now we're at 20 gigs being used. "Do you need such a powerful machine to work on large data sets?" There's an advantage to the memory you have: it definitely makes everything easier if you can fit your stuff into memory, it just makes life easier. If not, you have to start using things that tend to complicate things, so getting a huge EC2 instance can save you a lot of time versus trying to develop something that will actually work with Dask or whatever.
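Here is a hedged sketch of both steps from this part: plotting value_counts by hand in millions (instead of countplot, which sidesteps the 1e6 axis labels) and building the df_top subset. The column name and the one-million cutoff follow the stream; details like figsize are guesses.

```python
# plain-pandas version of the airline bar plot, plus the big-airline filter
counts = df["Airline"].value_counts()

# plot flight counts in millions so the axis isn't rendered in 1e6 notation
ax = (counts / 1_000_000).plot(
    kind="barh", color=pal[2], width=1, edgecolor="black", figsize=(8, 6)
)
ax.set_title("Number of Flights in Dataset")
ax.set_xlabel("Flights (millions)")

# keep only airlines with at least 1 million flights in the data set
top_airlines = counts[counts >= 1_000_000].index
df_top = df.loc[df["Airline"].isin(top_airlines)].reset_index(drop=True).copy()
```

The explicit .copy() matters here: it keeps later column assignments on df_top from raising SettingWithCopy warnings against the original frame.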
"Do you run your projects locally?" Yes, I am right now, that's what we're doing. Not at work, but here I do. Hey, I think this is the biggest streaming crowd I've ever streamed to on YouTube; we have like 125 people, this is insane. So I hope you guys are enjoying this. I guess this is a sign that I need to stop jumping all over the place and just focus on one project.

All right, so I think the problem is trying to run this groupby on Airline and delay group. You're saying I just need a size here? That's my problem, right? That was my problem. "When should we use PySpark over pandas?" Very, very large data sets. If you can fit it into memory, just do that. Why... oh! So this is one of the things we didn't consider: when we make this Airline column into a category and then start to do things like plotting or grouping by it, pandas assumes every category still exists. So can I fix this just by recasting it as a category, like astype(str) then astype("category")? So maybe like this, and then like this, unstack... there we go, now we have just the top categories. And we can't do normalize here, I don't think, so this is where we run it; doing the value counts will not work here. Can I do it this way: df_top, groupby Airline, delay group, value counts? Is this going to go crazy again? And then unstack. This gives us the same result, so this is what I was trying to do before, but now we can use normalize. I think this is going to be good.

All right, so now what I want to do with this: I also want to order it in the order we set up earlier, and then I want to do a stacked plot. There we go. Maybe now that it's smaller, value counts will work. I think the reason value counts wasn't working before was because I had categorical values that did not exist; that's my suspicion, because the size of the data hasn't changed but our use of the memory has gone down. I still think the root cause is something different, though.

Okay, so let's make this kind of like the last plot, the last big plot that we'll do, but it's going to be spectacular. I'm just going to make an aggregation so we don't have to recalculate it each time, and then we can plot it like this again: a new fig, ax, let's make it bigger with figsize, and then this will be it. Let's take this legend and put it outside the plot. A little too tall, I think... there we go. Do you guys see what we're doing here? Everyone see what we're doing here? Because I grouped by both Airline and the delay group, right. Someone says, "Rob, why did they give you data where everything is wrong, spiked due to covid? What about cleaning all the years when there were flight restrictions?" I mean, it's not wrong, it is what happened; it's life. Covid happened, we can't pretend it didn't. So clean what? What is clean? What if every two years we have something? I mean, we kind of do; there's always a recession or something coming around the corner. Nothing is ever clean, and this is why time series forecasting is so rough. I remember meeting with an executive and some people at a call center, and they were saying, yeah, our models are almost working well, we just need to get past the time when the recession hit, then we'll have a bunch of clean data to model off of. My response was, yeah, until the next recession hits, and sure enough that was about six months before covid.
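A sketch of the categorical fix and the normalized aggregation. The DelayGroup column is the engineered delay bucket built during the dataset prep; treat the exact name as an assumption.

```python
# after filtering, a categorical Airline column still remembers the dropped
# airlines, so groupby emits empty groups for every unused category;
# recasting through str drops them
df_top["Airline"] = df_top["Airline"].astype(str).astype("category")
# (equivalently: df_top["Airline"] = df_top["Airline"].cat.remove_unused_categories())

# fraction of flights in each delay group, per airline; "DelayGroup" is an
# assumed column name for the engineered delay buckets
agg = (
    df_top.groupby("Airline")["DelayGroup"]
    .value_counts(normalize=True)
    .unstack()
)
```

Keeping the aggregation in its own variable, as done on stream, means the expensive groupby runs once and the plotting cells stay cheap to re-run.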
So, not a dumb question but an honest question, and I appreciate you asking it; I just want to make sure we realize that with real life comes real messy data.

All right, so on time and early... here's the thing: the cancellations, I don't think we can really glean much from them because the band is so skinny here; it might need to be its own plot. But I would like to sort this, maybe by on time; right now the ordering is just alphabetical, I think. Let's sort by on time/early so that bar is the largest, or sorry, goes descending. Okay, so maybe something like this, huh? That width of 1 wasn't a good idea; 0.8. Edge color, let's do that again, like this. Maybe we could add percentage boxes to these. Datadude just asked if we ever messed with background_gradient; let's check that out: background_gradient.

All right, this is our plot. We want to add the values in our stacked bar plot; we'd like it to be kind of like this. The thing is, when we get to these small percentages it's going to be really hard to see. This is the double-edged sword of stacked plots: your eye can usually tell the difference in the first variable pretty well, so that's just like a bar plot, but trying to tell the difference between the next ones can be tough. So maybe let's split this out. By the way, let's just show here: the most on time is Republic Airlines, the least on time is Southwest. JetBlue is pretty bad for its size too; remember, JetBlue is one of the smaller airlines, and yet they have a good amount of delays, even small delays. And if what you really care about is the medium, large, and cancelled, because who cares about a small delay, if we really care about what's down here, then JetBlue is the worst, in my opinion, even though I do kind of like the stuff JetBlue gives you when you fly. "A streamgraph would look good for that data." Yeah, that would be cool. By the way, I think this plot is really cool. Let's go to Twitter: "made this plot on stream tonight with 100-plus viewers, was a lot of fun working with airline data," and then I should link to my YouTube. Copy this link, make sure it works. Aren't you guys so happy you're here for me posting this? "MMA, like the fighters"... what are you guys talking about?

Okay, so the last thing I want to do, maybe: we could make one, two, three, four, five... actually, what we want to do is change these groups. Take df_top flight status... no, what was it called that we made? Delay group. And let's just make two groups. We're going to make a mapping, delay_mapping. Okay, I guess we could do it this way: we're going to map each of these to different groups. If it's on time or early, that's okay; a small delay, that's okay; a medium delay, not okay; a large delay... these are all not okay. I'm not okay... okay, that's all my singing for tonight. Not okay. Let's map delay_okay, and I'm getting tired here, so the names of these are kind of failing me. But I'm going to do a lot of the same stuff here, basically the aggregation in this plot, but with this delay_okay column. Then the column order, and then sort_values by okay; it should also be ordered so we have okay first and then not okay. "How do you format the delay mapping cell so fast?" I don't know; if I was using Vim I'd probably be faster. Let's also add a title to this... not set title... okay. So, of course, I've got to put this all together.
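And a sketch of that final two-group rollup and stacked bar plot. The DelayGroup label strings in the mapping are assumptions (check them against the actual data); the PercentFormatter line turns the 0-1 fractions into readable percentages, which comes up again just below.

```python
# collapse the delay buckets into two groups and plot the sorted stacked bars;
# the DelayGroup label strings below are assumed, not taken from the data
from matplotlib.ticker import PercentFormatter

delay_mapping = {
    "OnTime_Early": "okay",
    "Small_Delay": "okay",
    "Medium_Delay": "not okay",
    "Large_Delay": "not okay",
    "Cancelled": "not okay",
}
df_top["DelayOkay"] = df_top["DelayGroup"].map(delay_mapping)

agg2 = (
    df_top.groupby("Airline")["DelayOkay"]
    .value_counts(normalize=True)
    .unstack()[["okay", "not okay"]]  # okay first, then not okay
    .sort_values("okay")              # barh draws rows bottom-up, so the
)                                     # largest "okay" bar lands on top

ax = agg2.plot(kind="barh", stacked=True, width=0.8,
               edgecolor="black", figsize=(10, 6))
ax.legend(bbox_to_anchor=(1.02, 1), loc="upper left")   # legend outside plot
ax.xaxis.set_major_formatter(PercentFormatter(xmax=1))  # fractions as %
ax.set_xlabel("Percent of Total Flights")
```

As for the chat's background_gradient question, agg2.style.background_gradient() shades the aggregation table by value right in the notebook, which is a quick way to sanity-check the numbers before plotting.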
The font size has to be bigger; the title font size has to be bigger than your y-axis labels. And to be proper about it, set_xlabel should be "percent of total flights"... well, it's actually a fraction, not a percent. Okay, so maybe he's saying "good" and "bad" is better, right? So this would be like "flight result: good or bad?" This should be "good". "Do I have aircraft model in the data?" That's a good question; these are all good questions for next time. Let me just, as a reminder, point you all towards the data set; I'm going to paste that in here, and you can do it twice, because the first one was bad. Aircraft... it doesn't look like we have the aircraft number. What else would it be called? By the way guys, if you want to, upvote this data set, that would be fun.

All right, so: Delta, the best airline in the world; JetBlue, the worst in the world. No, that's an oversimplification, but that's what we found out today. Yes, we are streaming to YouTube right now, so as soon as this is over you can rewind and watch it all; you can watch all 30 of the push-ups that I did. But yeah, we're winding down here; it's getting a little late on my end. I had a long day, got up at like six-ish, and that's early for me. I want to thank you all for hanging out with me; it's been a blast, I have fun doing these streams. I appreciate feedback, even negative feedback, or constructive negative feedback. Last stream someone told me I was a little all over the place, so I tried to focus today, have one topic, and stick with it to the end. We worked on flight data, and I think we had a very productive stream, so I hope you enjoyed it. I appreciate every single one of you for hanging out; it's so much fun, and I can't wait until we do it again.

We definitely have the biggest viewer count on YouTube, but over here on Twitch, where you should be hanging out with us if you aren't already, we are going to send over a raid. What that means is we're going to send all the people currently viewing our stream over to another channel, and I usually like to raid channels that are doing coding. So let's go into the software and game development category; I like seeing if people are using Python. Oh wait, I'm at the top of the Python category. Okay, I don't know who this person is, Bashbunny, but she already has a hundred ninety-five people; maybe I can raid her channel and some of them will figure out who I am because of that. We're going to raid right now. Thanks everyone for hanging out, and I will see you next time. I stream Sundays, Tuesdays, and Thursdays; however, that's subject to whether I'm available. This Tuesday I was traveling for work the next day, so I didn't stream. I'm sorry I can't be that consistent; I have a full-time job and a family and stuff, but when I am around, the stream will usually be Tuesdays, Thursdays, and Sundays, so hit that bell notification and all that stuff. Oh yeah, let me also run through some of the last stuff that you're all going to leave during, because no one cares about listening to this part: join our Discord, exclamation point Discord... can I spell Discord? Check me out on Kaggle and upvote some of my data sets if you think they're helpful. Also check out my YouTube; if you're watching on Twitch, please subscribe there, and tell your friends and stuff. Also, we had a competition that we launched on Kaggle recently, a community competition, which was a lot of fun.
We're probably going to do another one; that was our third in a series, and hopefully we'll give away a big prize like we did last time, when we gave away a GPU. All right, we're going to raid the channel. Thanks everyone again, see you next time. Ten seconds... bye YouTube, have a good time.
Info
Channel: Rob Mulla
Views: 13,257
Keywords: rob mulla, machine learning, python pandas, data science for beginners, data science project, data analysis, data science course, data science full course, machine learning course, data science projects for beginners, data science course python, data science vs data analytics, data science career, data science jobs, live coding, python coding, data exploration, exploratory data analysis, exploratory data analysis in python
Id: xs_L6z9QNYY
Length: 146min 43sec (8803 seconds)
Published: Thu Oct 20 2022