Chai Time Kaggle Talks with Andrada Olteanu - EDA Grandmastery

Captions
So it always takes a few minutes to go live. Since I now use a monitor, I'm looking left and right, and I feel like a chipmunk when I look at the recording, like, why am I being so weird? [Laughter] So I have to wait to hear an echo; that's when I know I'm live. Awesome. I'll quickly share my screen, introduce the session, introduce Andrada, and then we can get started. I just got exposed: I was looking at one of the kernels in preparation, so that's what you all saw.

Welcome back, everyone, and thanks for joining on a Friday. I'm super excited to be talking to Andrada today, for the second time on the series. I'll quickly introduce her, but first I want to glance over what this series is about. What is Chai Time? This is essentially CTDS 2.0. So what is that all about? I used to host a podcast called Chai Time Data Science in my previous life. At Weights & Biases I get to continue doing that, and I wanted to do more with it, so now we dive into deeper techniques and do solution walkthroughs, where I essentially get to meet great people like Andrada and understand how they approach things. That's what we're doing today: we're learning about EDA grandmastery.

Here's the link for questions and answers; I'll post it again in the YouTube chat. If you head over to this link, which I'll just show you, it should take you to this thread, and you can all post your questions here. We'll keep looking at it as we go through the session. YouTube chat is a little finicky, whereas this gets retained and I can keep an eye on it, so please post your questions on this link.

A little more logistics before I introduce Andrada, sorry about that. I've just told you what CTDS 2.0 is about: it's adding depth to the interview series, learning more about techniques and more about solutions. I'll also have a few questions around Andrada's EDA journey, then we'll understand how she approaches new data sets
and also how she approached that for a complete kernel, one that she's already put out. Andrada is top eight worldwide, out of I think five million Kagglers now, in the notebooks category, one of the top-ranked Notebooks Grandmasters. She's a data scientist at Endava, one of the best EDA storytellers, and also one of the best people to hear the stories from. If you've watched her interview on Chai Time Data Science, it's one of the most streamed ones for a reason: she's such an amazing orator and an amazing person to speak to. She's also a mentor of mine, and she's one of the reasons why I'm at Weights & Biases; I had a conversation with her. So that's another little reason why this session is happening. Andrada, thanks again for joining me for the second time.

No, thank you so much, and thank you for your kind words and for all that you do for this community. It's absolutely amazing. Your videos are still most of what I watch, so thank you so much for doing all this work, and for having me.

I'm always excited to know more about your journey; I keep asking you online and offline. Going back to our previous interview, you were a Master then, and since then you've become a Kaggle Grandmaster. In the previous interview you said that it wasn't the external motivating factor for you; it was just about creating these kernels and understanding the process. Broadly, what is EDA to you? What does that stand for to you?

So the first time I properly heard the words "exploratory data analysis" was in my master's, and it was about statistics and finding insights within the data, which is static. Machine learning goes beyond that and tries to predict something, which gets a little bit more dynamic. EDA, however, is something that you want to see at a particular point in time, in the past or in the present, something that you want to understand about some action that is happening there. You can for sure draw some conclusions and also maybe
implement some actions right on the spot by doing only EDA, so you don't have to get to the machine learning process. I actually saw a tweet, I think a few days ago (who tweeted this?), and it was something like: the first step in doing machine learning is not doing machine learning. By analyzing the data, you don't have to build neural models and train them for thousands of epochs and stuff like that; you can do something just by analyzing and drawing some conclusions by looking at the data. So that's EDA. For me personally, EDA is more about the visual part, I would say. Not necessarily only graphs, but also the explanations that go behind them, with all the schemas, and understanding it. It's extremely visual.

Yeah. So leaning into that, what according to you is a good EDA? Heads or Tails creates these Hidden Gems every single week, and you get featured a lot of the time. If you were to create a list of things that really stand out to you, what broadly is good EDA?

So good EDA, and a good visualization for that matter, is something that, talking about its structure, is extremely clean and clear and beautiful and visually appealing, I might add, and that kind of makes the user go in different directions. It clears up their questions, so it gives clarity, but it also gives some sort of curiosity to go even further. Usually from a graph you get some answers, which is amazing. But a very good visual, and it doesn't always turn out this way, also raises even more questions: you want to get even deeper into the data, you get curious, and you're like, "Oh, okay, what's next? I would like to know more about this," and stuff like that.

Yeah. Everyone has seen this extended effort that you put into it. You've given talks around it, you've created posts
around it, you also create special banners, and you go the extra length of even looking at what gets printed in the console output; you change the colors of that too. How do you find something new to try every single time? Doesn't it get exhausting? How do you find those ideas?

It gets very exhausting, actually. But I'm saying this a lot in my talks, and it's also what I encourage people to do a lot: try to bring something new every time. It doesn't matter how small it is. I have been doing Kaggle, I think, religiously for the past two years, and to do something religiously for two years is quite a lot of time. I don't go to the gym religiously, and I haven't been eating very clean religiously for the past two years, or stuff like that. So I think Kaggle has been the only thing that I have been doing regularly for two years straight. It gets tiring, because at some point you're like, "How can I get this to the next level?" But if you have that faith in your head that there's always something better, then it gets your brain thinking and you get creative. If you put a stop to it and have a limiting belief, saying, "Oh no, this is all I can do, I can't do even more," of course you won't do more. So the secret sauce, I would say, would be to be like, "Okay, what can I add? I'm sure there's something very interesting out there. Where can I find it?" and then doing the research and getting curious. Like you said, banners. The first time I saw a banner I was like, "Oh my god, she's amazing. This is CSS, okay, HTML in markdown. That's amazing, I didn't know about that." And then you go one step further and you're like, "Can I make this prettier? Can I add a little something to it?" And the answer is yes, yes, a hundred times.

People, we're getting the secret sauces; that's what we had
promised. It also goes back to the previous interview, where you mentioned you have this very visual, very artistic inclination, and you're always on the hunt for how to make things stand out. It always comes across in your notebooks, or kernels; older people call them kernels, older Kagglers. So broadly speaking, how do you approach EDA? We can also use that as a transition to looking at one of your kernels. When you see a new data set, what comes to mind? How do you go about your EDA? I've seen your tweets: you first draw the plot, literally, and then go about it. So what's happening behind the scenes? Take us through that.

Okay, so can I share my screen with you, please? Okay, so everybody's seeing my screen?

Okay, I can see it. I wish you had started with the Elon one, where you, I think, photoshopped Bitcoin onto his eyes. Do you want me to go through Elon? I'm already going through this. Oh, I know what you're talking about: you're talking about the crypto notebook, the cover. Bitcoin, yes. Yeah, this one. I could make it even better right now; this is always an ongoing process. I should have bought some of that coin before this.

Okay, so I'm going to first introduce the topic a little bit and then I'm going to walk you through it. The thing is, this is not really a recipe. Usually when you want to get something done, it's like, I don't know, baking bread: you have a certain amount of flour, you have salt, you have yeast and stuff like that, you know the exact temperature in the oven, and so on. For EDA, or at least for my notebooks, it's usually not really a recipe. They start with a thought, a thought of, "Okay, I really want to do this particular thing." In this case, this is a notebook I did almost a year ago, it's gone by in a heartbeat, on a competition: the famous
annual Kaggle survey. Kaggle usually puts out a survey during this time of year, and they kind of interview the data scientists that they have on Kaggle and learn about how old they are, what their background is, what they know, what they use, what they don't use, and so on. Afterwards, usually during December, they have a very interesting competition where they say, "Okay, do an analysis on some sort of feature. We want you to understand some particular aspect from the data, whatever aspect you want. Put it out there." And then greatness happens. Two years ago, actually, I was too scared to participate. I was like, "I don't have anything to..." and honestly I was quite a beginner, so I was just discovering seaborn and how to use it.

Wasn't this the McDonald's one, where you compared McDonald's purchasing power across countries?

Exactly, it's the McDonald's one. And I'm going to maybe break a few myths: it's not original. I asked a friend, and this is something I feel many people are very afraid to do, to ask people. This is a very good friend of mine; I worked with her at Avon before. Her name is Georgiana, and she's extremely smart. She knows how to draw insights and how to look at data, so I knew I was going to the right person.

First, let's talk about the idea of my notebook. It was to analyze the cream, let's say, of the data scientists, and understand how they became so good. And I really had to struggle, because I was like, "Okay, what can I look at?" I can look at their education. Probably Grandmasters, and usually in their interviews with you, Sanyam, they say this, have some kind of doctorate, for example. But that would exclude all the people that didn't really want to go through the education but really put in the time and effort at home. And
then I wanted to look at the years in programming. But this doesn't mean much: you can code once a week for five years and be average, or you can code hardcore every day for 12 hours straight, no sleep, no food and stuff like that, and in one year you're like a god. So I didn't like that approach either. So I went with the more, let's say, traditional aspect, which is money: how much money are you paid to do your work? However, this was a problem, and this is the McDonald's stuff and what I found out from this friend of mine. I was like, "How can I compare pay?" Because 100 dollars in, for example, Romania can maybe buy much more stuff than 100 dollars in the US. And I'm telling you, it is buying much more, because it depends on the purchasing power. I asked her, "What should I use?" and she was like, "You can use bread, or sugar, or something like that, and you can compare how many units, how many packs of sugar 100 dollars can buy in Romania versus the US, India, Japan and so on." I was like, "Okay, but can I make it more fun for the foodies out there on Kaggle?" And I found the McMeal combo, and I was like, "Okay, we can go with that." So what I did was compare. For example, this is a bar chart of all the units, so the majority of people...

The histogram was super, super cute. Can we talk about how you decided, first of all, and how you plotted those French fries out there? How did that thought come into your mind? How does this magic happen in all of your kernels? I really want to understand that.

Ah, you're really picking my brain right now. I'm trying to find a process that somebody else could follow. So regardless of the main idea of the notebook, what I usually try to do, and the idea here was this: I looked at the notebooks that were made before, in 2019 and 2018, I think, and I saw kind of the level,
and I looked at the first, second, third, the first five places, and I was like, "Okay, how can I top that?" And this is amazing, actually: if you look at the winners in 2018, their notebooks are absolutely amazing, but if you look at the winners in 2019, you can see they went a little bit extra. And then the winners in 2020: I thought I did a good job, I got third, but the first and the second places are mind-blowing. It's absolutely amazing what those people did. So they went even further, and now I'm super curious what 2021 is going to bring. It's the same as breaking records, the exact same thing. It's not bad to look at something and compare and want to do something much better, because this is how evolution works. I can give you another example, with running. I think the record for running one mile, for something like hundreds of years, was four minutes or something, four or five minutes, don't quote me on that, but it was a number. And then the record was broken by one second, and then that record was broken again and again and again, just because one person dared to break that record. It's exactly the same here. You want to break the record of last year, because this is how you evolve and make much, much more progress. And you want to break the record of your past self as well, like Andrada from last year. Hopefully I broke her record; I'm working pretty hard, and I hope she's proud of me for breaking it.

Okay, so the idea here was how I can make these graphs, because I looked at what other people did and I was like, "Okay, I need to come up with some sort of very out-of-the-box analysis, but how can I visually bring this to the next level, let's say?" And I took each particular graph. I did the box plot, so here is just a very simple box plot in seaborn, but I added so many features. I was like, "Okay, what
can I do to make this even better? I can add French fries, I can add arrows, I can add information to the graph, I can change the color of these to match the fries in the box." I can do stuff like that, and this is what I did.

Yeah, sure. Can I ask a question around that? You always have this super vibrant, positive attitude and you're always trying to improve. From just creating so much content, I've learned that the difference between great and out-of-this-world content is just that last-minute optimization. The last mile is always the hardest: it's almost done, you could call it done, and then you waste so much time, in a way, just doing one thing, and then something great comes out of it. So does it ever get frustrating for you? How do you continue when, let's say, the plot is almost done and you don't have the arrows, you're trying to make them, you're trying to figure out what to do? How do you fight that frustration, if at all?

I think I know very well what you're saying. When you take a step back and look at it, and you're like, "It looks good." If you're a perfectionist it's never going to look outstanding, never, but you look at it and you're like, "It looks pretty good for now." However, this notebook took one month and a little bit to do, and I worked on it kind of each day. I go back. I can't tell you how many times I changed the color scheme just because I was like, "This doesn't work," or how many times in one notebook I was like, "This graph is very ugly, I need to change it." Or the cover: I changed this cover so many times, and now I don't like it, by the way. I would change it 100 percent. It's just constant improvement until... the thing is, there's a deadline. So when the deadline is here you can't do anything more, and you're like, "Okay, I did the
best that I could. It looks amazing, I am proud of myself." But what I can tell you is, it looks overwhelming now, right? I put in these colors, I have a table here, there's all this methodology behind it. First of all, mind you, it took me one month to make it; it didn't take me five minutes or something. And second of all, and this is from Tony Robbins, by the way, so it's not me: between poor work and mediocre work is a lot of hustle; between mediocre work and great work is a lot of hustle; between great work and outstanding work is just a teeny tiny little bit more effort.

I think it's the opposite. I think it takes an insane effort to not call it done and then keep at it. I would just be happy that, okay, the numbers are there, I'm done. I can't spend one month on earth making things look better. I give up.

I'm going to go to another notebook in a second, and in that notebook the frustration was so big. It took me one week to do a visualization, and at the end of that week I was like, "I can't do this. How am I supposed to make a good, understandable notebook in one month, in two months really, if one visualization takes me one week? I can't possibly, humanly." But what I'm trying to say is: just add little details and make it your own, throwing in your own personality, because it doesn't have to look like mine. I remember last year in this competition I saw a super great idea from somebody who used text in between; it was a dialogue. It was like a theater piece that I was watching by looking through his analysis, and I thought, "Oh wow, this idea is amazing." You don't always have to come up with the visualization after you have the data set and you're like, "Okay, this is the insight, now how do I...?" Because usually this is the case: you have the
insight, and then, "How do I represent it in a way that I can share it with people?" However, that's not always the case. Maybe you already have, not really the insight, but what you want to share, and then you can create the insights on the basis of it. I'm going to give you two examples. For example, in this case, what I think made me win in this competition was the fact that I added a map.

It was a beautiful map, yeah. These ones?

Exactly. I can't remember the tool I used; I'm going to leave it below in the comments. But I used a tool very similar to Photoshop. The thing is, these are stamps, and I put them in by hand, one by one. So this was like a piece of paper, and I followed tutorials on YouTube on how to create this map. When I started this competition, before I knew what I wanted to analyze, before I knew the McDonald's scheme, before I knew anything, really, I was like, "I want to do a map." And it's not like it was an insane idea that I had during sleep or something. It was the fact that in summer last year I was browsing through some deep learning courses and I stumbled upon, on Google, an analysis that was like a map, a very cool gaming map. It looked super awesome. The guy who did the article split a piece of land into something like data analytics, machine learning, deep learning, neural networks and so on, so they were like areas within the land. And I thought, "That is so awesome, oh my god, it looks great, I want to do that." I knew the competition was going to start in December, and I was like, "I am going to draw a map." So I already knew the theme before I started everything. And then I was like, "Okay, treasure hunt, gold, money, hmm. I'm going to make a treasure hunt about the cream of the cream, the Grandmasters. Maybe I can spot them and be like, okay, what do you do to be so great?" So first it was the theme and then the
analysis. If you want, I can go even further, to the other notebook?

Please, please. This is really incredible. This is what I really want to achieve through the series: really understanding the minds of the best people in the field, just like you, and how they go about these things, trying to go one step beyond the original interview series. Please.

The thing is, I like artsy stuff and I like to make things beautiful, but in my brain I am not that creative, in the sense that I don't think I invent the wheel and then put it in the notebooks. I know it might sometimes look like that, but I'm really not. What I am doing, and what I'm constantly trying to do to stay humble and be like, "Okay, I don't know everything," is research. When you're browsing Google and trying to find people who are doing great visualizations, for example, you're going to find at some point, I know it's hard to find them, somebody who does something very great and inspires you. You're like, "Oh, I want to do that too." Or you see a map and you're like, "Oh, I want to do that too." Or this stuff, for example, putting in icons. I feel like I saw this somewhere; it's been too long and I can't remember the source, really, but I was like, "Oh, you can put icons for classes and make them fun," with, I don't know, pirate hats and so on. So it's just sticking to the theme, right? I am not doing anything out of the box, really. Just treasure hunt, pirates, map, gold, and by sticking to that I managed to do this, and I'm really super proud of this one. However, I didn't win first or second; I was third. On Kaggle, the people who have the most upvotes don't necessarily have the most out-of-the-box notebook. Usually super out-of-the-box notebooks are very hidden, and I saw that a lot of times, just because that person, maybe,
they aren't very active on Kaggle, for example. They did something super amazing, but because they aren't very active they didn't get much traction, other notebooks got stacked on top of it, and it was lost. So I hadn't seen the first and second place notebooks, even though I looked through many notebooks before the competition ended. And then, for me, the second place notebook was what got me very, very intrigued, because after doing this analysis I was like, "Okay, I have no idea what more I can do with seaborn, matplotlib and Plotly." Plotly is amazing, but I'm not that fond of it, I don't know why, personally. But I was like, "I want to get a step further." So I stalked Schubert, who has been, oh my god, he's the reason I kind of got to the next level. But not that easily, because I got scared at first. Okay, so this is his notebook, which is the second prize winner. Besides the fact that it is super insightful, it gets extremely deep and it's super interesting; if you have the time, check it out and see what he's done. He's using a lot of Plotly in it. What amazed me was this graph, where you can make it change, and I was like, "What kind of sorcery is this? How can you do that?" And then I clicked "show hidden code," and sometimes you get scared when you show the code. I saw HTML and I was like, "Oh, what?" And then I saw JavaScript, and I'm like, "What the heck is this?" And I closed it. I was like, "No, this is not for me."

Is this why you later learned D3.js, recently?

Yeah. I would have learned it much sooner, but I was like, "Oh, this is too complicated for me, no." And this was a mistake, and I encourage everybody not to do that. If you want to do something and it looks scary and hard, just do it. It's scary and it looks hard just because you haven't done it before and it looks super
foreign to you. So, fast forward to summer, the beginning of August. I realized I was not following what I was preaching, which is: you want to add something new every time, you want to learn something new, you want to put out something new in each notebook, at least the smallest thing. And I was like, "Well, I haven't been making much progress in the visualization part, and it's the stuff that makes me the most excited." So I went back to Schubert's notebook: what is he using, what is this? Oh, horrible. And then, I didn't tweet him, but I sent him a message. I was like, "Hi, can you please explain to me what you used?" and he was like, "D3." So this was the beginning of my D3 journey. He gave me some interesting courses on Udemy, and I did a course. It was challenging. And I'm going to tell you, D3 is absolutely amazing. You can do spaceships with D3, really. I mean, I can't even do a bar plot, so I can't do spaceships, but I'm assuming people can. However, for a data scientist... I don't know, you have been interviewing many, many people: do data scientists, from your knowledge, know JavaScript? I don't think so.

Maybe if they come from that background, but most of the people I've interviewed haven't.

Okay. So JavaScript, in my opinion, is more of a hardcore programming language. And D3, unfortunately, or fortunately, because it's so granular and this is why you can really do spaceships with it, uses JavaScript. So it uses JavaScript, HTML and CSS. HTML is for creating the place where you want your dashboard to go, because you can do dashboards, you can do anything really with D3. Then you have CSS for the beautifying, visually appealing stuff. And then the backend part, what goes into creating the structure of the visualization, is JavaScript and the library that uses JavaScript, which is D3. So like we are using matplotlib for
Python, you're using D3 with JavaScript. And you want to go into other stuff. We are used to Jupyter notebooks, for example, Kaggle notebooks; I went through more hardcore environments. I forgot the name, I think it's called an IDE, it doesn't matter, like Visual Studio. It was quite intense, but it got me to the next level, and this is what I want to show you, actually, because I already have a link. So this is a graph I found that uses D3, for example, and all the code behind it is available, so you can put your own data into it. What is amazing is, you are looking at this, everything is done in D3, and you are trying to think of some similarity with seaborn or Plotly or matplotlib, and it doesn't have any similarity whatsoever. These are hit songs and how many days they stayed in the leaderboard. The gray ones are kind of, "Oh, it dropped out after 24 days," this one after 47, and so on. It's spiraling.

We are getting beyond the pie chart, bar chart, box plot stuff, beyond Plotly, beyond just animated stuff. You can click?

Yeah, it's interactive. You can transform it into a bar chart, it's absolutely amazing. You have a button and it moves, and that animation is so seamless. I was like, "I want to learn how to do that," because it is insane. And then you can transform it into a radial chart again. And, ah, I want to show you: if you click, it sends you to Spotify, to the song. And again I feel like, "Wow, that's amazing." So this is super interactive stuff, and the person who did this did lots and lots more interactive things. And what's insane is, somewhere down here you have the data, which is just a simple CSV file with the artist, the name, when it premiered, first appearance, week and so on. So if you put a CSV here with the same format, you don't have to do this graph from scratch; you can just use this work. Of course,
just reference. I cannot say this enough: reference anybody whose work you're, not copying, but using. It's not a big deal, everybody does that, but just reference. This guy is awesome. So after seeing this, I just couldn't go back to seaborn. I was like, "No, I need to understand." And this is why I created this. This competition ended yesterday, and it was a struggle for me until the end, especially in the moment I realized that I was not going to finish this analysis as I wanted, just because the graphs were taking me too long. So this is COVID-19 Impact on Digital Learning. It was a very interesting competition where you could analyze what the state of the children really is in terms of engagement, how they are learning and what their thoughts are. This was very close to my heart, also because my mother is a teacher, so I talked with her a lot. So the research was done; it was just the charts. For example, this chart. If you want to use D3, you can safely copy this notebook, because it has everything you'll need, including some tweaks that I had to do in order to make D3 work in a Kaggle environment. But this graph, which I went back to a couple of times, took a week. It doesn't look like it; it's nothing, really. But I needed to learn JavaScript, HTML and CSS, how to put everything into practice, how to order them, deal with some errors, how to work with the console within the web page. It's not hard, but it took a little bit of time. So if you want to learn D3, you want to prepare some sort of patience. And then I went into some hardcore stuff, and I'm just going to show you some things that you can really do. For example, a bar chart, just a simple bar chart, can turn into something very interactive where you can see the numbers. I also added a line here; this is referenced, so I saw somebody use that. Then I went back
because i saw like this cool gradient apparently it's super easy to add the cool gradients to the bar chart and i added this too again you can add this interactivity to any plot and this one for example so this one i can actually show you the original because uh the wheel and so sorry to interrupt someone asked is this pre-recorded no it's not uh we're live we're literally live we'll get to the q a afterwards so please keep the questions coming this is not chai time data science the one where i record and premiere them it's live okay so great for the excitement um and this is i remember what i wanted to say so for this particular analysis i didn't come up with the like i first knew what i wanted to do like i want to make this wheel how can i incorporate this visualization into what i'm working on now so it wasn't the insight again it wasn't the insight and then the graph it was first the graph and then the insight um and i saw i remember i came into observable and i saw this wheel which again it blew my mind because it's something like it's not really a pie chart so this is equal area radial matrix of lgbt rights in the us so this is a wheel that you can move and i feel like oh okay and so for each state here they are split into southwest northwest midwest and so on so i feel like it's pretty easy to read this wheel you have um if marriage for example is like fully allowed or partially hospital business adoption employment housing so for these people if these let's say uh facilities or not really facilities but these areas let's say that all people have rights to if they have also maximum right or partially you have where the data is so where it's banned not clear is when it's gray and then you can also analyze so the amazing thing is you can analyze both between these characteristics let's call them and between states at the same time mind you in matplotlib in my opinion what you would have needed to do instead of just one
cohesive very very informative wheel was to do pie charts 50 pie charts or two four six seven pie charts and then in each pie chart to put like the state it's not that intuitive and it's not that easy to understand and the idea is to analyze the graph and draw insights so this wheel i was like i need to do this it's absolutely insane it took me one week to fully implement it with my data but i did it nevertheless i did it so this is the origin of the wheel and then i went in and went further for example this is another graph that i am very very proud of which took another week um uh on school timeline so this is the activity and then this is the engagement of the children in top states so you can see here so this also is in d3 right everything is in d3 so for this analysis everything i did was in d3 just to get a hands-on because you can do the courses and i highly recommend the courses actually so do the courses just to have a baseline but after you did the courses um you want to have a pretty good hands-on and to start digging and researching and understanding for yourself learn by doing it's the best way to learn really um so this is a stream chart this is called a stream chart and again i saw this and i was like oh i have to do this because it fascinated me the fact that you can so this is the time axis you can see here where the pandemic started in the us when the schools started to shut down really between march 16th and 24th i hope you can see it quite clearly and then here you have the summer break and then again the next school year and the fact that you can move this slider and you can see here so you can see on the left the date and then to the right the state which is here new york here is massachusetts illinois connecticut and so on and then how many page loads per 1000 students and again you can compare in time and see oh okay so they all had around the same trend you can't say oh new york dropped suddenly in summer of course it drops
in the summer it's summer nobody's studying it's so really crazy and then but then it didn't quite get back to what it was so but you can also compare between the states and say oh so new york had the most engagement compared to for example california which had a very low engagement so again this is a graph that has a lot of dimensions i would call them i don't know if it's very correct to say that but you can get a lot of information by looking at just one thing instead of having like multiple line charts that go all over the place and you need to look at the axis to see if it's a difference between them and so on um okay and if you are curious you can go ahead and look into it more deeply uh what i saw many people observe in one of my tweets and i saw you had it too sanyam um from this d3 course what i learned and i was amazed was that the guy said whenever you're starting to make a very deeply uh complicated visualization you want to have a notebook or just take a piece of paper and after you have the visual so for example i don't know here after i knew i wanted to do this hive map of usa and then here i wanted to put a bar chart just because i knew for example i think this is wisconsin here you have an engagement however you want to look here and see that wisconsin has only three districts so maybe it's not that representative however connecticut you can look at connecticut and you know it's extremely representative so i just drew on a piece of paper okay this is where i want my map to go this is the title i want to be here is the bar chart and so on so i made them in a sense that it's very easy afterwards to look at the paper and then create it having it visually next to your visual studio and this is again this is not my idea i found out about it in that d3 course i took and it's amazing it's amazing yeah it's like from what i understand there are like these steps like people like me give up just like getting
the plot done and like we give up there but like once you have the plots ready then you can go another step stylize them create a color theme last time we talked about rick and morty themes all those things you can do that then you get to the code you also comment it out put nice comments in there make sure it's up to quality then you go a step further where you like add arrows add annotations to the graph themselves beyond that uh this really is the like i think the highest step any kaggler has shown us where you like learn an absolute new framework that i don't think any data scientist uses i'm sure a few do but like not on kaggle and create these like visualizations that are totally interactive totally different like everything about this even this graph like i'm sure you put in a lot of thought into understanding the background there's like a gray background not a black one i would assume a black would stand out i'm sure you would have experimented at some time oh yeah picking the right shade of green i'm assuming a brighter green is what i would have gone with but like that might be just too annoying so you tuned it a bit then like in the graph the blue color uh gradient really yeah yeah yeah like all of these thoughts are like these you add another thing to the storytelling you add another thing you add another thing and like it's all of these incremental steps that i'm just learning about now yes and it's in my opinion so you don't have to be a visualization fanatic to do this you can apply this in all your work because this is how models are created or this is how more complex analytics are created and so on it's just building upon what you did and building and building and trying to for me at this point i felt like i already discovered as much as i could have discovered uh using the usual mainstream libraries and i just wanted to get a step further so this took me one month and a half to do the entire analysis and lots of nerves
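Editor's sketch: the walkthrough above describes driving d3 charts from a python notebook. The block below is a minimal illustration of that idea and is not the code from the notebook shown in the talk. It builds a standalone HTML page with a small d3 bar chart from python data. Assumptions: d3 v7 is loaded from the public d3js.org CDN, the function name `build_d3_html` and the sample numbers are hypothetical, and rendering inside Kaggle specifically may need the extra tweaks the speaker mentions.

```python
import json

# Assumption: d3 v7 from the public CDN; the talk does not name a version.
D3_URL = "https://d3js.org/d3.v7.min.js"

PAGE = """<!DOCTYPE html>
<html><body>
<div id="chart"></div>
<script src="__D3_URL__"></script>
<script>
  const data = __DATA__;  // rows injected from python as JSON
  const width = 420, barHeight = 24;
  const x = d3.scaleLinear()
      .domain([0, d3.max(data, d => d.value)])
      .range([0, width - 120]);
  const svg = d3.select("#chart").append("svg")
      .attr("width", width)
      .attr("height", barHeight * data.length);
  // one <g> per row, stacked vertically
  const row = svg.selectAll("g").data(data).join("g")
      .attr("transform", (d, i) => `translate(0, ${i * barHeight})`);
  row.append("rect")
      .attr("x", 100)
      .attr("width", d => x(d.value))
      .attr("height", barHeight - 4)
      .attr("fill", "steelblue");
  row.append("text")
      .attr("y", barHeight / 2).attr("dy", "0.35em")
      .text(d => d.label);
</script>
</body></html>
"""

def build_d3_html(rows):
    """Return a standalone HTML page that draws a horizontal d3 bar chart."""
    return (PAGE.replace("__D3_URL__", D3_URL)
                .replace("__DATA__", json.dumps(rows)))

# Hypothetical engagement numbers, for illustration only.
rows = [{"label": "state a", "value": 120}, {"label": "state b", "value": 45}]
html = build_d3_html(rows)

# To view it, write the page to disk and open it in a browser:
#   open("chart.html", "w").write(html)
# In a notebook, IPython.display.IFrame("chart.html", 450, 120) is one option,
# though front ends differ in what they allow.
```

Keeping the JavaScript in a plain string with placeholder tokens avoids escaping the braces that an f-string would require, which is one of the small frictions of mixing the two languages in one cell.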
however i realized yesterday when i kind of sprinted in the last few hours of the competition this was made yesterday for example uh i realized when i shut down and i was like okay this is it i can't do more it's what i could do in the time i had uh i don't think i can go back to like peacefully do a bar chart in matplotlib and be like very good you did it so i might add some sort of d3 because now i'm starting to like it because i'm starting to understand it but it wasn't like that in the beginning like as i told you it was so frustrating yeah it's like kaggle really brings out the best version of everyone not just because there's a leaderboard but like anyone who's especially in this like notebooks you're like creating a story so like anyone who's there like is trying to always like increment their approaches and this is a reason you have the rank of top 10 in the world it's not just a random number right it comes through all of this effort and like the community really sees it through in your kernel and we all like like i can probably relate to this graph as well i'm sure i would like just take the whitest color and put the title using that but this i know is like a little slight off-white if i may i'm like getting the paint done in my house that's why i'm like very much into what colors are but like it totally comes through the attention to detail in all of these kernels so yeah question everything question everything uh should we go into the questions should i what okay i'm gonna stop sharing then let me share my screen and get to them i'll scroll to the top and start from there um i think this has been answered so i'll skip that how do you come up with your analysis sorry go ahead um as i said so it's not second nature um if you have been inspired by something then you can implement that and start with building from there yeah schubert is there in the youtube chat uh shout out to you
we've just been looking at your stuff a little while ago oh it's schubert here schubert is here yes yes i gave you a big like again thank you so much like the fact that he implemented d3 is why i kind of got to the next level so go follow him he's an absolutely amazing human being thank you so much i said i feel seen in this comment uh david he's a good friend he thinks that young or aspiring data scientists don't understand the value of sql can you elaborate on your experience with sql and are those important in your opinion so i haven't been but this is depending on your work uh i have seen so there are many debates so if you are on twitter and you follow many data scientists uh you will see that there is a lot of debate on what is the best approach to machine learning or what is the best approach to do something or is kaggle worth it the answer is everything is customizable to you um and if something works for somebody else it doesn't mean it has to work for you in my opinion so i worked with sql in my previous job now i'm not working with it anymore just because the job doesn't require it we are working with unstructured data we are working with files so it really depends sql in my opinion and i think it's not used for example with images or uh text or something like that it's used in databases so if you are and i can answer you with this if you are a data scientist and you don't want to wait for other people data engineers or other teams or usually wait for a page to load to put in the you know the parameters to extract the file you want to go straight to the database because this is how you automate stuff or this is how you make things go much faster and on your own terms because you're doing this i can do it now i don't need anybody else to do it for me then you would need sql but if your friend wants to do deep learning for example it depends if he or she is using images or stuff so but i would
recommend for everybody to know a little bit of sql because you don't know when you'll need it just a little bit it's not that hard to learn really if you're writing uh the queries in pandas using df.query like you know an okay amount i think that's enough to get around okay coming ahead what is a simple way for like let's extend this question what's a simple way for anyone in machine learning to get into kaggle competitions to become a grandmaster oh a grandmaster in competitions i cannot pronounce because i am not qualified in any way to answer that but i can tell you a grandmaster in general so first of all the purpose shouldn't be to become a grandmaster and i've seen this and this applies in my opinion to all areas of our life if the purpose is to get a lot of money be a grandmaster be i don't know the purpose when the purpose is very very community focused and this is not like i swear to you this is not bullshitting or something but when you have contribution in mind when you're thinking how can i add to my community how can i help somebody else understand data visualization or this model in machine learning or how can they get a job or something anything like when you're thinking of how can i contribute and add more value not destroy add more value to the community or to this thing i am working on all the benefits will follow i 100 percent guarantee you so thinking okay i need to do 20 notebooks and i have to ask 100 people each day to upvote my notebooks you're going to become a grandmaster eventually but you are going to become really aggressive but if you're having contribution in mind and adding value um that's the moment when you're gonna have all these benefits that are gonna come so really if you are passionate about competitions this is how you get into kaggle you go ahead open a competition that sounds nice and then you start understanding learning look through the notebooks other notebooks and try to create
something yourself if you want to create kaggle notebooks just go ahead and do some analytics on something you're very passionate about uh maybe you're passionate about sports you want to make an analysis of sports and if the analysis is very very good it's going to get traction at some point now not always so you want to also be very resilient yeah i remember in like even in a previous interview you mentioned you were happy by the fact that you didn't initially get a good traction and you were like quite focused on the process just to echo on that like i would also quite get frustrated especially like with like it's a stupid thing to compare to so i apologize but like with the podcast it's like a little number that gets thrown in your face this is the number of people that watched it but like for anything in that matter that's like similar to that number of upvotes how many views do you get on a blog number of views on a video like if you optimize for that it's not going to be fun like i was there to like just learn about your journey learn the journey about like all of these grandmasters and reminding me about that like eventually got the podcast to like the top podcast in the community like that was never the goal it did get to it but every day like it becomes very frustrating if you're always staring at that because there's the instant gratification that we had and i would be a hypocrite to say that i at some point was like why am i not getting upvotes like i'm so proud of this notebook i worked so much i don't know if you saw the d3 notebook that i created that i worked my ass off it has like 80 upvotes in comparison to i don't know other notebooks that i didn't work as much and they have like 500 upvotes now i'm not gonna question anything this is the algorithm maybe people didn't see the value it doesn't matter but like you said if it has five views of course it doesn't have thousands or hundreds or something but these
five views were people that you helped in some sort of way and who says helping five people is a bad thing like really so a constant reminder would be very very good yeah we also have laura fink in the youtube chat thanks for joining us laura oh laura yes yes oh hello laura okay everyone wants to learn your secrets no i want to know laura's secrets like she's absolutely amazing okay um i can skip these two because we've answered those um also kaggle master every kaggler for the first time is showing up uh do you ever find the notebook format creatively restrictive if you had a blank canvas what would be your chosen format oh this is a very good question so the notebook format in kaggle or in general i think they're talking about kaggle or maybe just jupyter notebooks both are the same thing in a way mostly so in my opinion i am inevitably thinking of d3 and the fact that uh for example the way i imported the code for d3 was very very uh it was very restrictive and very hard and for example for people that never coded in javascript um when you call a variable it's not x equals to 10 it's var x equals to 10 or const x equals 10 and so on and if you forget the thing before in visual studio it works uh and on my local machine it worked but whenever i put it into kaggle i would get an error and the entire notebook would shut off and then you need to rerun everything and then other things that i had to do little tweaks but really my partner helped me because i have no technical background whatsoever and he helped me a lot because i couldn't get kaggle to work in this case i think observable for example so if anybody wants to use d3 you can go on observable and they have exactly the same format of notebooks with markdown and code and so on but you can implement d3 very easily and then you have a designated part where the chart updates you don't even have to refresh it updates
automatically every time you put some sort of code so i think for example if kaggle and maybe i can actually start a discussion about this if kaggle could make it easier i have no idea how and if they are technically restricted in some ways but if they could implement an easier way for us to use d3 uh in the notebooks i think many more people would start learning it and i think there are three notebooks in total three or four notebooks in total that used d3 at the moment on kaggle it was one for beginners i forgot who i am very sorry um it was one from schubert and it's this one from me and the fact that it's so hard to make it work it's a showstopper oh it's not a showstopper it's a barrier in breaking in yeah it's a very good question very good question okay we have like one or two more questions maybe we can come back to them afterwards uh do you want to like uh go to the last experiment a little bit we had discussed uh the nfl or oh okay so just to give everyone some context i asked andrada to like pick a competition she's not looked at all at before uh and like walk us through how she would for the first few minutes look at the data and how would she approach uh eda or like just the data broadly okay this competition started eight days ago i was still in the other one and i was still struggling with the other one so i knew when it appeared i wanted to take advantage of it because most of the competitions have a leaderboard and they are like normal competitions with very very smart people participating but because i am super passionate about eda although this competition is not really about data analytics and visualization it's much deeper but it is still some sort of analytics uh it has some sort of analytics aspect to it and i was like okay after i'm done with this competition i really want to start in the nfl one so i haven't looked
at it i watched a video i'm not gonna lie about what nfl is because i'm from romania which is a country in europe we have soccer and i don't know how people play soccer either i like basketball i know some stuff about basketball but about nfl i have no clue so i watched the video but it didn't shed any light it's still very very weird to me but so why i chose it let's start with that one uh and maybe this is how would be your day-to-day recipe um so i chose it because first of all i want a change of theme i did for one month and a half education it was super amazing but i wanted to change it so this is why data science is so amazing you can change the field the playing field uh no pun intended to whatever subject you would like so sports i've never done i think a notebook on sports and again i want to take advantage that this is kind of an analytics competition and i want to learn about american football so here i am um good first thing i would do and i usually do is read this many times like three four times to understand uh what's the competition about then i go to evaluation and understand what's the like what's the requirement so special teams metric i understand the special teams are some players that come in the game maybe a fan of american football can explain better in the chat i need to research more but i would understand what are the requirements and then so before doing anything else i would jump and so this is my approach but any other person might do it differently so don't take my word for it but this is what i do i jump straight into the data and i just browse okay so first of all mental note it's csvs so it's not images not audio files not numpy files or something like that so it's just plain csv files we have multiple files we have years so it's tracking on multiple years we have like time series maybe something like that we have the plays the players the games pff scouting data i have no idea
of what pff means i need to come back to this and then i just here i just browse because in this part i get in more depth so this is quite a pretty big uh description of the data uh for a competition in this part i usually get when i am opening a jupyter kaggle notebook and i'm starting to look at the data so here i'm not producing afterwards what i do is research in my case the research is pretty lame because at the moment i first need to understand what so first of all what are the rules of american football how do you play american football and then what are the rules of the championship what are the teams i understand there are like 32 teams divided in 16 america i have no idea i have no idea at all so yeah so the idea the reasoning behind it because in my opinion in this competition from what i can say from the 10-minute video i watched today about american football is the fact that so it's the way you play in one game but it's also how the games unfold and it might be like in a chess game for example when sometimes you sacrifice pieces in order to win so in my opinion how this competition might need to be approached is not on a winning game by game basis necessarily this is why i need to understand how they are unfolding but winning in the overall picture uh because at basketball if i remember correctly if you lose you're out so but here as i understand it's not the same rules so understanding the rules understanding data and then i would open a jupyter notebook when i would have some free time i love to make the cover so first of all i usually make the cover so that comes first obvious yeah because i'm like i want to make a commitment making a cover is a commitment it means the notebook is going to come out like no matter how hard it is okay so yeah i usually at the beginning make the cover just to be everything pretty and to have peace of mind and then i would start looking at the data and
then going back and forth with the research data research data research until i understand something what i will for sure do in this case because i have no idea what to make of the bull is i don't know if i told you um is to make a lot of schemas because again i'm a very visual person and i feel like most people understand information when it's visual not written visual if it's a video with sounds and noises and stuff it's much better but i am yet to go there but schemas in my opinion are the best way and oh uh i usually say this i think i said this too many times but if you want to make a great notebook you want to make it for others like you'd teach others to make it yeah going back to the contribution also that we need to have in mind because when you are thinking how can i teach this how can i explain this to another person who has no idea what american football is you are gonna get in so much depth within your brain that you're gonna understand it very very clearly but if you're doing this only for yourself you're gonna step over a few steps in between that might need to link into the entire image now um this would be like the first three four days uh in the analysis what i would do in the first three four days yeah beyond that i have no idea because i haven't seen the data and i don't know what american football is yeah i'm like just curious on that today uh people like me don't even know all of the graphs but like let's say you've come up with a schema and then you're like uh figuring out how to show it how do you like go about into the details of picking oh this is more of a v plot let's say this is more of a histogram how do you like decide on all of that i didn't quite get the question so if i have a histogram how do you decide to present the data once you like have things in place and once you have like stored in your mind i think so the question is how do i decide
what graph to use when i have the insight in mind okay um i think this comes oh it's going to be such a lame answer but i think this comes with experience unfortunately uh meaning that you're gonna do and actually i can give a recipe so you have some sort of insight um for example let's say you have two teams team a and team b and you have the scores yeah okay so the score for team a and the score for team b uh for 10 years so you have a few dimensions you have the team you have the years so it's through time and you have the score three things how can i go about it and you can go about it through trial and error so you already have the data you already kind of pre-processed it you brought it to a nice small table that you want to plot now a first idea that comes very quickly to mind would be to create a bar chart like a bar chart which you have here every year here is the score and then here you have team a and team b and you can see how the scores in different years look okay but does it really showcase the difference does it really accurately show and make you think about that difference because if you look at the bars one bar can be like this is a bar and this is the other you need to compute the difference between them and also you want to look at the score so maybe team b had an evolution through time but the difference between team b and team a got bigger and bigger through time how do you you know these are all these little details what i would recommend now that i've seen many methods to plot and this is where experience comes really you want to research as much as possible and see as many examples of plots as possible is to for example show a dot plot where can i share my screen just one second please please because i feel like me explaining this it's a funny attempt like this because this is why i thought about it so you can see the time and you can see the difference much more clearly and you can also compare for example if
i have multiple other groups you can compare but you can see of course so this is lower than this because you have here the axis from 0 to 4.5 okay but you can also observe the differences and the color also helps you see that the green like let's say team a was much lower in score all the time than team b but oh here they were so close it was such a tight game and so on so this one in my opinion in this case is a much more informative graph than a bar chart but for you in order to use this you must know about it and this is you know where you want to make your research and just explore yeah gotcha i think we have time for one question but again thanks for that like detailed walkthrough i'm sure like i'll go back and try to copy that uh i don't foresee any success but like i'll try that i'm sure others will find it more useful uh i think we can squeeze in one question so uh here's a controversial one uh david asks d3 is more suited for data journalists and not data analytics in business settings uh what you really want is like something practical that can get things done right i mean like maybe i can answer first to that point kaggle is also like really experimental like you don't see that data every day at work like it's much simpler than that most of the times unless you're like in a research environment and like the reason you're trying all of this stuff is because kaggle is a playground so you get to try those things so it's useful in that way um you're right and i would like to add upon this so i see how d3 is a complete waste of time for people that are doing competitions for example like hardcore competitions d3 is a complete waste of time and i wouldn't recommend to anybody to do d3 again going back to the example with the bar chart you don't have to make this awesome graph with bubbles and annotations and so on if you can quickly do a seaborn or matplotlib
bar chart where you show the differences and you as a data scientist understand everything and you're like oh i know what i need to do you can go ahead and skip and because i remember i saw a notebook from a grandmaster he's like top 10 in competitions like super hardcore and his notebook was so insightful but the visualizations were like the quickest matplotlib things you can think of but he doesn't need something more because he needs just the insights what i love about data visualization for me is the fact that i would like to share it for example with my mother or with a co-worker or just in general to be able to be like like he said like a journal like a notebook an analysis so i know that d3 for people in competitions is not going to be very helpful but for people that are doing analytical competitions or for people that are just like you said experimenting and doing some playground on kaggle and really the fact that i created the notebook of d3 on kaggle is to show to be another resource for people to learn because i didn't have to get into that much trouble to integrate these reasons you know but it's a process so i understand their opinion but i feel like it depends on the case again it's super customizable yeah content also by definition is open so like now that you've created it who knows like maybe everyone starts doing it and then you like again need to level up and that's how like kaggle and pretty much youtube also looks like some youtuber does something crazy and now like everyone's doing it so like that's normal now you need to do something even more [Laughter] exactly and this is what i said about doing outstanding work and kind of doing better than the ones that did last year because this is how we create good stuff it's the fact that we innovate let's say and this is why we're humans with a brain i feel like this has been an awesome interview walkthrough of
understanding your greatness i point the audience to your uh twitter profile at andrada olteanu with w i always get confused by that it's there in the youtube chat also in the top uh resource of the forum so follow her there you can always find her notebooks as soon as they come out or other i think there was a gif i recently saw about the yes this one for the d3 plot that you were working on yeah and you can also follow her on kaggle or find her profile there just by searching her name uh you'll see more areas getting golden really soon and these numbers going up so check out her profile andrada it's always always a great learning experience for me to talk to you and for sure for the community as well so thanks for sharing these insights thank you so much for having me thank you so much and thank you to everybody who joined and put the question thank you so much thank you thanks everyone
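Editor's sketch: the chart-choice recipe in the q and a (two teams, ten years of scores, bar chart versus dot plot) can be tried with the mainstream libraries mentioned in the talk. The block below is a minimal matplotlib illustration under stated assumptions: the scores are made-up numbers, the function name `dot_plot` is hypothetical, and the 0 to 4.5 axis simply echoes the range the speaker reads off her example. It is not the chart shown on screen.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical scores for two teams over ten years (illustrative only).
years = list(range(2012, 2022))
team_a = [1.2, 1.5, 1.1, 2.0, 2.4, 2.2, 2.9, 3.0, 3.1, 3.3]
team_b = [2.0, 2.6, 2.8, 3.1, 2.5, 3.6, 3.8, 4.0, 4.2, 4.4]

def dot_plot(years, a, b):
    """One row per year: a dot per team and a segment spanning the gap."""
    fig, ax = plt.subplots(figsize=(7, 4))
    lows = [min(pair) for pair in zip(a, b)]
    highs = [max(pair) for pair in zip(a, b)]
    # The connecting segment lets the reader see the difference directly,
    # instead of mentally subtracting two bar heights.
    ax.hlines(y=years, xmin=lows, xmax=highs, colors="lightgray", zorder=1)
    ax.scatter(a, years, color="seagreen", label="team a", zorder=2)
    ax.scatter(b, years, color="steelblue", label="team b", zorder=2)
    ax.set_xlim(0, 4.5)
    ax.set_xlabel("score")
    ax.set_yticks(years)
    ax.legend()
    return fig, ax

fig, ax = dot_plot(years, team_a, team_b)

# Per-year gap between the teams, the quantity the segments encode.
gaps = [round(b - a, 1) for a, b in zip(team_a, team_b)]

# fig.savefig("dot_plot.png") would write the chart to disk.
```

In this made-up data the fifth year has a gap of only 0.1, so its segment nearly vanishes, which is the "here they were so close" reading described in the talk; a grouped bar chart of the same table would show two almost equal bars without drawing attention to the gap itself.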
Info
Channel: Weights & Biases
Views: 1,071
Rating: 4.8947368 out of 5
Id: vIL-I3D-Zdw
Length: 91min 30sec (5490 seconds)
Published: Fri Oct 01 2021