Effortless Data Analysis and Cleaning with Data Wrangler in VS Code

Captions
Did we just count down twice? Is it just me, or is there a glitch in the Matrix? I felt like there were two countdowns there. What is up, everybody, how are you? Welcome to the VS Code livestream. We're doing effortless data analysis and cleaning today with Data Wrangler, an extension that you may or may not have heard of, and it's pretty rad. "We want Mew" — Jeffrey Mew is here, so you're in luck. And the double countdown was my bad, I accidentally rolled the release. There are two people behind the scenes, Doug and Peggy. What's up from India, nice to see y'all.

Well, listen, chat, today we are doing effortless data analysis and cleaning with Data Wrangler, which is an extension we don't hear a lot about, so we're going to do an entire stream on it, because it's quite interesting and it pertains to pretty much everybody. I think you'll find a lot of value here no matter what you do professionally as a dev. In fact, chat, tell me what you do. Are you a front-end dev, or do you consider yourself a full-stack dev? Are you a C# dev, a Java dev, a Python dev, a Markdown dev? Is that a thing, can you be a Markdown dev? That's me, I'm a Markdown dev. What's up from the UK. Raleigh, thank you for being here.

By the way, in case you didn't see, last week we had VS Code Day, our annual conference, and it was awesome. If you missed it you missed all the fun, and we missed you, but you can check out all of the videos on our YouTube channel, youtube.com/code. They're all up there, and you can watch them at your leisure, or at 2x normal speed if you're one of those people. Are you one of those people who listens to things at 2x normal speed? I try this with podcasts and I can't understand a word; I go the other way and listen at 0.5 speed. I want it to take more of my time. So check out VS Code Day — all our recorded videos are up there.

Also, friends, Microsoft Build is coming up at the end of this month, and I'm going to be there doing a session on GitHub Copilot. Jeffrey, are you going to be there? Y'all can't see Jeffrey, but I can. Jeffrey, give me a thumbs up if you're going to Build. Yeah, I got a thumbs down, so Jeffrey's not going to be there, but you should be. Join us: it's in person in Seattle, the first time back in four years, and it's going to be awesome. If you can convince your boss or your company to get you out there, do it — AI is a big deal, you may have heard. Hopefully we'll see you at Build.

All right y'all, it is Thursday, and that means TikTok throwback, or throwback TikTok, it's one of the two: "...multiple instances of Visual Studio Code — sometimes it gets confusing which environment is for what. Just download and install the Peacock extension, select Settings, and under Workspace specify what color you want your environment to have, such as green, and now it's easier to distinguish your environments." There we go. A lot of people use Peacock, a very popular extension, and now with profiles you can set it per profile, which is pretty nice.

Okay, so with that, let's get to the point of why I called you all here today, which is effortless data analysis and cleaning with Data Wrangler. Data: the most exciting subject in the world. Please welcome to the stream Jeffrey Mew. What's up, Jeffrey? Hey Brooke, hey everyone. And I know I asked you this before, but for chat:
should chat call you Jeff or Jeffrey? Some people feel strongly about this stuff. Ironically, I'm one of those people who doesn't really care too much; I've been trained all my life to respond to both, so whatever you say, I'll answer to it. But I noticed you put "Jeffrey" in your lower third. It's my legal name, so, you know. Fair enough.

So, Jeffrey, where do you work? Coincidentally, I work at the same company as you. I work on the AI and data science tooling for the VS Code team, which is why I'm here showing off Data Wrangler — which is exactly that, a Python data science and AI tool.

Very interesting. Just for the edification of chat, really quickly: how does one get into the field of data science and AI and end up with a job at Microsoft? That's a great question. I did a few internships in the data science field. I was originally a software engineer, so I was trying that, trying some PM skills, trying some data science and AI stuff, and I felt like the intersection of PM and data science and AI really spoke to me. I actually interned on this team at Microsoft and came back full time, so it's kind of a cool story about how I ended up back at the same place where I interned.

Very nice, so you came in via internship. Yep, exactly. Awesome. All right y'all, so get yourself an internship at Microsoft. How do people do that — you just apply, right? Yeah, for sure. Somebody told me the other day, I think we were doing a TikTok live, "I'm not going to apply because they'll never pick me," and I said no, do it, apply, because you never know — I've seen people picked from all over the place. The worst you can hear is no, and that's not a bad thing to hear. So apply for an internship.

Okay, that's not why we're here, Jeffrey, I'm sorry, a bit of a tangent. We're here to talk about data analysis and cleaning. It's not what I'd call a particularly exciting subject, but it's something we all have to do, and I certainly remember doing a ton of it at my job prior to Microsoft, when I worked in retail — we had a ton of data coming in from our retail locations, so we spent a lot of time on this. What is Data Wrangler, and what are we doing today?

You hit the nail on the head right there. As basically every data scientist knows, things like exploratory data analysis, data cleaning, and prepping your data set for the model are things everyone has to do, but they're the most time-consuming and, from what we've seen, the least enjoyable parts of being a data scientist. On the job, someone will often tell you, "hey, we have this data set, go tell me if there are any patterns in it so we can use it to predict X," and more than 50% of that time is spent just cleaning and making sense of the data, leaving less than 50% for the more important modeling and analysis work — the more glamorous work, some would say.

So would you say this is only for data scientists, or is it applicable to everybody? It's applicable to anybody working with tabular data. That's something like a spreadsheet in Excel, or pandas DataFrames you want to make sense of — anyone working with any sort of data. You don't have to be a data scientist; it can also be a useful learning tool, as you'll see in a bit. Basically, if it's data you can put into Excel, this tool is amazing for it.

Yeah, absolutely, and I think every developer knows what it's like to work with a data set:
it's either not in the right form, or it's not the right data type, or you're trying to join data together to get the final data set, or you've got to clean it and transform it — and that's tedious. So let's take a look at Data Wrangler and see what it does for us.

For sure. This is something we heard a lot about from data scientists using VS Code, which is how Data Wrangler was born. As you mentioned, hopefully some folks in the chat have heard of it, but you may not have, because it was released less than a month ago — fresh off the press on the VS Code extension Marketplace.

A quick highlight: Data Wrangler is a free exploratory data analysis and data cleaning extension for VS Code. It puts all of your data into an interactive, Excel-like interface, which you'll see in a moment, making it easy to interact with and manipulate your data via the UI to get additional insights. But the key bread and butter, and why it's in VS Code, is that it automatically generates Python code for you as you explore and clean. You'll see exactly how this works; the goal is to speed up this redundant, time-consuming process and make it more interactive — and more fun — for data scientists out there.

Awesome. Before we get started with the demo and the deep dive into Data Wrangler, let me quickly cover setup. You'll need your favorite editor, VS Code, installed, as I have here. Then go to the extension Marketplace on the left-hand side and search for "Data Wrangler." It should be the first one that pops up, and you'll know it's the legit one because it's made by Microsoft. Just install that, and that's all you need to get started. When you first click install (I already have it on my machine), you'll be shown the Data Wrangler welcome page, which has everything you need to get started successfully — just go through the steps there.

For the stream, I'll walk you through a quick demo using the sample Titanic data set, which is built into the extension. So if you want to follow along and have the extension installed, but don't want to find a data set yourself or don't have anything to test with, you can use the built-in default.

By the way, I'm following along as you do this, because I've never used this extension and I'm curious. It looks like it's in preview? Yeah, it's in preview right now, but it's in a stable state, which is why we pushed it to the VS Code stable Marketplace. It's been up for almost a month and we've seen tremendously positive feedback, which is why I want to show it off.

Okay, perfect. So once you have Data Wrangler installed, you'll see this welcome page. I'll be using the Titanic data set for this demo, but if you want to bring your own data set, it's super simple. If you have tabular data — you can see in my Explorer I have a lot of different CSVs to test with — you can just right-click a CSV and you'll see a new "Open in Data Wrangler" entry in the context menu. Click that, and that's one way to open it directly.

I was about to ask, out of curiosity: what is the Titanic data set? The Titanic data set is one of the "hello world" examples of the data science community. It's a data set of all the passengers who were on the Titanic, and it contains a lot of information about them: their name, their age, what class they were in, and of course whether they survived or not. It's a really great learning tool and starting data set for data scientists to play with and analyze.

Gotcha, okay. So that's how you open it from a CSV file, but the cool part is its integration with Jupyter notebooks. I have a notebook open here with the Titanic data set loaded, and any time you do a `df.head()` — which, apart from reading the CSV, is probably the most common pandas call — you'll see a new "Launch Data Wrangler" button built directly into the notebook. By default, any time you print a table you get the pandas/Jupyter rendering, which is just a small, non-interactive snippet so you can see what's there. But with the Data Wrangler extension installed, you'll see this launch hook, so this will also get you into Data Wrangler. And that's in any Python cell, right? In a Python cell, yes, correct.

Okay, so let's jump in. Like I mentioned, I'll be using the Titanic data set that comes with the extension as a hello-world example. Once Data Wrangler is launched — you just saw it launch super quickly — you have a completely isolated sandbox environment where you can explore and try out various data cleaning operations without any fear of overwriting the original data set.
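For readers following along in code, the notebook entry point described above comes down to a couple of pandas lines. This is a sketch with a tiny made-up stand-in frame (the demo loads the real Titanic CSV, e.g. with `pd.read_csv`); `df.head()` is the call whose notebook output gains the "Launch Data Wrangler" button once the extension is installed:

```python
import pandas as pd

# Tiny stand-in for the Titanic data set; in the demo this would be
# loaded with something like pd.read_csv("titanic.csv").
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Allen, Mr. William"],
    "Age": [22.0, 38.0, 35.0],
    "Survived": [0, 1, 0],
})

# df.head() shows the first rows as a static, non-interactive table;
# with Data Wrangler installed, this notebook output gets the
# "Launch Data Wrangler" button.
print(df.head())
```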
One of the issues we've heard from folks is: if you're cleaning your data in a notebook, you always have to create a new variable or work on a copy, because you might delete data you didn't want to. It's exploratory data analysis for a reason — you don't really know what the data is, and you don't know what the side effects of an operation will be. So we put this in a sandbox environment where you can go at it to your heart's content without any fear of doing something wrong. Everything is undoable, and nothing is binding until you save it at the very end.

So for really large data sets, is this all just in memory, not in a file somewhere? Right now we're using pandas as the backend engine, so any data frame or data set size you can load with pandas in a notebook on your machine should be loadable here. But, sneak peek: we're adding PySpark and sampling support in a future update, which should handle even larger data frames that wouldn't otherwise fit. For now we support pandas DataFrames and CSVs, to keep it simple.

All right, awesome. I've got it open on my side, on the same screen you are, so I've made it this far. So let me quickly walk through the UI elements so everyone can get familiar — everything may look slightly different because we jumped in from a notebook.

The first thing you'll see, front and center, is what we call the Data Wrangler grid. Here you get the full values of the data set. In a notebook, `df.head()` shows a small window of only five rows, and it's non-interactive; in Data Wrangler you can scroll to your heart's content, and as you mentioned, even a million-row data set would load effortlessly. You can just scroll up and down it, which makes it easier to see all your data in one place.

The most important parts, I'd say, are what appear in the headers of each of these columns: what we call quick insights. Their role is to give you the most important information about your columns at a glance. There are the column data types — if I hover over these symbols it tells you what each one is, like "object," which is roughly a string, or a numeric type, which tells you it's a number column; different column types have different symbols. The other really key piece, which every data scientist will want, is that we show the number of missing and unique values, so you can spot problematic columns and missing values you'll want to deal with. And the cool part is we also show visualizations: if a column has a distribution — this Age column is a great example — you can see the distribution of ages in the Titanic data set, and for a categorical column you can see the frequency of the values that appear.

Wow, I didn't realize so many of the people on the Titanic were between the ages of 16 and 30. Right, though life expectancy was probably lower back then, so maybe that was the norm. When was it — the 1910s, I think?

So, chat, a few things while we're doing this. Tell us: one, when did the Titanic sink, and two, what was the life expectancy at that time? Because now I'm curious whether the Titanic was just full of young people — was it like a spring break cruise? It kind of looks like that. And you'll see, as I do more analysis, you get to answer a lot of interesting questions about the Titanic data set that you maybe never thought about.

1912, chat says. Okay, 1912 — you were closest, Jeffrey, I believe you win, you said 1910. And somebody look up the average life expectancy; what is it today, like 76-ish? You can see here the quick insights even give you the min and max, so we can see the oldest person on the Titanic was 80 years old. Well, I mean, 80 years old on a cruise that long... Chat says the life expectancy was 51 in 1912.
Wow, I didn't realize it increased that much between 1912 and today. Holy smokes — modern medicine. No kidding. So we're still skewing lower than the average — well, that's the average life expectancy, so I guess we're right at the midpoint, aren't we? Exactly, yeah.

Okay, to continue: we saw the quick insights. The next part of the screen, on the left-hand side, is what we call the operations panel. This is, like I mentioned, the bread and butter. Data Wrangler has a bunch of built-in operations — you can see a few samples here, like drop missing rows, drop missing columns, et cetera — and this is where you actually perform the data cleaning: you select your operations and see their direct effects on the data. I'll get into that soon; for now I just want to highlight the UI elements so people are familiar. Let me zoom in a little.

The next part, over on the right side, is what we call the summary panel. It may appear on the left-hand side when you first install, but we recommend dragging it to the right-hand side — the secondary sidebar — to keep things cleaner and easier to use, especially since in VS Code all the panels are resizable. That's how I like to set it up, but folks can arrange it however they want. By default it shows the shape of your data set, so you can see there are 891 passengers in the Titanic data set. If you select a specific column, it gives you more information about that column: for Age, because it's numeric, that's things like mean, median, and percentiles, as well as some advanced statistics like kurtosis and skew — detailed statistics about that one column.

The last two elements on screen are the cleaning steps and what we call the code preview panel. These will be much more evident in a minute once I get into the actual demo, but the cleaning steps panel keeps track of everything you've done — if I drop a column, it shows up here; if I run some other operation, it shows up here — and it's where you manage, edit, and delete your steps. And the last part is what I alluded to at the beginning: code generation. The code preview shows you the code for what's actually happening in the background, or you can write the code yourself. This is where the bread-and-butter code generation feature lives.

So those were the five elements I wanted to showcase in Data Wrangler. Now that we understand them, let's jump into the data set. Yeah, let's do something with it!

With the Titanic data set, as a data scientist you'll typically first want to figure out the missing values: which columns are missing values, since those are most likely the problematic ones. In the summary panel you'll see a missing values section; expand it and it tells you which columns have missing values and how many. At the very top there are 891 rows, and the thing that jumps out right off the bat is this Cabin column: 687 missing. That's a majority missing, and you can see it right here in the grid.

So, as a data scientist, what do you do there — just drop the whole column? Everyone has their own methods, but I'd say most data scientists would probably drop it, because — and this is the benefit of Data Wrangler, we can just look at the data real quick — these values don't carry much recoverable meaning. I could be wrong, and that's the thing: with Data Wrangler, even if I drop it now, I can always bring it back if I made a mistake. But most likely we'll drop it, because something like 70 to 80 percent of the rows are missing. Yeah, and it's not like you can compute an average cabin to fill in.

So how do we drop it? Like you saw earlier, there are operations in the panel, and we could do the drop from there, but we also built the most common ones into the column itself. I can bring up the context menu on the column and click "Drop columns." Where did you click? Oh — drop column, at the top of the column; the target column is Cabin. And here's the bread and butter of Data Wrangler: when you open an operation, it tells you what's going to happen in the operation panel, and the key part is that it shows the code that's actually being run. It's not a black box; it's very transparent about what it's doing, so you can trust that Data Wrangler will do the right job. That makes sense.

And the best part is, for anyone familiar with Git, this is very Git-esque: you see a diff preview of your change before you commit it. Maybe you did something wrong — drop columns isn't the best example, but if you did a filter and picked the wrong value, it shows up immediately here. You get the diff before you choose whether to apply or discard. Here everything looks good because it's a simple drop, so I'll click apply, and now "Drop columns" appears in the cleaning steps, so you can see what I did and how I got there. And the Cabin column is now gone.

A quick question from Natalie: is it possible to export the quick insights of a given column, for putting into a slide in a presentation? You mean the column visualizations at the tops of the columns? That's actually a great question. Because we only just released, we don't have that capability right now, but it's on our backlog, so if you follow Data Wrangler you'll hopefully see that feature in one of the upcoming releases. Right now we can only export the code and cleaning steps, but the ability to export visualizations should come in a future update.

All right, what are we doing next? Next we'll look at the Age column. Actually, quick question: do you mind if I zoom out a tick, just to make it a little easier to see?
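Before moving on, the two steps just demonstrated — checking which columns have missing values, then dropping the mostly-empty Cabin column — correspond to a couple of pandas calls, which is essentially what Data Wrangler generates for you. A sketch on a small made-up frame (column names follow the Titanic set):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 38.0, 27.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
})

# What the summary panel's missing-values section computes per column.
missing = df.isna().sum()
print(missing)

# The "Drop columns" operation on Cabin; Data Wrangler generates
# essentially this line.
df = df.drop(columns=["Cabin"])
print(df.columns.tolist())
```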
Yeah, okay. So now you can see a bit more of the data for this demo. Age has 177 missing values — a decent amount — but Age is really important, because we want to use this data set to understand what factors affected whether folks survived. Age has 177 missing, but it's important enough that we don't want to just drop those rows. So Data Wrangler has operations that, instead of dropping them, compute a value. Let's go into the operations panel and search for "missing," and you'll see the operations we have. We could do a drop, like I mentioned, but here we'll fill with a custom value: the mean age, the average age of the passengers.

Again, this gives you a great Git-style diff, so you can see exactly what's happening, and this is great for both experienced and beginner data scientists, because you can explore: what happens if I fill with the median? With the mean? What actually changes? It's so cool, because normally you'd have to do this with data frame API methods, and I'm always forgetting — what is the freaking syntax for this again? That doesn't help me solve anything; I'm not even at the point of writing real code yet, I'm just trying to get the data into the right state, and I can't remember the API. Exactly — that's one of the main pain points we heard from customers, and why we built this. You can see I haven't had to go back and forth to Bing or an internet search to check the documentation; it tells me right here, and I learn on the spot.

Now, the mean comes out as a float, so it's not a super nice number. But because this is a code-first, code-centric experience built into VS Code, you'd expect to be able to code it — and you can. If I want to cast this to an integer, I just type it in, and as I'm coding it makes live changes here: I added the cast to int, and now it's no longer that nasty float, it's a nice integer. And everything you do here changes live in the data frame too. But does it only change the first occurrence? Oh no, it's all of them — all the filled missing values get cast to int. Okay.

So this looks good, and we'll click apply again. As I clicked apply, you can see the exact change that happened, like looking at your Git history or git log. Last quick thing: as we saw earlier, Embarked has two missing rows. Two out of roughly 900 is less than one percent, so we can just remove those rows — drop the missing values. If I click apply, you'll see there are no more missing values at all. So we've now exercised three different strategies: dropping a column, computing missing values, and dropping rows.
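In pandas terms, the fill-and-cast step and the row drop described here are a couple of one-liners — `fillna` with the column mean, the hand-edited `.astype(int)`, and `dropna` restricted to the Embarked column. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Embarked": ["S", "C", None, "Q"],
})

# Fill missing ages with the column mean; chaining .astype(int) is the
# hand-edit from the demo that turns the float into a nice integer.
df["Age"] = df["Age"].fillna(df["Age"].mean()).astype(int)

# Embarked has only a couple of missing rows, so just drop those rows.
df = df.dropna(subset=["Embarked"])
print(df)
```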
and children first right of right yes that's what you always see in the movies is that like exactly yeah of the Titanic is that they're trying to get the women and the children and the lifeboats I don't know how accurate that is but let's find out yeah let's actually find out um so to quickly do that it might not um if you're just trying to write this with code it might not be super intuitive right you might be like oh what's a good place to start you probably have to go search online um but here we can actually use the built-in operations so we'll just do something uh we'll do something simple let's Group by um using the sex column on the survivability so you can see here survived of one means they did survive survive the value zero means they did not survive so we can just do a group by and Groupon is one of those operations that a lot of data scientists like kind of hate because um one well hey they love it because it's really powerful but B they kind of hate it because it doesn't really it's not as intuitive to use so if you're using a group by you can't it's harder to visualize what the end output is going to be then something simple is drop right but with data Wrangler we made Group by super simple so you can just do something like it's way easier to select this and then you get a boom easy Group by by default it's count so this doesn't really just tells us how many people survive but we actually want to look at the percentage of people survive so it actually tells us more information about it so let's do um being here and boom yeah so it looks like for females this is obviously not multiplied by 100 yet but females around 74 of passengers survived and then for males around 19 so it's oh wow yeah right exactly or but I mean what was also what's the I guess my next question is then well like well what how many males and females were on the Titanic just in general right yeah that's a total number yeah so if you just go back to count um we can see that 
distribution here. Okay, so a lot fewer women were on the Titanic than men. Okay, so that would affect that percentage somewhat, right? I'm trying to science now. Yeah, it would definitely have some effect, but, I don't think we can look into correlations this soon, but the stark significance of this, I think it just shows that, hey, we'll probably want to keep this sex column, because it definitely has some effect, right? We don't know exactly, yeah, absolutely. So, folks have joined the stream as we go, Jeffrey, and they're asking, what do these different colors at the top represent, the green and red? Oh yeah, right, yeah, so it's a diff. Green are columns that are going to be added, um, and red are columns or rows that are going to be removed. So for a group by, you kind of see, hey, red is what's going to go away if I do the group by, and green is what's going to stay, because the group by kind of creates a new data set. Um, but if I'm doing any row-level operations, let's say a filter, um, filtering everyone that did not survive in the Titanic data set, right, you'll see each of these rows with zero will be red, which kind of means what's going to be removed, and green is what's going to be kept. And if there's no color, it just means it'll stay the same. So, I guess, is it possible to get into a state where a column has got both additions and deletions? Oh yeah. How do you represent that? Yeah, um, so that's pretty simple, it'll just be, like I mentioned, row-wise. So for example, with a filter, a row will be highlighted green, or highlighted red, or just won't be highlighted. So it can happen column-wise or row-wise. Gotcha, gotcha, okay. Um, so yeah, it seems pretty interesting for
like the sex column. Let's see if the second part of the motto, the children part, holds true. So let's talk about children. We can actually do a group by on age, um, but you can see age is not a really consistent number, right? It's every single possible age, there are floats, there are non-whole ages. So to make it a little nicer, we can actually just modify it through code. Um, I actually know this really cool pandas operation which groups them into however many buckets you want to give it. So I can do pd, which stands for pandas, dot cut, so it's essentially cutting the data set into chunks. So, um, let's see here, I'll set it to five groups just as an example. You can see here, now I'm chunking the ages into five equal age ranges, and now it's a lot easier to see. I can change this to any value, but five is nice because it roughly matches each life stage of a human. Um, let me just zoom in really quickly for this one because it might be a little bit small, but you can kind of see, for each age range, here is the survivability percentage. Um, so from ages 0 to 16, so more of the children and teenager phase, it's 55 percent, and as we go down we see 34, 40, 42, and then seniors, like nine percent. Yeah. But I mean, everyone who hit the water died, right? Like, nobody survived. Did they pull anybody out? They didn't pull anybody out. Oh yeah, no, no, there was another ship, I think it was the Carpathia, I'm not a historian, but there was another ship that did pull out a good number of people. So that's why you see here, if you just exit out of this real quick, um, we can see, what, like 340 folks survived and 549 did not. So obviously the majority did not survive, unfortunately, but there is a significant number of people that did actually survive. Wow. So I
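The pd.cut step above, bucketing ages into five equal-width ranges and then grouping on the buckets, might look roughly like this. The sample ages are made up for illustration; only the column names follow the Titanic dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [4, 15, 25, 40, 70, 8, 33, 60],
    "survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# pd.cut splits a continuous column into 5 equal-width bins,
# turning every exact age into an age *range* (a categorical).
df["age_range"] = pd.cut(df["age"], 5)

# Survival rate per age range (mean of a 0/1 column = fraction survived).
# observed=True skips bins that no passenger falls into.
by_age = df.groupby("age_range", observed=True)["survived"].mean()
```

Changing the `5` changes the number of buckets, which matches the "I can change this to any value" remark on stream.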
didn't know that. I thought everybody who went into the water, unless you were on the floating door, didn't make it. Yeah, maybe that's just the Titanic movie causing that stereotype, but yeah. Yeah, this is the Titanic data set we're looking at, which is a little bit morbid, but also very interesting. Yeah, um, and again, we just chose it because hopefully it's a story that most folks are familiar with, um, so I don't have to really explain the data as much. Yeah, yeah. Um, so another interesting one, I'd say, is this pclass. This kind of shows you what class the passenger was in, think of it like an airline, right, there's economy class, business class, first class. Um, but I don't actually know which number is which, like maybe three is actually first class, maybe one is first class, I'm not too sure. So maybe one way of figuring it out is, there's actually this fare column I saw earlier that shows how much somebody paid for a ticket on the Titanic. We can see there's a heavy distribution toward the lower fares, obviously, because I'm sure those are the economy tickets, and then there are the higher ones. So I can just do a quick sort from highest to lowest. So somebody definitely was rich here, these are the rich people, um, 512 dollars for a ticket in 1912. Right, chat, do the math, what is 512 dollars in 1912 equivalent to in 2023? I'm curious, what do you think it is, Jeffrey? My guess is probably like ten thousand dollars, at least. Ten thousand dollars at least, yeah. I'm gonna say 15,000.
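The highest-to-lowest sort on the fare column is a single pandas call. The names and fares below are illustrative stand-ins (the 512.33 top fare does appear in the Titanic dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Cardeza", "Braund", "Ward"],
    "fare": [512.33, 7.25, 512.33],
})

# Sort the fares from highest to lowest to surface
# the most expensive tickets at the top.
top = df.sort_values("fare", ascending=False)
```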
kind of like a first-class flight nowadays for us, right? Right, exactly, yeah, but this was even nicer. I mean, I saw the movie, I know what it was like: Leonardo's in coach, Kate Winslet is in first class, they're having fine dinners, I've seen it. Um, but again, you can quickly see, if we just go back to the pclass column, these are all ones, so I'm guessing one means first class here. Um, yeah, all the top most expensive tickets are all class one. So actually, just for curiosity's sake, let's do this real quick, let's group by, um, what's it called, pclass. Yes, 15,879, Jeffrey, that's how much. Yeah, you're too good at this. All right, sorry. Um, no, no, so one interesting thing you kind of brought up is, let's actually see what the average price per class is, so we can see that distribution. Um, so we look at the mean, and we see for first class the average price is 84, um, and economy class is 13, so it's around, what, like six x. That's a big difference, though, between the 84 average and 512 for a first-class ticket. Yeah, maybe someone got the presidential suite or something. Yeah, yeah, but again, you can see it's kind of reflective of even our own prices nowadays for flights, like first-class flights are usually, what, like five, six x more, at least, than the economy tickets. Yeah. Are meals included? No, not for first class. I've learned this: the nicer the ticket you buy, the more stuff you have to pay for. This is true. Like, chat, go to an expensive hotel, you've got to pay for breakfast, but if you stay at a Super 8, which, if you're not familiar, I don't know if that's strictly American, but it's the lower end of the hotel chains, you get a free breakfast. So I'm just saying, am I wrong? Unfortunately, it's probably a continental breakfast, which
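The average-price-per-class lookup above is another group-by with a mean aggregation. The sample fares below are hypothetical; the 'pclass' and 'fare' column names come from the Titanic dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [84.0, 120.0, 20.0, 13.0, 7.0, 10.0],
})

# Average ticket price per passenger class.
avg_fare = df.groupby("pclass")["fare"].mean()
```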
they make it sound really fancy, but yeah, I didn't say breakfast. Yeah, I mean, you make your own, it's the waffle maker with the waffles. Yeah, yeah, sorry. Um, no worries. Um, yeah, so we saw the pclass column might be of interest, let's quickly see what else could be of interest. So there are these sibsp and parch columns. For folks who don't know, I looked this up a little earlier: sibsp actually represents the number of siblings or spouses you have on the ship, and parch represents the number of parents or children. Um, so to make it simple for folks, let's just rename the columns so people aren't confused. I can say, like, number of siblings and spouses, and you can see it updates live here, um, so anybody in the stream that's watching can follow along without being confused. Um, this one will be number of parents and children. Again, very Excel-like, but you also get the piece of code that does it. So at the end of the day you can always just export this, and it's all very open, right, there are no hidden things happening in the back. Very nice. Actually, these columns are very similar, they're both just talking about what number of relatives you have, it's kind of an arbitrary split, right? Yeah, so maybe we can just group them together. So let's go to the formulas and create a new custom column from these formulas. Let's call it number of relatives, right? And then for this we can just copy this and write a custom formula, so we can do df of this plus df of this, um, these two columns. So we can add them up, and that gives the total number of relatives. You can see here, here's a preview, so you can see exactly what it's doing, um, and you can see it adds everything up, so now there's a four here,
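The rename-plus-derived-column sequence above corresponds to code like the following. The snake_case new column names are my own choice for readability; only 'sibsp' and 'parch' come from the actual dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 2, 1],
})

# Give the cryptic columns self-explanatory names...
df = df.rename(columns={
    "sibsp": "num_siblings_spouses",
    "parch": "num_parents_children",
})

# ...then combine them into a single derived column,
# the total number of relatives aboard.
df["num_relatives"] = df["num_siblings_spouses"] + df["num_parents_children"]
```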
etc., and you can see again exactly the code that generated this as well. Yeah, a new column based on the two, exactly, yeah. It's kind of like, um, dimensionality reduction: ideally you want as few features as possible, but you should only do it if you think these are correlated, which they most likely are, because they're just talking about the same thing. Yeah, I was going to say there could be some overlap, but it doesn't matter, yeah, just for this, looking at the data. All right, so yeah, we have a number of relatives column. Um, so maybe let's now explore the group by again, on survived, and see how class affects survivability, like, did more first-class people survive? Exactly, yeah. I'm gonna say yes, right? Chat, I feel like we should get guesses: do you think that more people from first class survived than from economy class, realizing there are far more people in economy class? But I'm just curious, and also we're on a delay, Jeffrey, so we'll continue. Okay, sounds good, let's go. So group by pclass here, and let's do the percentage, let's see if there's anything interesting. Wow, so first class had a 62, 63 percent survival rate, and economy class is at like 24.
I mean, yeah, we just know this to be true, right? Money, yeah, money is going to get you prioritized, unfortunately. Um, and also, if you think about it, on ships, on cruises, if you've been on a cruise, the first-class passengers are usually near the top of the ship. That's right, yeah, and they show that in the movie, right, those are the decks that got flooded first. Yeah, so I guess it makes sense, because the economy passengers had to climb up, what, at least triple the number of stairs that the first-class people did. Yeah, and my guess is also that the first-class passengers got a lot of help and were prioritized getting to the boats. Um, so yeah, it also seems like pclass is a pretty important indicator, so we probably want to keep that. Um, let's do one last one, the new column we just added, which was the number of relatives. Um, so right now the distribution's not super nice, it's harder to read. So maybe, instead of the number of relatives, let's make it into a Boolean column, right? Let's just say, hey, do they have relatives or not? So more of a Boolean rather than the number of relatives, because I don't think the exact number should affect it too much, we just want to see if they had relatives. So let's convert this to a Boolean column, let's check if this is greater than zero. And again, I basically went back to the previous step, and I can quickly make these edits, and now you can see it's a quick, easy Boolean column. Um, and again, there's a new visualization for Booleans, so you can see here, oh yeah, yeah, it seems like most people did not have relatives aboard, they were traveling solo, um, but there's a decent amount, maybe, what, like 30, 40 percent, that did have relatives on the ship. So that's pretty interesting, and I can again quickly click update to actually update
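The count-to-Boolean conversion above is a simple vectorized comparison in pandas. The 'num_relatives' counts here are hypothetical sample values:

```python
import pandas as pd

df = pd.DataFrame({"num_relatives": [0, 2, 0, 1]})

# Collapse the count into a simple Boolean: did the passenger
# travel with any relatives at all?
df["has_relatives"] = df["num_relatives"] > 0
```

Because the comparison is applied element-wise, the new column is a proper bool dtype, which is what triggers the Boolean-specific visualization mentioned on stream.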
that last step. And let's just do that quick group by again to see what that actually is. Let's do number of relatives, and then, um, let's pick survived again, go to the mean, and see what that actually is. So if you had relatives, um, you had around a 50 percent chance of survival, and if you did not, you had around 30 percent. So that's pretty interesting as well. Yeah, that's interesting, but I'm guessing that if you're traveling solo, you're more likely to be in economy or coach. Yeah, that's a good point. Versus a family is more likely to be in first class, I would think. But I'm guessing the women-and-children-first idea holds true here as well, right, because if you're a child you have parents on board. So yeah, I guess this kind of makes sense too, but I think it shows that, hey, number of relatives is going to be a factor, so we'll probably want to keep this, at least for now, in the data. Yeah, yeah. Right now I'm just trying to figure out what to remove. Um, the last column, I think we've looked through most of them actually, but the last column that caught my eye is this name column, where you'll see people have different titles. Um, there's an interesting one called Master here, which, maybe as an older term, I think that was just like sir back in the day? Or, oh, okay, they used to refer to young boys that way, Master so-and-so. Yeah, and I think there are a lot of different ones, like somebody's is Don, which I'm not really sure what that means, maybe it's from a different language as well. But let's actually extract out these different titles, which is not a simple task, right? If you think about it, you'd probably have to do some regex, because everyone's name has a different format: this person has a very simple name, first and last name, and something that sounds like a middle name; this person has, what, like an alias or something, um,
and some people have, like, multiple middle names. So yeah. Jeffrey, master was for prepubescent boys, typically boys below the age of 13: a boy of 12 or below was always titled master, while a boy over the age of 12 and into adulthood was titled mister. Okay, today I learned. I think that actually makes it even more interesting to pull out these titles, because a title kind of combines the sex column and the age column, right? It distributes into, hey, males that are young, or males that are old, etc. So maybe you want to extract that out and see how it affects survivability as well, because, I mean, why not at this point. That's so strange that they even gave children titles. Well, they gave young boys titles, yeah. Um, yeah, so let's do an extraction of these titles. Like I mentioned, if you had to do this by code, it's probably going to be, what, like writing a regex. But we have this really cool operation here, um, that I really want to show you, called string transform by example. This is one of our AI features, and, um, basically how it works is, based on one or more columns, you give an example of what you're trying to pull out. So here, let's name the new column title, because that's what it's called, and we're going to extract out the keyword Mr. So let's just type in Mr as an example, right? We're basically telling the model, hey, I want to pull this string out, from this one example, and it'll try to go and impute the rest of the values and predict what the rest of these are. Oh, what is this? Okay, this might be an unforeseen bug, but it tries to predict what the rest of the things are. So let's try one more time. I'm on a pre-build right now, which is why it's a preview, so it might not show up directly. Oh, okay, now it works. Oh, there we go. Okay, so it's kind of like Excel, where you drag down, and once it's
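For comparison, the manual-regex route the speaker alludes to might look like this. This is a sketch of the hand-written alternative, not the code Data Wrangler's AI feature generates; it relies on the Titanic name convention "Last, Title. First ...":

```python
import pandas as pd

# Titanic-style names follow "Last, Title. First ..."
df = pd.DataFrame({
    "name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Allison, Master. Hudson Trevor",
    ],
})

# Pull out the title: everything between the first comma
# and the next period. expand=False returns a Series.
df["title"] = df["name"].str.extract(r",\s*([^.]+)\.", expand=False)
```

The by-example feature spares you from working out this regex yourself, but as noted on stream, it still emits plain readable code like the above so the result is reproducible.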
recognized the pattern. Yeah, so it's funny that you mention that, because this is, um, the exact same team, or a very similar team and technology, that's used in Excel, right? It's called Flash Fill, if you're familiar with Excel, and we worked with that same team to get this same AI feature built in here. No, I didn't realize that's what it's called. So when you drag down in Excel it's called Flash Fill? Yep. Huh, interesting, a technical term for the Excel hardcore fans out there. But yeah, now you know. But the cool part is, in Excel, when you drag it out, you don't actually know if it's doing it correctly or not, you're hoping for the best, right? And most of the time it obviously works. But the great part is, with Data Wrangler and its code-centric approach, you actually get the piece of code that generated it, right? It's very readable for users, and it doesn't really use any separate or external libraries. So even though it's using AI in the background, we kind of demystify it by generating the code that's actually used to produce the result. You can see here exactly what it's doing, which makes it reproducible, which is one big thing. At the very end, if I export this in a script, I can run it again and I will always get the same result, exactly. So that's one of the issues with AI, where, hey, it might not always give you the same answer. Yeah, it's not reproducible. Yeah, exactly. Yeah, so again, this looks pretty cool, so we'll just apply that. Um, yeah, so we have a lot on there. Let's do one last quick thing with the good old group by and see how title actually affects, um, survivability. But actually, maybe one thing we can do first is see how many of each title there were. Um, so there are a bunch of interesting ones, major, yeah, I'm guessing some of these are from
different languages. Let me see, like Mademoiselle, Madame. Oh yeah, yeah. Didn't it sail from France? No, I think they all sailed from England, I believe. Oh yeah. Um, yeah, we can see a lot of these different titles as well. You know what's interesting is, I feel like in 2023 there would be no title column, right? Exactly, yeah. Well, I mean, there's probably a lot of the culture there that they also wouldn't keep. One interesting thing that's kind of funny is, hey, captain, there's only one. Well, there's only one. Um, but let's actually see how many of these survived, what the percentage is, for one last quick thing. So let's go to the mean, and, well, unfortunately the captain did not survive, he went down with the ship. But, um, let's look at some of the others: doctors had, what, 42 percent, um, misses at 70. Okay, yeah, which makes sense as well, like what we saw before. Um, yeah, I think if we actually had a little more time, we'd probably filter out all the one-off titles, right, because I don't think they'd have too much of an effect, but for the more common ones, like Miss, Mr, etc., you can definitely see some sort of correlation, or some effect it has. So, is it possible to export all the Python code generated? I think that's what you said, right, that you can export all of it, so that's really cool. Yeah, so we're actually getting near the end, so I'll showcase some of that as well. So you see there are a lot of steps here, but what's really nice is, um, you saw it write a lot of code, right? Um, but we actually put it first in a kind of human-readable format, so at a glance you can easily see what it did without having to read through the code. You can see, like, hey, previously we did a string transform by example, and that extracted the title, we did
things like, we created that new column, we renamed some of the columns, we dropped some missing values. So it's really great to see all these things: you can always see the intermediate step as well as the piece of code, um, that was used to generate it. And you can see this preview code for all steps too, so here's a quick preview of everything you've done. Again, you're not meant to actually edit or do anything here, this is just a quick preview. When you're actually done with the data set, we have a few export options at the very top. At the very top here, you'll see we have three different export options every time you enter Data Wrangler. We have export to notebook, because we know data scientists love notebooks. If you click on this, a new notebook is generated, and again, because this is VS Code, you can lay it out side by side. I'm zoomed in, but you can drag this anywhere you want. Um, you can save this notebook, and you'll see it not only dumps the code, it puts it in an extremely nice, clean function, right? You can see here every operation is commented, so you can see exactly what it did. It also makes a copy of the data so that it's not manipulating the real data, right? This is what I talked about earlier: if I did something wrong, or if I did something that deleted my data, I don't have to worry, because it's not the same kernel as the one, you know. So if I actually run this again, let's just run this with my Data Wrangler kernel, and it prints out the df.head, you'll see I get the exact same data frame that I actually got from Data Wrangler when I cleaned it. So boom, yeah, you can see this is the exact same one, you see I extracted the title column, you see that there. And now it's a notebook, right, I can do anything I want with it. So maybe I'm done cleaning, I want to go do
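The exported notebook wraps the steps in a clean, commented function operating on a copy, roughly in this shape. This is a sketch of what such generated code might look like, not Data Wrangler's actual output; the function name and the two sample steps are assumptions for illustration:

```python
import pandas as pd

def clean_data(df):
    # Rename the cryptic column for clarity.
    df = df.rename(columns={"sibsp": "num_siblings_spouses"})
    # Drop rows where 'embarked' is missing.
    df = df.dropna(subset=["embarked"])
    return df

# Work on a copy so the original DataFrame is left untouched --
# the point made on stream about not mutating the real data.
raw = pd.DataFrame({
    "sibsp": [1, 0],
    "embarked": ["S", None],
})
cleaned = clean_data(raw.copy())
```

Because the pipeline is a pure function of its input, rerunning it always reproduces the same cleaned frame, which is the reproducibility point the speakers emphasize.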
some model training, model building, right? I can import sklearn or something and pass the data set to it. Um, because it's pure code, I can share this with anybody, right? They don't need to have Data Wrangler installed, but hopefully they do. I can share this with anybody, I can save this to GitHub as well. Because the artifact that's generated at the very end is pure code, it makes it super easy to share, and very extensible as well. Yeah, it's very cool. Um, so that's that. And then also, let me just save, um, you can also export as a CSV if you want, if you just want to save a direct CSV file, or you can just copy all the code, right? And if you want to work with a Python script, like somebody mentioned in the chat, you can just create a new Python file and paste it there. Um, yeah, so that's kind of a quick highlight of Data Wrangler. Um, one really quick thing I just want to showcase before we wrap up is, once you actually export the code, right, every single time you do a df.head or print out a df, you'll see this launch Data Wrangler button. What this means is, you don't actually have to do everything all in one go. It kind of invites exploration: you can jump into Data Wrangler with one click, jump back to the notebook, maybe do something that doesn't involve Data Wrangler, with the code, then jump back into Data Wrangler and back out. So it's kind of like a companion to the notebook rather than a separate tool, which is why we wanted to build it directly into notebooks. Obviously, if you want to work with your own data set or data frame, you can just launch it directly, as I did here, but if you want to work with notebooks, as we know data scientists love notebooks, you can just jump back in, export the code as you saw here, right, export it back out, and then you'll see a button also below, I haven't run this yet, but you'll see a button
also that says launch Data Wrangler, so you can jump back in and continue with what you were doing. So I have two questions based on that, then. You can jump in and out, but can you remove a step in the history? Can you just say, actually, I don't want to do that, I want to roll that back? Exactly, yeah, that's a great question. You can see here, if you hover over one of the steps, you'll see this delete button, and if I want to get rid of, like, this title one, I can just click delete. And can you rearrange the steps? So that's something that's in our backlog right now. Okay. We just wanted to get this first preview out, but we think the value that this preview brings, as you hopefully saw in the stream, is just so great that, um, you'll stick with us until we get to those features. Well, I hope people will. I think there are a lot of great comments in here, people want to use this right away. I want to use it too, because you can do so much here without even writing any code at all. In fact, part of me just wants a button, like, generate a scatter plot, right, and then put a least-squares regression line on it, right? Exactly, yeah, we're halfway to a full visual tool. Um, can it generate pytest code? Um, so currently, at the moment, it does not, but the team that works on Data Wrangler is one of the sister teams to the Python extension team, so this is definitely something that we can talk to their team about. Um, this is directly integrated with both the Python and the Jupyter extensions, so it understands both of them, so it's something that we could easily add if we see these feature requests from people. Awesome. All right, so let's throw the link back up to Data Wrangler one more time, which you can install today, a free extension from Microsoft, um, written by Jeffrey in his free time on the weekend. No, I'm just kidding. I'm on the Data Wrangler team, yeah, I'm
actually the PM owner and one of the folks that started it. So, a cool fun fact: Data Wrangler was actually started by me and two other folks on the team as a hackathon project almost a year ago. We were doing a hackathon on the team, and we just thought there was so much value in it from testing with customers that we built a whole team around it and turned it into a product. So Data Wrangler was actually built by an entire team of engineers, um, and I'm just the PM that, I guess, oversees it. Well, thank you so much for joining us, folks. We do have to run, we're almost at time here, but thank you so much for joining, thank you for your questions, and Jeffrey, thank you for joining us. Everybody, download Data Wrangler, and come back here next week for the next VS Code live stream, next Thursday at 10 a.m. Central Time, 8 a.m. Pacific. We will see you there. Thanks, Jeffrey. All right, thanks, everyone. Thank you. [Music] [Applause]
Info
Channel: Visual Studio Code
Views: 11,148
Keywords: code, vscode, python, data wrangler, jupyter notebooks, data
Id: gc0Hm1NpYPo
Length: 56min 45sec (3405 seconds)
Published: Thu May 04 2023