Exploring Data Wrangler in VS Code | Python Data Science Day

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

hey everyone my name is Jeffrey and welcome to today's session on exploring data Wrangler in vs code um today we'll be using the free data Wrangler extension for vs code to explore analyze and clean a data set of Airbnb listings in Seattle and throughout the journey we'll be checking out how data Wrangler will help um increase the productivity of data scientists and increase the learnings that you get from your data set so to start off I'm in vs code right now my favorite editor and I do have the data Wrangle extension already installed as you can see right here um you can find this in the vs Cod Marketplace and I'm also have a jupter notebook open and I have pandas loaded as well as the Airbnb data set as a CSV loaded into my notebook as well so to one way to invoke data Wrangler from The Notebook is if you have a data frame such as DF here which is my Airbnb data set if I just type DF dfad or anything that prints out the data frame you'll see this new uh open data frame or whatever your variable is in data Wrangler button um in the cell status bar so feel free to click at this anytime and this will open up the data Wrangler extension when you first launch data rangler you'll see a really rich interface of your data so you can see this is the airbm data set um you'll see at the top all the columns at the top it'll tell you as well additional information such as what is the data type of the column um any missing or distinct values and what the percentage of it is relative to the overall data set um depending on the data type as well you'll see the summary statistics or what the representation is um in histogram of the distribution of the data so for numeric columns we can see distribution um for string columns you'll just see how many distinct values there are um if there's any columns like object columns you'll see also the frequency so you can see how many of each item appears in the column as well you also see at the top a um a data shape summary so you can see how much data you have and on the left hand side you'll get a data summary of what are all the columns um which of the columns have missing values um Etc so firstly let's take a look at what we can do with the data set so let's quickly look through the data we can see um you can do some really basic filtering of sorts so let's say we want to look exclusively at the queen and neighborhood here I see a lot of them we can actually just invoke filter right here by clicking on the uh three dots at the top right we can click add filter um let's say uh equal to and let's just say queen and and we can see as I'm doing this live data Wrangler automatically updates The View and the grid to only contain information um that I'm providing here so you can see it's only queenan now and we can see we filtered it down and all these stats update live as I'm actually updating the filter so that's great as well um let's just undo this filter because we want to look at all the data set but if we recall when we go back into our full data summary we can see there's a lot of missing values in some of the columns such as square footage um we want to find where square footage is um because this is a pretty big data set we can actually just use this go to column feature it shows everything if I can just search for the word square feet square feet and we'll jump nicely to this column so I think because we want to do a little bit more exploring and a little bit more data clean mean and analyzing we'll want to switch from this viewing mode which is what data rangle opens by default into editing mode so viewing mode kind of lets you explore the data really quickly apply some like basic sorts of filters look at the data look at the distribution of the data as we did earlier but if you want to do any more cleaning or um any manipulation of the data sets U we'll want to go into editing mode where you can leverage a lot of data Wranglers built in operations to do such so now that I'm editing mode you'll see a couple things change so one thing you'll see is now this new operations panel on the leth hand side top left U what this does is this showcases all of the data Rangers built-in operations um that you can apply so they're all grouped by category of operation and um you'll see in a minute of how I can actually use these operations to help with cleaning the data set in a much faster manner than just writing code um the leth hand side you'll also see a new panel called cleaning steps and again this will be self-explanatory when you see it in a minute but as you're applying operations and doing different transformations to your data set you'll see this start to populate with the different steps that you actually did so you can keep it's kind of like a history you can keep track of everything you did and the bottom side here you'll see again in a minute um this will show a live preview and automatically generate the code required to make that transformation um so let's get back into it so um like I mentioned earlier square footage was one of the columns that we saw that had a lot of missing values and we can confirm that here when we look at the missing we'll see 97% of values are missing here um so we'll probably want to generally R them like this is a lot of columns we can't really um a lot of missing values we can't really impute many of these because there's just so many that are missing so it might be just easier to get rid of this column um so one way we can do that is we can click on the top right side we can click drop columns here showcases some of the most commonly used operations but if you want to get access to all of them it's that operation panel is on the left so when you actually invoke an operation here which is I'm dropping columns here you'll see a few things change so you'll see a diff of the view so if folks are familiar with GitHub and this is kind of like a diff of what's actually happening to the data this one's a pretty simple one because we're just dropping the the column so it highlights everything that's going to be removed in red and changed in green so this um in this case it's just dropping the column so you'll see everything is red um you'll also see this code was automatically generated by data rangle as well so it gives you um you can also look at the code to make sure that's doing the correct operation and if everything looks good to you you can click apply and once I click apply you'll see that that um that column no longer exists in my data set as well as you'll see this new drop operate or drop columns in my cleaning steps so again this is what I mentioned earlier where as you're cleaning the data you'll see this start to populate um another thing we can do is if we look back into the missing values we'll see review scores rating also had a good amount of missing value so we want to handle that so let's find that um by searching for review score ratings here and this one only has we're looking at 17% so most of the data is pretty clean here most of the data is intact so we'll actually probably want to keep this column um or most of the data in this column so we we can do is again you can see some of the missing values here we'll want to just instead of dropping the entire column we can handle some of these missing values so we can say one thing we can do is we can just drop the missing values there's other strategies as well but in this case we'll just drop the the values that are non-existent so what's great is we can see as I go into my preview of the operation you'll see all the columns that are actually being dropped or sorry all the rows that are being dropped highlighted in Red so it gives you a very visual indicator of what's actually happening to your data and again you'll see the code automatically being generated at the bottom and you'll see um a new set of like column statistics on your data so you can now see if I applied this the missing values become zero so if everything looks good to me here I can click go ahead and click apply um so now that I've done the review scores rating um we'll see there's still a few missing values but there's 's not as much anymore so we'll um we'll for the sake of time we'll get into exploring some of the data set as well so let's go let's look at the price column because um as an Airbnb data said a lot of us when we're booking airbnbs we care a lot about the price of the place and that's one thing I think we want to analyze in this data set and see how price compares to different bedrooms bathrooms different neighborhoods as we saw some of the other data in the data set um one thing we'll notice is that price is actually an object and this is likely because you'll see like there's a dollar sign here there's also dos here so it's not fully represented as a number and which is also why we're not getting this histogram so we'll want to see we'll want to convert this um from like a human readable price to something that's more machine readable that's just a raw value um typically if you're doing this outside data Wrangle you probably have to write some code to figure out how to remove the the dollar sign as well as the dots but because the like the numbers are not standardized right you might have like um hundreds of dollars thousands of dollars Etc it's not as trivial and you'll have to probably look up online um data Wrangler actually has this really cool built operation called by example um so we can do string transform by example here and what this does it's um if folks are familiar with Excel here it's very similar to excel where it tries to recognize a pattern that you give and tries to impute the rest of the values so in this case we're going to be um giving uh creating a new column by example let's say price cleaned since we're cleaning up the price here and then we'll it says enter an example so let's just say we want to extract 80 from this um it tries to impute the rest of it um it gives this question mark because it's not fully sure in this one it gives um so it's saying hey enter another example and we can see here it didn't do it correctly it thought I was trying to pull just the first two numbers of of everything so here I'm going to update the example and set it to 150 and now you can see because of these two examples that's able to smartly impute the rest of what it should be deriving um based on these two examples I gave so you can see now it's like 975 for this um but it still remembers that if it's just 80 then or if it's just two digits it'll pull that out as well and again the best part is you'll see because this we in data Wrangler um it automatically generates the code for you and provides that transparency or trust um to you as the user so you know exactly what it's doing it's not behind some sort of black box here and you'll see um exactly what it's doing so you can see it's finding um it's the index it's finding all the values between the dollar sign and the dot and that's exactly what I'm trying to do here so if everything looks good and again this is a little preview mode if everything looks good I can click apply and now you'll see this is part of my data set um as a new column and now that I've added this price clean I probably don't need this original price anymore so I'm going to drop this column um again something similar I did earlier and one thing I'll notice is I still don't to see this histogram and it's because this price clean is still an object because I'm just extracting at this value so I want to convert this with this change column type operation from object to um there's a lot of op there's a lot of different types here but let's just say float 32 is an example and again here you can see the diff that's happening so you can see that um this is the previous column and this is what's actually going to be added or what's going to be changed we can see it's changing all the float and now we'll start to see that histogram here so if that looks good we'll click apply and now we can kind of see some more data about the price so we can see the minimum price here is $22 a night at for Airbnb and the maximum price is around $1,000 which is pretty expensive we can see the average is probably around here so if I want to see more information about price clean I can actually click on this column and it now focuses the summary on this specific Hollow and it gives me more information so I can see things like the mean price um uh or the standard deviation the minimum price um Etc and so if you want to see a specific data on a specific column you just click that column and it updates it here but if you want to go back you can unselect it and you can see the data summary of everything but yeah $1,000 a night it's pretty nice so let's uh let's just quickly sort it um just for my own curiosity sake to see maybe what the like what this place is um so if I look through so now we can see this is the $ th000 night place um we can see that it's a four-person place there's only one bathroom two bedrooms I don't know if it deserves $1,000 a night um it is a house though so it is pretty big and we can see it's the neighborhood of Ravina um but one thing I also noticed is that this column says availability 365 which kind of represents how many nights of the year in the last 365 years that's actually available and you can see it's available for almost all the year so I think this person likely overpriced their everying be um and that's why there's so many nights that's available so that's another cool or interesting thing we're already learning from this data set um one thing we'll also want to do is um given that we can see some of the availability is zero which means it's just not available to book at all I think if we want to get an accurate representation of the data we can actually filter out anything that's available um for zero nights which means it's just fully booked out or that person's just not even um providing or listing this Airbnb or something that's available like every single day which basically means that it's just probably overpriced and nobody wants it so let's just say greater than as the condition and say greater than zero um and we can also add another filter we can say availability 365 again and we can say let's say less than um 365 so again as you can see as I'm typing live it's updating both the code live and the data Grid live so you can see what's actually happening um with your data set very interactively so here I'm just removing everything I click apply and now we can see the Max and Min has updated here so now that the price has been cleaned and we're getting rid of some of the outliers for the availability let's see how the price um let's see how like things like the let's say property type affect the price so we can do a classic group buy over here so let's Select Property type let's go and search for a group buy and aggregate and let's aggregate on price um so price claim that's the operation we or that's the colum we just added and by default it does count and what count is is just showing you how many of these property types there are so we can see um super quickly let's just bring this down here we can see most of them at least from what I see yeah most of the uh property types here are Apartments of course um I'm sure that's probably what's the most abundant um we'll see the second most is houses and we can also see a distribution of some the more interesting or unique things so let's just click apply here let's see what else there is so there's also things such as campers RVs um there's a boat there's there's three boats which is kind of cool so maybe we'll look into that a little bit more um we can also see like some tree houses town house as well so this kind of interesting to see like the distribution of what people are actually offering in Seattle but I did Count earlier I actually want to see if any of these property types change by price so instead of count what we can do is let's say you already committed an operation as I did before actually jump back to this any of these previous operations at any time um and I can go edit it so here I'm going back to the group by Agy and instead of count let's just say I want to get the uh the mean so what's the average price for each of these um each of these property types so you can see here now I can see the average price of a condo not surprisingly it's probably more expensive than apartment um here's the average price of the house it's $130 a night um and we can see the boat obviously is by far the most most expensive at $513 a night as well so some more interesting facts um about our data another thing we might want to do is instead of so let's just click update so now we can see I just updated the previous operation but actually want to agregates we saw something like um the neighborhood as well right so not just property type might affect the price of the urban B but also the neighborhood of where it's at so we want to kind of check that out so we can actually just delete the previous operation by just clicking this and recall there's this neighborhood um column here so let's just scoup by this neighborhood column instead of the property type and see how that actually affects price so again I can click on neighborhood I can do group bu and aggregate again and we'll aggregate on price one more time and by default it's count again so we can see um count just gives us a really good distribution of where all the neighborhoods are or where the most common neighborhoods are for airbnbs um so if you want to do this we can actually just quickly click apply sort descending just to make it look easier so we can see the most of the RBS are going to be Capitol Hill um these are if you're not fam Seattle these are some of the more centrally located places and again I think the lower values if I just scroll the way down like these are kind of places that are um a little bit further away from the city so the again um another thing we can do is like I mentioned earlier is we'll want us change from count again to mean so we can see the average price rather than just the number and so we can kind of see what is kind of like the most most expensive areas right so if I scroll down or if I sort by descending we can see this neighborhood of Portage Bay as well as West Lake have the most expensive um average airbnbs per night and again if I scroll down we can see um where the some of the cheaper ones are per night as well um so let's just go back and another thing we'll want to do is we saw earlier that there are bathrooms and bedrooms as well um so we want to see maybe how the neighborhood affects bathrooms and bedroom so let's just say we can also instead of just aggregating on one column we can say let's say B bedrooms here and instead of count again we'll go back to mean which is going to represent the average number of bedrooms for neighborhood and we can do the same for bathrooms so let's say mean and again we can see again this all update live as I'm making these modification as well as the code updating live this just one of the power one of the um uh one of the benefits of using data Wrangler is rather than just working a notebook where you don't really know what's happening until you actually execute the code and you got to write some extra code to visualize it all data Wrangler is kind of just doing it automatically for you as you interact with the data with the UI so again if after this aggregation we can kind of see um alai this neighborhood of alai has around maybe 1.8 bedrooms right on average and 1.3 bathrooms but we can see maybe somewhere that's smaller like this neighborhood only has an average of just one bedroom and one bathroom um in the neighborhood so again gives you just additional information uh of can of course you can just write this group by on your own but this just makes it a lot easier to visualize um and do within data Wrangler so let's just update this and these numbers right now because I'm sure let's see these were float 64 so we have a lot of different decimal places um I want to clean this up a little bit so let's actually round all these so we can go into our operations um numeric because these are number columns and we have some operations built-in operations to help round the numbers to make it easier to view so let's go into round and let's say we want to set it to two decimal places right and again we can see this update live you can see this is going from this super long number into um a rounded two decimal place number here obviously there's just one so it's going to go there and again get the code for it and we can click apply so we can see how quickly we went from just this massive data set into something that's a lot more readable and gives a lot more information um to your eyes in a much nicer viewing way of rounded numbers rather than the whole super LW number as well so if this looks good you can also export it as well but one thing one last thing I want to showcase um I'm just going to undo some of the previous things we did is if you want to actually do any machine learning on this as well we want to convert some of the text into numbers something that's machine readable so one example of this is this host is superhost column um this is supposed to represent true and false to tell you if that host is is a superhost or not but T and F are not easily um represented by to a computer so I want to convert this to a number so we'll have this operation um under formulas called one hot and code and this is a very common operation to convert your numeric or your string columns to actually or categorical column sorry to a number column and we can see now what 100 code does is it's going to change like it's going to create two columns out of this and it's going to set it to one if it's false for that column or one if it's true for that column so you can see here the true becomes one here it's zero and removes that original column so I can click apply and now um this makes it a lot easier for um for me to do any apply any machine learning algorithms to this as well so we did a lot today um there was a lot I think there's also a lot of operations that we still didn't go through um if I just open a lot of these operations well you'll see there a lot of different string operations we can do like capitalizing the text um any date type formatting as well uh we can also calculate text length there's also some scheme operations so if you want to clone or duplicate your column rename the column as well so a lot of different operations that data Wrangle supports and if there's an operation you want to do that's not built in we can also just you can also just start typing code so if I want to remove this ID column because this just seems like it's the Airbnb ID I can just maybe just even take this example right I can do DF equals DF do drop columns equals ID oh there's um sorry P there was one I think there's like a bug that just so if there's an operation that's not supported here on data Wrangler as well um we can also always type in any operation as well so if there's any operations that um you want to write if you just want to write python code as well you can type python code here and it'll automatically update live in the grid so earlier we just saw a bunch of operations that you can do that data Wrangler does support but if there aren't any operations or if there's something custom you want to write or an operations that's not built in supported by data Wrangler right now you can also just enter this box and just type in any piece of code that you would like and anytime you write code here it'll automatically reflect live in this grd as well so again you get the same benefits of data wrang of having it very interactive showing you what's happening to the data but you can choose to whether it's a right code if you prefer that or you can use one of the built-in operations and interact with the UI as well so once uh we did we accomplished a lot of different steps today we can see we did like nine different steps not even including the ones that we undid or deleted or changed but once they're kind of done we can um Can quickly take a look as a sanity check of previewing everything we did earlier so we can see all the pieces of code that we wrote or had data rang WR for us um all the different operations also very nicely commented as well so if you don't exactly remember what you did based on the code it'll tell you in human readable text of what's actually happening um if everything looks good to you we have we can jump to the export menu at the very top so you'll see there's different ways we can actually bring this back because if you recall from the very beginning of our talk we started within a jupter notebook um to get to this data set um but there's different ways to go back to the notebook or export somewhere else as well if you'd like um so the common one is to export to notebook which will bring this piece of code back to the notebook there's also exports file so let's say you want to you start off with the data frame but you wanted to save it as like a CSP file or give it to a coworker working with Excel you can do that as well or we can just copy all this code of this button and you can paste it into GitHub you can paste it as a python file as well but there's different ways to export all of this um but the main artifact or thing that you're generating with data Wrangler is that piece of code or the data as well so in this case if you want to jump back to notebook we can just quickly click export to notebook and boom we're just back into that cell right after we launch data Wrangler in and you'll see it puts it into it generates a piece of comment saying that hey this was generated by data Wrangler um but also puts it into very nice clean function so we can see uh everything that we saw earlier puts in nice clean function creates a copy of it so again we're not modifying that original data set so once you're in data Wrangle you don't have to worry about deleting anything or making any changes because we're modifying a copy of the data set um it kind of facilitates or lets you explore the data without having to worry about making any um unwanted modifications or or destructive modifications as well and now in the data set we can just quickly run the code again and we'll get the latest data or the clean data uh but that was in a nutshell um of exploring data Wrangler um in vs code um as a recap we started with a notebook in vs code today we had an Airbnb data set of different Airbnb listings in Seattle um we loaded that into our notebook um we opened data Wrangler um with the launch data Wrangler button um and then once since data Wrangler we did a lot of exploration um we look through the different Columns of data um we cleaned up the data so we removed a lot of the missing values um we also tried to convert some of the the types of data right so if you remember the price it had like that dollar sign so we cleaned it up as well uh we tried to do some one hot coding as well some group bu um and then we finally once everything was good we con exported it back into the notebook and we get that piece of code here and we can continue on with what we're doing and of course you can always enter data Wrangle anytime within the notebook as well um but um just is just a a nutshell of what we can do with data Wrangler so thank you everyone for joining my session and if you would like to learn more about data Wrangler it's free to download on the BS code Marketplace by sear searching for the keyword data Wrangler as well or if you want to learn more about the documentation or anything you can just go to aka.ms data Wrangler so thank you everyone

Info

Channel: Visual Studio Code

Views: 404

Rating: undefined out of 5

Keywords: vscode, visual studio code, vs code, python, datascience, datawrangler, code, visualstudiocode, datascienceday

Id: kM5mAeEaWkw

Channel Id: undefined

Length: 25min 19sec (1519 seconds)

Published: Thu Mar 28 2024