How to Do Data Exploration (step-by-step tutorial on real-life dataset)

Video Statistics and Information

Captions

Hey, welcome. In this video, let's learn how to do data exploration. This will be the first video in a series where I will teach you how to deal with data; the next one will be about data cleaning. This series is going to be like a small version of my course, Hands-On Data Science, so if you're curious about that, I will leave a link in the description for you to check out. But without further ado, let's get started.

I always get so excited when I am starting a new project, because I feel like data exploration, cleaning, and just generally understanding the data set is one of the most fun things that you can do. While I'm exploring the data set, it also gives me a really good idea of how to clean it later, and of what kind of features I can come up with when it is time for feature engineering, if I need to create some new features. So, a small piece of advice from me: keep these two questions in the back of your mind while you're doing the exploration: how am I going to clean it, and how am I going to create new features? It doesn't have to be very formal; these are just things to be thinking about in the background.

But let's start with the technical part. First and foremost (I feel like this is the second "first and foremost" thing that I've said, but on the technical side of things), the first thing that you need to understand is that when you find a data set out in the wild, out in the open on the internet, you really can't understand a data set by itself. Of course, it could be a toy data set, a small one that is very easy to understand. But most of the time you're going to be seeing data that has some column names that you don't understand, or values that you don't understand. The column values might have a different unit, you know, the way that
things are represented could be different, so likely there will be things that you will not fully understand, especially if it's in a new domain that you've never worked in before. How we solve this is that there is always going to be some explanation file for the data set that you're finding.

I just want to show you some examples. It might look like a PDF, like this one, and it basically tells me, for this specific data set that I found online, each field or column name and what it means, and what the different values for this column mean. For example, it says it has a rate code ID. This is a data set I used for my course Hands-On Data Science: a New York City taxi data set, so it's basically all the trips that happened in New York in a whole month or something like that. So apparently, for this rate code, if it's one it means the standard rate, if it's two I guess it means that they're going to JFK, or Newark, or similar things. Or the payment type: if you are just looking at the data, you will not be able to understand this. So it's important to see these explanation files, and you're always going to find them somewhere.

As I said, it could be in a PDF format, or it could be in an HTML format, so it will just generate a web page for you where you can see what this means, what that means, etc., and get some indication of what things mean. Or it could also just be plain text files. We can see this one, for example: this is just an explanation with the title of the database, the sources, how you can use these, etc., and then here, the attributes. As I'm saying, the columns can be named a lot of things: columns, fields, attributes, features. We just need to find this table that explains what each column is. Sometimes they give you the data type, sometimes they give you the range of the values; that's also possible. Yeah, this
one is like that, or this one is a different one that also has information about the columns. So this is the first thing that you need to find.

If you are working in a company, or if you're getting the data set from a different team, so you didn't find it online, then you first want to understand the data set: open it in a Jupyter notebook like this, read it, try to understand as much as you can, and write down questions about what you don't understand. Then go and find the people who are responsible for this data set and ask your questions. This happened to me before, and what I did was basically just reach out to the person who sent me the data file. She was like, "Yeah, sure, but someone else sends me these files; I don't really know much about them." So I was like, "Okay, can you give me the name of the person who sends you these files?" I talked to that person, and the answer was, "Yeah, but I just download it from a system, so I don't really know how it's collected." So then I'm like, okay, I'll just go to the people who created the system that collects the data; let's say it's the engineering team. So I go and talk to the engineering team's head, and he points me to the person who set up the system, or who's responsible for the system that collects the data, and then I ask him my questions. So sometimes it takes a little bit of time to get the data, to understand the data, and to feel comfortable with it, but this is very important, because otherwise you don't know what you're working with.

The data set that we're going to work with today is called the Street Tree Census data set. I don't know if you're familiar with what a census is. What they used to do is go from door to door and ask people: how many people live in your household, how many of them are adults, how many of them are children, how many of them are working, how much money are you making? So basically kind of like getting
a lay of the land of who lives in a certain city or country. But they did that for trees in New York. I might have mentioned this before, but there is a New York City open data platform where you can download a lot of different types of data sets, and this is also where I found this one. I think it's really amazing; it's a really cute data set. So let me first start with reading it in my notebook, and then we'll start looking into it.

All right, so this is the data set. It has a lot of columns. As you can see, there are three dots; it means that it hasn't been able to show me all the columns. But there is a solution to that: you can change the setting for how many columns you want to show. This basically says: for pandas, set the option for the maximum number of columns that will be shown to None, so then it will display all the columns. I don't want to set the option to show all the rows, because that might be a little bit too much. So I'll change the setting and have it show me all the columns, and yeah, I can see all the columns now.

As you can see, there are a lot of columns, and for some of them it is not really obvious what they are. For example, tree ID: okay, pretty easy. Block ID: yeah, I kind of understand what it means. Created at. Tree DBH, for example: I don't know what that is. To understand what these things are, as I said, we have the file with us here. Very tiny font size, but at least we can understand what things are. For example, tree DBH apparently is the diameter at breast height of the tree, so I guess the breast height of a person: the diameter of the tree measured at approximately 137 centimeters above the ground. Okay, that's cool. Diameter of the stump: I don't really know what a stump is. When stuff like that happens, what you need to do is google "tree stump" to see what it is. A tree stump is what remains after a tree has been cut or felled. Okay, that's good to know. Whether the tree is along or offset from the curb; tree
status: indicates whether the tree is alive, standing dead, or a stump. Tree health. Oh, okay, so the tree might be a stump or not a stump. If a tree is a stump, I'm guessing we only get the diameter of the stump; otherwise we do not have that. This is an assumption that we're making, but we'll see whether that is the case or not. Tree health indicates the user's perception of tree health. Scientific name, common name of the tree species. Number of signs of stewardship observed: it indicates the number of unique signs of stewardship observed for this tree, not recorded for stumps or dead trees. So I guess stewards are people who are taking care of the tree; that's what I understand. Good. Presence and type of tree guard. Sidewalk damage immediately adjacent to the tree. Category of user who collected this tree point. All right, so this is good information; now we know more or less what these things are.

So let's look into our data set in a little more detail. At first it might seem a little overwhelming to know where to start, because we have a lot of column names, and you're like, wow, this is going to take a while to explore this data set and understand what's going on. But I think one important thing is to understand what your priorities are. Let's say, generally, I just want to understand the situation in New York about trees: how many of them are healthy, how many of them are not. I just want to get a general feel for the trees in New York State, or New York City, sorry. Then I wouldn't really care about their location. When that happens, you can basically just get rid of all the columns that give you information about which state they're in, which borough they're in, which city, which street they're on, the longitude and latitude of the tree, etc. So for now I can get rid of these things, and I will make a list of all the columns that I'm interested in. And to make it easy, what I
use is the columns attribute: it will give me all the columns, so it's easier for me to delete the ones that I'm not interested in.

All right, so this is more manageable. These are basically: if there is a problem on the root caused by a stone, if there's a problem on the root caused by a grate, or other; the same with the trunk and the same with the branches. That's easy to understand. The Latin name of the tree; whether the person who was collecting the data thought it was healthy or not; the status of the tree; if there's damage on the sidewalk right next to the tree; if there are any other problems. Okay, this is good, I can work with this.

One of the first things that I want to look at is the numerical values. The only numerical values that we have are the diameter of the tree and the diameter of the stump. As I figured out, if the tree is a stump, so if it's cut already, then we do not have the diameter for it, which makes sense. But actually, even before that, I want to see if there are any null values. This is how I can see if there are any missing values or not. It looks like with health we have a bunch of missing values. So how many values do we have? About 683,000 rows, and out of the 683,000, about 31,000 are missing for these columns. I want to see what it looks like when these values are missing, so I'm going to say: show me all the rows where health is a NaN value. All right, so it looks like they're NaN for the same things. I can decide to remove these values later or not; that's up to me. At this point I'm just exploring, I'm not taking any action. So this is something that you can take note of: you can say, hey, there are a lot of missing values, I may want to get rid of these ones. Okay, this is good.

So let's do describe to get a general feel of the data set. As I said, we only have two numerical values. Well, the first one, tree ID, is actually a categorical value, because the ID is not a continuous value, but it currently sees
it as a numerical value, and we can see that if we do dtypes. Yeah, most of them are strings, but the first one is an integer. It's an integer because it's the ID of the tree, but it actually should be a categorical value. So what I'm looking at here is the tree diameter and the stump diameter. We have this many values; the mean is 11 centimeters, the standard deviation is 8 centimeters. Okay, these are some tiny trees. And the maximum is 450; that's really interesting. Is it centimeters? Let me check the description: diameter at breast height of the tree, measured for both living and dead trees. Let me bring it closer. "Because measuring tapes are more accessible than forestry-specific measuring tapes designed to measure diameter, users originally measured tree circumference in the field. To better match other forestry data sets, the circumference value was subsequently divided by pi to transform it to diameter. Both the field measurement and the processed value were rounded to the nearest whole inch."

So okay, it's not centimeters, it's inches. This is important knowledge for us. I don't know what an inch is, so: one inch in centimeters is 2.54. So that is even more in centimeters. If it's 450 inches, that's a lot of centimeters. 450 inches in centimeters: okay, over 1,000 centimeters. That doesn't sound likely, but let's take a closer look at this.

What I'm going to do is create some histograms, because I want to see the distribution of these values. So let's say hist, and I want to make the number of bins a little bit bigger, yeah, with bins, and I also want to make the figure size a little bit bigger. All right, it's interesting. Again, I'm not looking at the tree ID; as I said, it's not really something that's important. The stump diameter is close to zero most of the time. But then, I don't know if you've watched some other videos of mine, but I think I mentioned this in the Hands-On
Data Science course: if you can see this value here on the axis, that actually means that there needs to be a value here, because that's how histograms are made. If the maximum value was 60, the x-axis range of the histogram would be from 0 to 60. But if it goes all the way to 140, it means there is a value somewhere here, even if it's just one. Same thing with the tree diameter. So I think something went wrong there; probably when someone was trying to put the value in, they actually wanted to write 45 but accidentally wrote 450. We don't know that. I also want to see how many values are here, or how many values are here. It looks like a logical way to cut this off is maybe 40 here, because as you can see there are still some values here, and maybe 100 inches here; even that sounds unreasonable to me, to be honest. So let me visualize that a little bit. What I'm going to say is: give me all the trees where the tree diameter is bigger than, let's say, 50. So we have 300 values where the diameter is calculated or measured to be bigger than 50 inches. Interesting. What if I visualize this? Okay, this is more or less what I expected, but let me make this figure a little bit bigger. Yeah, there are some here, which I guess is expected, but especially after here there are only one or two trees. It seems very silly to me that there would be a tree whose diameter is 450.
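To put these numbers in perspective, here is a minimal sketch (not from the video) of the unit arithmetic behind this check. The sample values are hypothetical stand-ins for the tree diameter column and the helper names are made up for illustration; the 2.54 conversion factor, the 50-inch threshold, and the divide-by-pi step come from the transcript.

```python
import math

INCH_TO_CM = 2.54  # one inch is 2.54 centimeters


def inches_to_cm(inches):
    """Convert a length in inches to centimeters."""
    return inches * INCH_TO_CM


def diameter_from_circumference(circumference):
    """Diameter of a circle with the given circumference (same unit)."""
    return circumference / math.pi


# Hypothetical diameters in inches, NOT taken from the real data set.
sample_dbh_inches = [4, 11, 9, 23, 50, 61, 140, 450]

THRESHOLD_INCHES = 50  # cut-off tried in the video
suspicious = [d for d in sample_dbh_inches if d > THRESHOLD_INCHES]

for d in suspicious:
    # If a raw circumference had been entered without the divide-by-pi
    # step, the real diameter would be this much smaller.
    as_diameter = diameter_from_circumference(d)
    print(f"{d} in = {inches_to_cm(d):.1f} cm; "
          f"if it were a circumference, diameter = {as_diameter:.1f} in")
```

Converting to a familiar unit first makes it easier to judge whether a value is physically plausible before deciding anything about cleaning.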
I mean, yeah, even as a circumference, 450 inches sounds ridiculous. But yeah, this is just some information that we have right now. Good to know. We can do the same thing with the stump diameter. All right, similar thing. I guess up until this point it's acceptable. Maybe they forgot to change it to diameter: they put in the circumference, and these values are not really correct. Could be. I mean, I guess 140 inches is not unreasonable if it's the circumference and they forgot to convert it into diameter. Otherwise it would be a very big tree, right? If the diameter is 140 inches, that makes a lot of centimeters; that makes more than three meters. A diameter of over three meters, yeah, that would be a bit too big, I guess. I'm just trying to work it out in my head; I'm trying to understand whether that's actually unreasonable or not. When we move to the cleaning bit, one thing that we need to ask ourselves is: is this really unreasonable or not, and what can we do with this data point or these data points?

What else do we want to look at? I want to see what the possible options are for some of these other columns. Latin name: probably there are a lot of different Latin names, but it could be interesting to see the distribution of the different names. So I'm going to come here and say... okay, I guess this tree is very common, and then we have less common trees here. One thing you can do to visualize this is to turn these counts into a data frame and then plot it. Maybe not a histogram; let me try a bar chart, that should work. Yeah, okay, cool. Not the most readable chart in the world, but at least it gives you an understanding of how many trees there are and which types there are. And this is kind of expected, right? You would expect some trees to be very common and then,
as you go, it's less and less common.

All right, what else can we look at? I saw that there are stewards, right? Some of the values for steward are missing, sidewalk is missing, problems is missing. But I want to see what the options are for steward. We can see here "one or two" or "none", but what are the other options? Okay, so it's either none, one or two, three or four, or four or more. This is good; it looks very standardized. If everyone had to write it down by themselves, you could see someone writing "one or two" one way, someone else writing it like this, someone else writing it like that. That's possible, so it's good to see that it's clean. I want to see the possibilities for sidewalk: no damage or damage. Okay, good. I guess for these ones it's either yes or no. I'm guessing this could be a website where they fill in a form, so this looks pretty standard. I want to see the status and curb location also; let's do it quickly. What was it? curb_loc, with the underscore: on curb or offset from curb. Okay, cool. There doesn't seem to be a need for extra correction there.

All right, one last thing that I want to look into is whether there is some mismatch between this tree being a stump and the health of the tree; for example, if there is any point where it says it's a stump but health is "good" or something like that. So I want to get all the trees where (what is the name of that column? status) status is stump. Okay, then I can call it stumps. So this is a new subset, and are all the values NaN for the stumps? Oh, so maybe steward, sidewalk, problems, and health are NaN for all the stumps and all the dead trees. Let me check that for the dead trees. So how much does that make? There are 17,654 stumps, it says, and 13,961 dead trees; that amounts to 31,615 dead or stump trees in total. So basically, how many values were missing? Yeah, it's
more or less the same. So I guess what happened is, if the tree is not alive, they either didn't bother or didn't think it was relevant to fill in the rest of the information: health, Latin name, steward, sidewalk, problems, etc. And probably for these ones they just put zero or something like that. Okay, so that's good to know.

One other thing that I want to look at is how these guys are distributed. What I wanted to see is how many yeses and how many nos there are for each of these columns. Of course, it's going to seem really instant to you that I achieved this, but I actually took 20 minutes or something looking online, as you can see from my searches here, to find out how I can see this information. So what I need to do is just assign this to a data set; what should we call it? Tree problems, let's say. And apparently the way to see it is like this: you just apply the value counts function to each of the series, and then you're able to see the values for all of them. So let's see. It looks like root stone problems seem to exist a lot of the time; no other problem exists that much. So problems caused on the root by stones is a big problem. This is just good information to have. When you're starting a project, these are some statistics that you can give to whoever is responsible for the project, or if you just want to show that you are progressing with the project, these are good pieces of information to show to people. You can also turn this into a visualization and show it that way, if these are relevant things to you.

Okay, so what we did: we first looked into what the possible features, or columns, are that we can use. We decided that we don't really want to get into the details of where the tree is located, which borough is responsible for the tree, which
street it's on, so we didn't look into that, because we decided that's not important for our purposes. We looked into the columns that this data set has and chose only the relevant columns. On these relevant columns, we first looked into the missing values. Again, we haven't done anything with these yet; that will be in the next video. We looked into whether these missing values happen all at the same time or not. Then we looked into the numeric values and how they're distributed in general. We looked at the histogram and saw that some of the values are looking a little bit suspicious, so we went deeper into those values and looked at how many of them are this outrageously high. If you saw a really even distribution here, with a lot of dots here and there, then you might think that maybe this is normal to have. But when you see there are only a handful here, kind of like outliers, then you can decide that maybe this was a mistaken input by whoever was collecting the data. Same with the stumps.

Then we looked into the names, how the different types of trees are distributed in New York City, and at some of the other columns, just to make sure that all the categorical values are standard and there is no differing terminology; as I said, for "one or two" there's not also "one dash two" or anything like that. So just to make sure that the values are standardized, we looked at some, or most, of the columns. And we figured out that when it's a stump or a dead tree, the information on health, Latin name, whether it has a steward or not, sidewalk damage, and whether there are any problems with the tree has not been recorded. This is a good thing to know about the data set. And we saw the distribution of how many and what kind of problems there are on trees. So this is a good place to stop, and from now on what we're going to
do is to start with the data cleaning.
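The exploration steps recapped above can be sketched in pandas roughly as follows. This is an illustrative sketch, not the notebook from the video: the six-row data frame is a hypothetical stand-in for the census file, and the column names (`tree_dbh`, `stump_diam`, `spc_latin`, `root_stone` and so on) and category labels are assumptions based on the names mentioned in the captions.

```python
import pandas as pd

# Hypothetical six-row stand-in for the census file. With the real data you
# would load the downloaded CSV with pd.read_csv instead.
raw = pd.DataFrame({
    "tree_id": [1, 2, 3, 4, 5, 6],
    "tree_dbh": [11, 4, 450, 0, 23, 9],    # inches
    "stump_diam": [0, 0, 0, 30, 0, 0],     # inches
    "status": ["Alive", "Alive", "Alive", "Stump", "Alive", "Dead"],
    "health": ["Good", "Fair", "Good", None, "Poor", None],
    "spc_latin": ["Acer rubrum", "Quercus palustris", "Acer rubrum",
                  None, "Acer rubrum", None],
    # The string "None" is a category (no stewardship signs); the bare
    # Python None marks a genuinely missing value.
    "steward": ["None", "1or2", "None", None, "3or4", None],
    "root_stone": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "root_grate": ["No", "No", "No", "No", "No", "No"],
    "trunk_wire": ["No", "Yes", "No", "No", "No", "No"],
    "latitude": [40.72, 40.79, 40.71, 40.71, 40.66, 40.77],
    "longitude": [-73.84, -73.81, -73.93, -73.98, -73.98, -73.96],
})

# Show every column instead of the "..." placeholder; rows stay truncated.
pd.set_option("display.max_columns", None)

# 1. Keep only the columns relevant to the question; location is dropped.
trees = raw.drop(columns=["latitude", "longitude"])

# 2. Missing values per column, and which statuses the gaps belong to.
missing_per_column = trees.isna().sum()
status_when_health_missing = (
    trees.loc[trees["health"].isna(), "status"].value_counts()
)

# 3. Numeric overview, then the suspiciously large diameters.
numeric_summary = trees[["tree_dbh", "stump_diam"]].describe()
large_trees = trees[trees["tree_dbh"] > 50]

# 4. Categorical columns: how common each value is, and whether the
#    labels look standardized.
species_counts = trees["spc_latin"].value_counts()
steward_options = trees["steward"].dropna().unique()

# 5. Yes/No counts for several problem columns at once: apply
#    value_counts to each column and line the results up side by side.
problem_columns = ["root_stone", "root_grate", "trunk_wire"]
problem_counts = (
    trees[problem_columns]
    .apply(pd.Series.value_counts)
    .fillna(0)       # a label absent from a column means a count of zero
    .astype(int)
)

print(missing_per_column)
print(status_when_health_missing)
print(numeric_summary)
print(large_trees)
print(species_counts)
print(steward_options)
print(problem_counts)
```

Nothing here modifies or drops rows based on what is found; as in the video, the point is only to note what looks missing or suspicious so that cleaning decisions can be made later.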

Info

Channel: Mısra Turp

Views: 2,595

Rating: 5 out of 5

Id: OY4eQrekQvs

Length: 29min 19sec (1759 seconds)

Published: Mon May 24 2021