How to predict House Prices in Ames, Iowa | Kaggle Competition in Data Science & Machine Learning

Captions
Welcome to this video where we are looking at the prediction of house prices. What you see here is a GitHub page, the link is github.com/webartifex/ames-housing, and it contains all the code and all the data files you need to follow what I'm about to present. This is a presentation of using some simple machine learning algorithms to predict house prices.

If you go to this link you see a couple of files and one folder. At the top there is a data folder; if you click on it, we see that it contains a couple of CSV files and some other files with the data. Then we have four files that start with the numbers one, two, three, and four. These are so-called Jupyter notebooks, and they are the format in which we present the analysis of this case study. Then there are some other files. This case study is actually based on a scientific paper, and that PDF file is also included: you can click on it and open an article from the Journal of Statistics Education where the authors basically discuss everything that we are also looking at, and at some points we will contrast what we find with this paper. You don't have to read it to understand the video, but if you want to dig a little deeper it may be worthwhile. There is also a map.pdf file which shows you a map of the city of Ames in Iowa in the United States, which is where the data set is taken from. It shows all the different neighborhoods in different colors, and all the houses are from these neighborhoods. As we shall see, the neighborhood is an important feature for whether the price is going to be high or not so high.

So what do you need to replicate what you are about to see? First of all, you need a working installation of Python; I'm using Python 3.7 here. The easiest way for a beginner to get this is to go to anaconda.com. Anaconda is a private company that offers commercial products around the Python software but also has an open source distribution called the Individual Edition. You click on that, scroll down, and you can download it; there are installers for Windows, macOS, and Linux, so it should work on whatever system you have. The good thing about Anaconda is that it not only ships the current version of Python but also many so-called scientific libraries: for example Jupyter, which we will need in this case study, but also NumPy, scikit-learn, which is the default machine learning library in Python, and pandas, which stands for "panel data" and is the default library for dealing with data, basically the Excel replacement if you will. There are some other libraries you don't really need, but it basically ships with everything you would need in a typical data science project. If you don't want to get Python from anaconda.com, your alternative is to go to python.org and download a pure version of Python, but then you have to install Jupyter and all the other third-party libraries on your own, so I think for a beginner it's easier to just download Anaconda. As I said, this is all built in Python, and if you are missing some basics in Python, I am also the author of some Python introduction materials which you will find at a similar URL: github.com/webartifex/intro-to-python.
There you have files containing an entire semester course on Python, with exercises as well, and most importantly, for every chapter there is a link to a YouTube video, so you could basically review the basics of Python if you still miss some of the background. For further background on how to make this project work, there is also another help page, jupyterlab.readthedocs.io. JupyterLab is the environment in which we will program Python, so if you have any trouble installing it, or any trouble figuring out which keyboard shortcuts you may want to use and what else you can do with it, this is the best resource to look that up.

Then there is one more resource, which is the original Kaggle competition. Kaggle is a company where many other companies and organizations can upload data sets and make them available for free to individuals around the world, who can then participate in so-called competitions and try to solve some data-driven problem. The Ames house prices data set is also distributed on Kaggle, even though it's available outside of Kaggle as well, so Kaggle just hosts the competition. If after this video you are still interested in learning more about this project, what you will find there are many tutorials around this data set, but also the solutions of other groups around the globe that worked on this case study. So if you are not sure whether you have the best solution, or what else could be done in terms of math and statistics, you can take a look at the competition page.

If I close this now in my web browser, we see localhost:8888, which is the JupyterLab environment running in my local Chromium browser. If you cannot install that for whatever reason, further down on the GitHub repository you have links for the four notebooks to a service called MyBinder. MyBinder is an interactive service which allows you to open a notebook in a web browser without installing anything, so if everything fails you can always go back and try to follow the analysis there. However, you have to know that this is a temporary environment in the cloud, so it is probably better to install Jupyter and Python and everything, as I already said, on your local machine. This is in particular better if you want to use your computer's calculating power, because MyBinder in the free version does not have much computing power available.

Okay, so let's go to the lab. I assume that somehow you will find a way to open the project in JupyterLab. The easiest way to download the materials, by the way, is to click on the green button where it says "Clone" and choose "Download ZIP"; this is how you get all the files that I have here as a zip file. If you are familiar with the so-called git tool, you could also git clone it, which is why this button is called clone, but I assume you may not have knowledge of this yet, so I'm not talking about it here. Again, the easiest solution is to just download the zip and unpack it; this should make it work. Then, when you open localhost on your machine, this opens an instance of the JupyterLab environment, and what you see on the left-hand side are all the files that you just downloaded from GitHub, and then you see this launcher. If, for example, we click on the notebook tile for Python 3, we get a new JupyterLab notebook where we can enter any Python code in these code cells, for example 1 + 1, and if I execute this I get back 2.
So I can basically create new code files in this environment and execute them; however, for this presentation I prepared the four notebooks already, and we will use them to go over the case study. We start with notebook number one, called data cleaning. Whenever you open a notebook from someone else and you want to replicate an analysis on your local machine, one good practice is to click on Kernel and choose "Restart Kernel and Clear All Outputs". This gets rid of all the output that may have been there before, because the person who prepared the notebook saved the file with the output; now that we run the code ourselves, we don't want any old output in there, and clearing it makes sure that nothing is left from previous runs.

These notebooks that I prepared for you have lots of text in them, so they are also optimized for reading as standalone materials. I will go over the notebooks rather quickly, so don't be afraid that you cannot read everything as I go over it; that is not the intention. The intention is that if you want to dig deeper into a specific area of a notebook, there is lots of text and documentation that helps you do so, but in this video we only do a high-level overview of how to do data science in the context of house prices.

What I do at the beginning of this notebook is what I call housekeeping: we import some libraries, for example NumPy, pandas, and so on, to make them available within this Jupyter notebook. Then, in the next code cell, I say "from utils import ...". utils is not a library that you install but rather a file, a .py file: here in the folder you see a file called utils.py, and if you click on it, a text file opens that contains lots of Python code. I put all the code that is not so relevant to be looked at in a notebook, and also code that is reused across several notebooks, into this file; this is basically how Python code looks when it is not written in a notebook environment. From this module we import some helper functions and helper variables to make the notebooks easier to follow, so whenever you don't find some code, or there is code you don't understand, it is most likely in this utils.py file. Okay, let's delete this temporary file as well and continue in the notebook. So again, I import all this helper stuff, and then, further in the housekeeping, I tell pandas to show me one hundred columns; by default pandas will not show that many columns, and we set this to one hundred because the data set contains a lot of columns.

First, we have some code that loads the data file. It is built such that, when the data file, the CSV file, is not in the project already, it goes to the original web page, which is at amstat.org, the official page where the data comes from, downloads an Excel file, prepares it, and then temporarily stores it in your data folder, so that you don't have to go to this URL and download the data again and again. We call this caching: the data is temporarily saved in the data folder as well.
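To make the caching idea concrete, here is a minimal sketch of what such a loader could look like; the actual implementation lives in the repository's utils.py, and the file names and download URL used below are assumptions, not quotes from the project.

```python
from pathlib import Path

import pandas as pd

# Paths and URL are illustrative assumptions; the real logic lives in utils.py.
DATA_DIR = Path("data")
CACHE_FILE = DATA_DIR / "data_raw.csv"
SOURCE_URL = "http://jse.amstat.org/v19n3/decock/AmesHousing.xls"


def load_raw_data() -> pd.DataFrame:
    """Load the raw Ames data, downloading and caching it on first use."""
    if CACHE_FILE.exists():
        return pd.read_csv(CACHE_FILE)
    DATA_DIR.mkdir(exist_ok=True)
    df = pd.read_excel(SOURCE_URL)  # needs an Excel engine such as xlrd installed
    df.to_csv(CACHE_FILE, index=False)
    return df
```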
The file was there to begin with because I put it in the repository, so that you don't have to download it; if you don't have a fast internet connection, you already have the data here. Then we run some code that puts all the data into what we call a pandas DataFrame. A DataFrame, in the pandas library in Python, is a special data type which you can compare to Excel; it is basically how you would model Excel-like data in Python. Then, with the .head() method, we look at the first rows: .head() takes a number, so if I replace the 10 by a 5, for example, I only see the first 5 rows, and if I want to see the first 10 rows of the "Excel sheet", so to say, I just pass 10 and run the cell again.

As I already told you, there are many columns in this data set. We see that the last column is the sale price; this is the variable, in US dollars, that we want to predict, and in order to predict it we have all the columns that precede the sale price column. Every sale has an order number and, I think it's called a parcel ID or something, and then many, many features over which we will go now. This is usually how a data science project starts: you are given some raw data from some source, and usually some data is missing, some data is messy, it may not be clean. We will, throughout the first two notebooks, go over all the features and clean them a bit, clean the data set, then we will create features out of them, and only in the last chapter, chapter four, will we do some forecasting, some predictions.

Let's go over the next couple of cells rather quickly. As we see, some of the columns have spelling mistakes, and we will replace them with unified text labels; this is what these code cells do. Throughout this notebook you will also see many so-called assert statements in the code, where I assert that some condition is true for the entire data set. This is how I run quick checks to make sure that, for example, one column is never empty, or one column only contains integers or data of a certain type, and so on; you will see this quite often, so that I make sure an entire column is in the format I expect it to be. pandas also provides some other attributes for every DataFrame. The variable df is now the DataFrame, the variable that symbolizes the Excel-like data, so to say, and by saying .shape we get back the dimensions: the sheet has 2,930 rows and 80 columns. We will remove some of the rows, because some of the data is not clean, and we will also create many more columns, because some of the existing columns are not really useful for making predictions, so we will derive new columns from them that are more useful.

The 80 columns can be grouped into four different groups by their generic type, so to say. One of these types is what I call continuous variables; these are numeric values that come on a continuous scale. Of course there are discrete variables as well, where you have one, two, three, four, or five rooms in a house, so that would be a discrete column. But first we look at the continuous variables, and we assert with some quick tests that the data are really continuous.
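For illustration, here are a few assert-style sanity checks in the spirit described above; the column names ("SalePrice", "Yr Sold") follow the raw Ames data and the exact conditions are assumptions, not the notebook's literal checks.

```python
# Quick checks that fail loudly if the data is not in the expected shape/format.
assert df.shape == (2930, 80)                   # rows and columns before cleaning
assert df["SalePrice"].notna().all()            # the target is never missing
assert (df["SalePrice"] > 0).all()              # prices are strictly positive
assert df["Yr Sold"].between(2006, 2010).all()  # sales fall into the documented period
```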
These are all the columns that hold continuous numerical data, for example the number of square feet of the first floor of a house, the number of square feet of the second floor, and so on; we have the descriptions as well, and then many more, like the garage area, the gross living area, the lot area, and so on. These are all different measurements, but they all come as continuous numbers. If we look at the first five rows of only the continuous variables: continuous_variables here is one of the helper variables defined in the utils module, basically a shortcut for all the column names listed here, so I don't have to spell them out; this is why we use utils. Here we have all the continuous data in the data set, and we see that most of it really is continuous: the square feet can take basically any number. It is always an integer value here, but it is a good approximation of a continuous number because we have many different realizations of this value, and that is what makes it continuous.

Then we can look at some basic statistics. Most of the continuous columns are never null, which means no data points are missing, but for one column, called lot frontage, we only have about 2,400 available data points, so there is a lot of data missing; we will see how we deal with these missing data in a bit. We keep track of the variables that we want to take a closer look at later on.

We do the same type of first check on the discrete variables, so let's quickly go over them as well. These are discrete columns: for example, the number of bedrooms would be a discrete variable, or whether the house has a basement or not, which is basically a yes-or-no question, how many garages do I have, how many cars can I put there, and so on. Looking at all the discrete variables, we can already tell these are typical discrete variables. We also see the year here; the year is basically the one variable that is close to continuous, and we could argue it is continuous, but the year in which a house was built or sold is, to me, more of a discrete variable. The reason I do these checks is that the different groups of variables allow us to do different things with them, which is why we look at the groups independently.

Another group that we have is nominal data. If we look at some of the nominal columns, these are fields that are encoded as text; a label could be, for example, that the house is of this or that style, so a word describing the house, and so on. The neighborhood, which I already told you about and which we saw on the map, is basically just the name of the local neighborhood within Ames, Iowa, and the street name, for example, would also be a nominal feature, and so are the others here; we will look at all the features in detail throughout this video. Here is a brief view of the nominal variables, and we can indeed verify that these are nominal features. If we look at the statistics, we get a fuller picture: there is only one column that has some missing data; all the other columns are basically always filled.
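As a sketch of these group-wise first checks, something along the following lines could be run; the column lists are shortened examples here, while the full lists live in utils.py.

```python
# Illustrative (incomplete) column groups; the full lists come from utils.py
# as continuous_variables, discrete_variables, and so on.
continuous_variables = ["Lot Area", "Lot Frontage", "Gr Liv Area", "Garage Area", "1st Flr SF", "2nd Flr SF"]
discrete_variables = ["Bedroom AbvGr", "Garage Cars", "Fireplaces", "Year Built"]

print(df[continuous_variables].describe())    # summary statistics per group
print(df[continuous_variables].isna().sum())  # e.g. Lot Frontage has many missing values
print(df[discrete_variables].describe())
```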
The fourth category of variables, which is related to nominal data, is what we call ordinal variables. The difference between nominal and ordinal variables is that for ordinal variables you also have words describing a feature, but these words can be brought into a natural order. For neighborhood or street name there is no order, but if we look at certain features that are ordinal, let's take an example: these are usually features that describe the quality of something, how good a shape something is in, for example what the fireplace quality is, how big it is, whether it is new or old, and so on. These are all abbreviations that the authors of the data set used, and if you want to look up what they mean, the source where we got the original CSV file from also contains a text file in which every column is described; that is where you would also read about all the ordinal characteristics. We will work with the ordinal variables soon.

Let's first look at a visualization. Oftentimes, when we do data science, looking at data in an Excel-like format is already quite insightful, but visualizations are usually a lot better for quickly getting an overview of the data. In Python there is a third-party library called missingno, which helps us visualize where in a data set data is missing. What I do here is plot a so-called missing matrix for the four categories separately. This gives us a matrix visualization where we have all individual columns and white areas wherever data is missing in a row. We see there is one column, lot frontage, which we already identified above, which has lots of missing data, and then there is this other column called Mas Vnr Area, whatever that is, which only has two missing data points; all the other rows basically always have something filled in. The data could still be messy or dirty, but at least the other house sales have all the data available. This is important in order to decide what to do with each column: my recommendation, to keep things easy, is to just get rid of the lot frontage column, because then we don't have to deal with missing values there. We could try to extrapolate the missing values, but for the other column, because it's not that many, the easier way is to just drop the affected rows and only keep rows that have data available for all the columns.

Now let's look quickly at the visualization of the other three groups. For the discrete variables we see a similar picture: for the column garage year built we don't have data available for all the houses. In most cases I would guess that the garage is built together with the house in the same year, but sometimes a garage could have been added to a house later on, and sometimes this data is simply missing. So what do we do with it? I think the year in which a garage was built is not that important; the more important thing for a house price is whether the property has a garage at all, so we can basically drop this column as well. The other two categories of variables indicate that we have almost no data points missing, and we see that when data points are missing, it occurs in different rows.
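A minimal sketch of such a missingno plot, assuming the column groups defined earlier:

```python
import matplotlib.pyplot as plt
import missingno as msno

# One missing-value matrix per column group; white gaps mark rows
# where that feature is not available.
msno.matrix(df[continuous_variables])
plt.show()
```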
So, as we saw, there are probably around seven or eight rows that we can simply remove, because they always have one missing value in some column, and that is basically what we will do. In the cleaning part we get rid of the two or three columns that had lots of missing data; these columns are eliminated entirely. For the rows that are missing a value in the two remaining columns, we drop the rows and keep the columns. Then we build a for loop that goes over the entire data set and casts the columns to the data types int or float, just to make sure that when a column is supposed to hold a discrete number, there cannot be a floating-point number in there; this is typical cleaning work. At the end we quickly print out the shape again: we dropped two columns and a couple of rows, but this saves us a lot of work extrapolating data. Then we store the result as data/data_clean.csv, which overwrites what we already had in the data folder; the data folder contains all the data for all the notebooks already, basically hard-coded into the repository, but if the data weren't available, notebook number one would go out to the original source, get the data set, and at the end store a clean CSV file. That is basically the entire idea of this first Jupyter notebook.

Now that we have a clean data set, what do we do with it? In the second notebook, after some housekeeping, which is basically self-explanatory, we load the clean data set with some helper functions; these are again helpers from the utils module that comes with the project. We start with the now already cleaner data and look at some features, maybe the numeric variables first. We often have the square feet of something: for example, here we have the lot area, which is the entire area of the lot measured in square feet, but then we also have a garage area, a basement area, a first-floor area, and so on, and the important thing is that the individual square feet add up to the total. What that means, from a mathematical point of view, is that there are linear combinations of columns that add up perfectly to some other column. Whenever we do, for example, a linear regression, we don't really want this; we want the columns to be linearly independent. So what I check here, with some quick assert statements, is that some of the columns are exactly the sum of a combination of other columns, for example for the square feet but also for the basement and so on. Then we can get rid of some of these columns, because if several columns add up exactly to some other column, this will only confuse a linear regression later on: the regression may, for some rows, lean on the one column and, for other rows, on another. In other words, the coefficient that gets estimated, in the case of a linear regression, may be a very unstable estimator, so it is always helpful to get rid of redundant columns, and this is what we do here.
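To illustrate the redundancy check, here is a sketch assuming the additive relationships described in the Ames documentation; the exact column names and which parts get dropped are assumptions, not the notebook's literal code.

```python
# The exact additive relationships are an assumption based on the Ames
# documentation, not a quote from the notebook.
assert (
    df["Gr Liv Area"] == df["1st Flr SF"] + df["2nd Flr SF"] + df["Low Qual Fin SF"]
).all()
assert (
    df["Total Bsmt SF"] == df["BsmtFin SF 1"] + df["BsmtFin SF 2"] + df["Bsmt Unf SF"]
).all()

# Keeping a total and all of its parts makes the columns linearly dependent,
# so one part (or the total) can be dropped before a linear regression.
df = df.drop(columns=["Low Qual Fin SF", "Bsmt Unf SF"])
```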
Another typical transformation that we will do, and some of you may know this from finance data: when we want to predict prices in any kind of financial model, what is often done is that we don't work with the raw prices themselves but take the logarithm of some column, of some value. In the more general setting this is called a Box-Cox transformation. As I said, the simplest variant would be to just take the natural logarithm of a number, but here we use an estimation technique that is a standard way of estimating the best such transformation for an individual column; if you want to understand this in detail, you can read through the notebook a bit more. I will just run this, and what this code does is go over all the columns, taking only those continuous columns that have non-negative numbers, because a logarithmic transformation obviously only works for positive numbers, and then it tries to estimate what the best so-called lambda would be. A lambda of 0 basically implies we just take the natural logarithm. In other words, what this model suggests is that for the gross living area and the first-floor area we should just take a log transformation, and for the total sale price it also suggests that a plain log transformation would probably be best. So for all of these columns we keep the raw columns as they are, but we add second columns that hold the transformation; for example, at the end we have the original sale price and right next to it the Box-Cox(0) transformation, which is basically the natural logarithm of the value. Later on, when we do the house price predictions, we will train prediction models both for the raw value of the sale price and for the transformation, and then we will check which prediction works better, because sometimes the prediction works better on the actual data and sometimes it is better to fit a model on the transformed data; we will look at both cases and compare.
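A rough sketch of such a Box-Cox step using SciPy; the estimation call and the new column name are assumptions, not necessarily what the notebook uses.

```python
import numpy as np
from scipy import stats

# Estimate the Box-Cox lambda for a strictly positive column.
transformed, fitted_lambda = stats.boxcox(df["SalePrice"])
print(f"estimated lambda: {fitted_lambda:.2f}")

# A lambda close to 0 means the transform is (almost) the natural logarithm,
# so a plain log column is stored next to the raw one.
df["SalePrice (log)"] = np.log(df["SalePrice"])
```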
Next I created a section called correlations, and I defined that a correlation coefficient between 0.66 and 1 is what I consider strongly correlated; if two variables have a correlation coefficient between 0.33 and 0.66, I call them weakly correlated; and below that I call them uncorrelated. I have a function that plots the correlation coefficients, because plotting, as we learned, is often an easier way to look at data, and then we calculate two different kinds of correlations. The first one is the classical Pearson correlation: we create a correlation matrix in visual form, and the indication is that the more solid the color, the stronger the correlation. We don't really care whether the correlation is positive or negative; all we care about in data science is whether we find strong correlation in absolute value, because at the end of the day the sign, whether it is -1 or +1, matters less than how one feature varies when another feature varies. In particular we are interested in the pairwise correlations between the sale price and the other features, and this already gives us a first graphical indication of which features may be worthwhile to dig deeper into and which not; a color close to white suggests that a feature has no correlation with the sale price and may therefore not be worth keeping in the data set.

Then we do the same thing programmatically: I sort the features according to the rules defined above into weakly correlated, strongly correlated, and uncorrelated. Now we get a NameError, "weak" is not defined, because I should of course run all the cells; this is a common mistake people make in JupyterLab, they just skip a code cell. So I calculate it all over again, and usually this works. We create three different sets in Python that keep only the features that are either weakly, strongly, or not correlated with the sale price. The reason I do this is that in chapter four, when we talk about prediction, I will not only contrast the effect of taking the logarithm of the price versus not taking it, but I will also contrast models that are allowed to work with all the features against models that only see the features with some correlation that I identified beforehand. What we try to find out is whether it is worthwhile for me, as the statistician, to look at this manually, define such thresholds, and pick my features by hand, the ones I think are good predictors for the sale price; this is just an intuition, so to say. What we will see is that it is actually not a good way to do it; I can already give away the result here, but we will see this in more detail in chapter four.

Then we have some code that shows us, in a list, the features that are totally uncorrelated with the price. For example, the pool area is uncorrelated: how big the pool is does not seem to have any influence on the total price. In contrast, a strongly correlated variable is the gross living area, and this of course makes sense, because usually when we buy and sell houses and property we take some factor and multiply it by the size of the house, by the number of square feet of living area, and then we get the actual price; this is how price calculations are done by real-estate agents, and this is why it is not surprising that the gross living area is strongly correlated with the house price. Then we have a lot of what I call weakly correlated fields: how big the first floor is, for example, is of course also quite correlated with how big the overall house is, and the bigger the house, the more it will eventually cost.

Then we do the same with the so-called Spearman correlation. Spearman is just a variant where we don't look at the values themselves but at their order, so the Spearman correlation is a rank correlation that measures how the ordering of values correlates. Usually you should just go ahead and work with both the Spearman and the Pearson correlation and see what works better.
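A sketch of how such correlation buckets could be computed with pandas, using both correlation methods mentioned here; the set names and the exact thresholds as coded are illustrative.

```python
# Bucket numeric features by the absolute strength of their pairwise
# correlation with the sale price, using the thresholds defined above.
numeric = df.select_dtypes("number")
pearson = numeric.corr(method="pearson")["SalePrice"].abs().drop("SalePrice")
spearman = numeric.corr(method="spearman")["SalePrice"].abs().drop("SalePrice")

strongly_correlated = set(pearson[pearson >= 0.66].index)
weakly_correlated = set(pearson[(pearson >= 0.33) & (pearson < 0.66)].index)
uncorrelated = set(pearson[pearson < 0.33].index)
```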
The Spearman correlation is of course better suited for ordinal kinds of data, and we now do the same analysis as we did for Pearson with Spearman as well, and we see a similar result. However, now we see, for example, that in the strongly correlated section the number of garage cars is also strongly correlated, and so is the kitchen quality; this now enables us to also look at the ordinal variables in terms of their correlation with the sale price. At the end we save the data. We haven't really removed anything, but we created some new columns, namely the log transformations, and of course we store these log transformations in the CSV file as well.

So far we haven't done anything fancy; we have basically done what we would consider the dirty work. The data cleaning is the dirty work because it is usually what you spend most of your time on, and the pairwise correlations are something I would always recommend looking at in the beginning, so that you not only understand the individual data in the clean data set but also identify some rough correlations, so that you have an idea which features you should spend more time on and which features are not worth too much time.

Now we go into the next chapter, chapter 3, which I call descriptive visualizations. This chapter is basically all about plotting: we will plot lots of graphs, look at individual features, and briefly discuss how well each feature can be used in the actual forecasting model later on. Again we do some housekeeping and then load our cleaned and transformed data. I always keep a .head() call at the beginning of my notebook, so that when I go over the notebook later on I immediately see what data I am working with; it makes the notebook nicer to read. We also keep a list called new_variables that will keep track of all the new features we create, because in this notebook we will not only look at visualizations of features but also create new features out of existing ones. This is called feature generation, and it is a very important process, because sometimes you see some pattern in the data that is not easy for a machine learning algorithm to pick up, so you have to prepare the data set a little bit and create new features out of existing ones to make it easier for the model to learn something later on.

First we create some derived variables. For example, we have a variable called second floor square feet. Now, how big would the second floor be? Usually the second floor is very much the same size as the first floor, because the second floor is usually built on top of the first floor, so I figure that the size of the second floor itself is not really a strong feature for prediction. However, if I create a feature called "has second floor", a yes-or-no feature which just indicates whether a second floor is there or not, this may be a stronger feature, because someone looking for a house may pay a premium if there is a second floor, or maybe not.
We don't know yet what the structure of the house price is, but building such yes-or-no features can help. Take the second feature here, "has basement": the total size of a basement is usually not as important as the fact that a basement is there at all. The same goes for a fireplace: we don't really care if a house has one or two fireplaces, we only care whether it has at least one. So that's what we do here: we create new variables which are binary, and we add them to our new_variables list. I also always include a brief preview of the new features so we see what they look like; again, these are 0-or-1 features, so either a place has a fireplace or not, or it has a garage or not, and no other value is possible.

Now let's look at the second-floor data. I create some pairwise plots: we have the sale price on the y-axis and the gross living area on the x-axis, and we see that the bigger a house is, the more living space there is, the more expensive it is; this is why the cloud of data points goes from the lower left-hand corner to the upper right-hand corner. The color scheme indicates whether the house has a second floor or not, and what we see is that, for a given fixed area, a house with a second floor basically comes at a discount. In other words, people in Ames, Iowa seem to be willing to pay a premium if, for a given size of house, the house has only one floor instead of two; people in Ames apparently don't really like second floors that much, or at least they are not paying a premium for them, which is an interesting realization.

Let's look at basements. If a house has no basement it gets a discount, but we also see that there are not many houses without a basement. So even though in the United States in general it is quite rare for houses to have a basement, in Ames, Iowa the vast majority of houses have one, and therefore having a basement is really not a good indication of whether the price is going to be higher or not, because basically every house has a basement; only the very few houses that don't have one come at a discount. What we deduce from this picture is that people in Ames expect the house to have a basement: they are not willing to pay a premium for it, but they want the house to have one.

Let's look at fireplaces. First, there seems to be a relationship in that a house has to be rather big in order to have a fireplace, which makes sense: if you have a bigger, let's say a luxury house, there is an increased chance that it also has a fireplace, while small houses, which are on the left here, tend not to have one. And if a house has a fireplace, the price seems to increase: given the same living area, having a fireplace yields a premium. This makes sense because we can treat the fireplace as an indicator variable for "it's a luxury house", so to say, and for a luxury house you are willing to, or have to, pay a premium. These are some ways of showing this kind of relationship. For garages, we see that if a house has no garage it gets a discount, but other than that we don't really see any pattern here: beyond a certain price point, every house has a garage.
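As a sketch of these derived yes/no features and the colored scatter plot, assuming the raw Ames column names and illustrative names for the new features:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Turn raw size/count columns into yes/no indicators.
df["has 2nd floor"] = (df["2nd Flr SF"] > 0).astype(int)
df["has basement"] = (df["Total Bsmt SF"] > 0).astype(int)
df["has fireplace"] = (df["Fireplaces"] > 0).astype(int)

# Pairwise view: sale price against gross living area, colored by the indicator.
sns.scatterplot(data=df, x="Gr Liv Area", y="SalePrice", hue="has 2nd floor")
plt.show()
```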
So what we can tell from these variables is that they will take different roles in the prediction model later on. Some of the variables only seem to make sense when taken together with some other variable; only for a cheaper house, for example, does it make sense to look at whether a garage is available at all. So there may be some complex underlying relationships between different sets of variables, but it is still good to at least visualize them so we see what is going on.

If we look at pools, the variable is quite uninteresting. Why? Because most of the houses don't have a pool, and the rare orange dots for houses with a pool are all over the place, so having a pool does not really tell us anything about the house. That is interesting, because without looking at the visualization I would have guessed that a house with a pool must be a luxury house and therefore have a higher price, but it seems this variable is fairly worthless: we don't have a lot of data on it and not many houses have a pool. Fireplaces seem to be a much better indicator of luxury in Ames, Iowa; this may be different in other places in the United States.

Let's look at porches. Porches seem to behave a bit differently than garages: if a house has no porch it comes at a discount, but other than that it is not a very useful variable either, because most houses do have a porch, so it may be very hard to learn from.

Then let's look at neighborhoods. Here I quote the paper that I showed you in the beginning, where the authors say they found that the neighborhood plays a very large role, and this is of course not surprising, also due to the history of redlining in the United States: there are poor neighborhoods and rich neighborhoods, so the neighborhood the house is in is probably the most important indicator of a house price at all. Let's visualize it; a good way to do that is box plots. We have the different neighborhoods on the x-axis, in different colors, and for every neighborhood we see an average, or a median I guess it is, and the entire span; of course we have outliers as well. Box plots usually disregard the outliers, but the boxes show you roughly the range where 95 percent of the houses lie, and we see that there are huge differences, not only between the averages but also in terms of the spread, when it comes to house price versus neighborhood.

Now, the neighborhood variable so far is a nominal variable; it can have up to, I think, 28 different realizations, but we cannot really learn anything from a text column. So in the next code cell we use the pandas get_dummies method to translate the nominal neighborhood feature into so-called indicator variables. What this means is that at the end we get 28 columns that are basically a yes-or-no answer to the question "is the house in this neighborhood?", so 28 binary variables, and we can check that we did the right thing by checking that there is exactly one 1 in every row, because a house can of course only be in one neighborhood.
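A minimal sketch of this get_dummies step together with the one-hot sanity check; the exact column and prefix names are an assumption.

```python
import pandas as pd

# Translate the nominal neighborhood column into indicator variables.
neighborhoods = pd.get_dummies(df["Neighborhood"], prefix="Neighborhood")

# Every house lies in exactly one neighborhood, so each row contains exactly one 1.
assert (neighborhoods.sum(axis=1) == 1).all()

df = pd.concat([df, neighborhoods], axis=1)
```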
We see here that our feature matrix later on will grow tremendously to the right, in its width, because many of the nominal variables we have must be translated into such indicator variables as well; this is how we do it in Python, using pandas get_dummies.

Let's look at the nominal features other than the neighborhood now, starting with what role alleys play. An alley in the United States is a small road or street behind the house, and trash collection, for example, usually happens on the back side of the house; this is quite different from how it is done in Europe. We see that not every house has one, and the few other dots here are actually coded as NA, for not available, so we don't have a lot of information on that. This is also something interesting: when I go back to the data cleaning part, what we did there was delete all the rows that did not have all values filled in, but as we see here, that does not mean there is no missing data anymore, because in the alley column the most common value is literally "not available". We have to be careful here: even though technically the data point is not empty, it is practically empty, and because of that we delete the column, since a feature is really not helpful if it is missing for the majority of the data set.

Then let's look at the feature called building type. There are different types of buildings in Ames: the most common one is the one-family home, called 1Fam, then we have two different kinds of townhouses, in orange and green, then a duplex, which is down here, and a two-family condo, which is also rare and also down here. So we see that the type of a house does play a role. To make the feature a little easier to work with, we go ahead and merge the two townhouse categories into one, the green and orange ones, which behave about the same; it seems the orange ones are slightly bigger townhouses and the green ones slightly smaller, but the townhouses themselves all sit in about the same region, and if we only look at the townhouses, the slope of the data cloud is rather constant. So we lump those two groups together into one to get a stronger signal later on. We also do that for the two-family condo and the duplex, which are both down here: if we look at the violet and the red dots, they are mostly down here, so it makes sense to lump them together because, in terms of pricing, the two types of houses really seem to be one. This is something that requires manual work: a computer wouldn't see that these two categories are basically the same. As people, when we categorize things, we make up some categories, and in this case we made up two categories where, statistically speaking, we should have only made up one; so this is basically a way of cleaning up what the statistician came up with. Then, of course, we have to create dummy variables again, so we get 1-or-0 variables again, because that is the only thing we can work with.
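A sketch of such a lumping step before the dummies are created; the building-type labels (1Fam, Twnhs, TwnhsE, Duplex, 2fmCon) follow the Ames documentation and should be treated as assumptions here.

```python
import pandas as pd

# Lump value labels that behave alike before creating dummy variables.
df["Bldg Type"] = df["Bldg Type"].replace(
    {
        "TwnhsE": "Twnhs",   # end-unit and inside-unit townhouses price alike
        "2fmCon": "Duplex",  # two-family conversions price like duplexes
    }
)
building_types = pd.get_dummies(df["Bldg Type"], prefix="Bldg Type")
```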
Let's look at the air conditioning. What we see is that if a house has no air conditioning it comes at a discount, so this seems to be a very important variable. Again, this is a nominal variable which only has two labels, two realizations it comes in: Y for yes and N for no. So even though the raw data provides it as a nominal text column that only contains Y's and N's, it is really a binary data type, and we get the meaning out of it by casting it to the correct data type, which in this case is obviously again a 1 or 0. We end up with one column called air conditioning that has a 1 if the house has air conditioning.

Then we have a category called proximity to various conditions, and this is a really messy column, or rather two messy columns: there are two columns, condition 1 and condition 2, and they are basically columns that can contain text; that is the best way to describe it. Let's look at the most commonly used values. The abbreviation "Feedr" is one of the realizations that comes up quite often, and it basically means the house is close to a feeder street, a street that funnels cars to the next highway, so to say. So if the house is next to such a street, this may have a bad influence on the price, because at the end of the day no one wants to live close to a big street. Then there are some categories called RR-something; these are different categories regarding railroads, and you can look up what they mean in detail, but what we do is go ahead and lump all the different railroad categories together into one new category called "near railroad", yes or no. This is the kind of transformation that needs to be done so that we can actually use these labels. Let's look at what this means with a visualization: as I said, if you live close to a feeder street you get a discount. So in the next code cell I create a new variable for the feeder street, a new variable called railroad, and there is also one called park; if you are close to some park, there are several ways of saying that in the data set, and we make it one unified way of saying it, lumping all the different park categories together into one big category called park. Now, if I run this, it also creates some plots, and we see that the different categories sit on different slopes in the data cloud, so they will have different effects. This is a very simple kind of plot from the library matplotlib, or rather seaborn; it is called an lmplot, and an lmplot basically overlays a simple linear model on the scatter plot without us doing the regression ourselves. We see different slopes here, and this is what we are looking for as data scientists: we want to see different correlations, how strongly something is correlated with something else, and of course the slopes, which are a different way of saying that something is correlated or not.
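A minimal lmplot sketch in seaborn; the hue column stands for the newly created indicator, and its name here is an assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# An lmplot draws the scatter plot and fits one regression line per hue group,
# which makes differences in slope between the categories easy to spot.
sns.lmplot(data=df, x="Gr Liv Area", y="SalePrice", hue="near railroad")
plt.show()
```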
So here we create the categories I talked about, and then we finally delete the condition columns, because the text conditions don't really help us in any prediction; these are the new features we created here.

Let's quickly look at another feature called exterior. This is also a text-like feature: it captures, in two columns, the two most commonly used materials the house is built of, and we can see that there is some pattern in it; some houses are built of materials that are more pricey and some of cheaper materials. We will look at this, but what I found is that the category is too diverse, so we couldn't really use it to make any good prediction. I'm already anticipating the result here, but of course I played with the data a lot more, and I found that the material the house is made of may have an impact in theory, but in practice, for our models, we didn't really see much of a difference, so we delete the column.

Then there are some other features, one called foundation, and here we see a strong pattern: depending on what foundation the house has, there seems to be a price discount or a price premium, depending on how you look at it. So we create dummy variables for the different foundation types; what they are in detail, let's not get into here, you can read it up. We keep these variables because we can really measure a linear relationship and see different slopes, so it makes sense to keep them in a clean, 1-or-0 way.

Then we look at some other features that are not so relevant. Take the feature called garage type: it tells us whether the garage is attached to the house or detached, a bit further away from the house, whether it is a carport, and so on, and what I found is that this really didn't play much of a role, so we just delete it as well; it is more important whether a house has a garage at all than what kind of garage it has. If we look at heating, we see different data dots, but most of the houses have gas heating, and for the other heating types, gas water and so on, whatever they are, there is not much variation, so this is not a good feature to learn from either and we basically get rid of it as well. Let's look at house style: here we think we see a pattern, but the slopes are not too different, and when I worked with it I found that this feature, the house style, is not so important either. It says whether it is a one-story or a two-story house, but we already have another feature, "has second floor", up there which we are using, and the house style is first and foremost a text field, so it is not so important because we already have other variables that carry basically the same kind of information, and the slopes here are not too different; this is why we get rid of this feature as well. For land contour we see some sort of a pattern, but there are really not too many data points here, so we got rid of this as well.
There are some more: lot configuration, if we look at it, is a little bit messy, there is not too much variation here, so we don't use it either. You see that, in general, I get rid of many of these fields, because at the end of the day they are fields that are either already covered by another field we have, or they are just too messy and we cannot really deal with them. You cannot just take any text field, leave it in the data set, and give it to the prediction algorithm later on: a feature has to be turned into a 1-or-0 dummy kind of column first, and sometimes that is too hard to do. And at the end of the day, if you have too many 1-or-0 columns, you get another problem, which is called the curse of dimensionality; we don't want to run into that either, so we have to find a balance and not create too many variables, and this is why I'm removing many of them here. Also miscellaneous: this is a column that you will find in basically any data set of any sort, something called "other" or "misc" or "miscellaneous", and as we look at this data cloud it isn't really helpful. Let's look at the roofs: there are different kinds of roofs, but most houses in Ames have the same kind of roof, so again this field is not really helpful, and we get rid of it as well.

Now we come to some more interesting fields. There is one text field, the sale condition, which among other things covers abnormal sales, in other words foreclosures: some person goes bankrupt, the house is foreclosed, the bank takes away the house, and the sale then does not go through a normal sale process but through a very rapid one, so we can expect the house price to come at a discount. Let's look at this. First, what categories are there? There is a normal sale, then we have a partial sale, and we have the abnormal sale, and again abnormal means a bank foreclosure. Let's look at the data plot: the foreclosures all come at a discount, so whether a house was basically auctioned away by a bank or not is a very important detail when we want to predict house prices. Partial sales seem to be cases where someone has a house and is only selling some part of it, some unit within the house, and this seems to come at a premium, so we keep features derived from that. We have an lmplot again, and we see that partial and abnormal sales result in different slopes, which indicates that we should keep them, and we should keep the variables in a clean way. So we create clean 1-or-0 variables here again: three new features, partial sale, abnormal sale, and new home. New home indicates whether a house was sold for the first time or not, and whenever a house is sold for the first time it also comes at a slight premium.

Street names as a feature are not valuable at all. We saw that the neighborhood is important, and we could argue that the street a house is in is related to the neighborhood it is in, but it is too granular: there are far too many different streets, and we would again run into the curse of dimensionality, ending up with too many 1-or-0 variables, basically one for every street, and this will not be helpful in the prediction later on.
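The three sale-related indicators described above could be built roughly like this; the column and label spellings ("Sale Condition", "Sale Type", "Partial", "Abnorml", "New") are assumptions based on the Ames documentation, not the notebook's exact names.

```python
# Clean 1-or-0 features derived from the sale-related columns.
df["partial sale"] = (df["Sale Condition"] == "Partial").astype(int)
df["abnormal sale"] = (df["Sale Condition"] == "Abnorml").astype(int)
df["new home"] = (df["Sale Type"] == "New").astype(int)
```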
Some more interesting features we can develop concern the age of a house. If you look closely at the data set, you will see columns called year built and year remodeled, and we also have a year sold variable. The idea is this: it does matter when a house was sold, because of inflation; if one house was sold in 1980, another in 1990, and the next in the year 2000, we would expect prices to increase due to natural inflation. However, that is not what we are looking at here. We are looking at how old the houses are, that is, the difference between when the sale happened and when the house was built, and this variable does not exist in the data; there is no column called age. So we simply create it, and we create a similar variable for remodels. Whenever a house gets very old, it is usually remodeled by some construction company and then sold again, and it usually gets a premium because it was modernized; this is what we capture with the remodeled yes-or-no variable. Then we have variables called years since built and years since remodeled, so we capture the age and the time since the remodel, and we see there is some variation again: most houses are sold when they are new, and beyond a certain age there are not so many left, which is probably also because the city of Ames is not that old.

We create some one-or-zero variables here again and plot them. I forgot to mention one thing about feature levels: instead of only using years since built and years since remodeled, which are continuous variables, we can also create feature levels like recently built or recently remodeled. These are one-or-zero variables created by asking whether the house was built within the last ten years (yes) or more than ten years ago (no). This translates the continuous variable, the age of the house or the time since it was remodeled, into a binary variable, and we keep it as a secondary variable, because the continuous variable by itself may not help so much, and the binary variable can serve as a proxy. In the next chapter we will of course look at both and see which one works better. So again we get rid of some of the variables here.

Lastly, before we end this chapter, we have to take care of some outliers. You may wonder why I did not take care of outliers earlier; well, now that the data is clean we can run an automated algorithm to detect them, and I chose one called an isolation forest. You can do your own reading on what that is; in short, it is a machine learning algorithm, a statistical method, that looks at the entire data set and, given some parameters, removes rows that are too different from the rest of the rows. This is a way of automating the outlier removal process. We could do it manually as well, but the interesting thing is that I plotted the outliers.
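A sketch of the two age features, their binary proxies, and the automated outlier removal with an isolation forest. The column names follow the public Ames data dictionary and the contamination value is an illustrative guess, not the notebook's setting.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy rows standing in for the cleaned Ames data.
df = pd.DataFrame({
    "Year Built": [1995, 1960, 2008, 1920, 2005],
    "Year Remod/Add": [1995, 1999, 2008, 1950, 2006],
    "Yr Sold": [2008, 2009, 2010, 2008, 2010],
    "SalePrice": [200_000, 140_000, 310_000, 60_000, 255_000],
})

# Derived age features plus the binary "recently ..." proxies.
df["Years Since Built"] = df["Yr Sold"] - df["Year Built"]
df["Years Since Remodeled"] = df["Yr Sold"] - df["Year Remod/Add"]
df["Recently Built"] = (df["Years Since Built"] <= 10).astype(int)
df["Recently Remodeled"] = (df["Years Since Remodeled"] <= 10).astype(int)

# Automated outlier removal: keep only the rows the isolation forest
# labels as inliers (fit_predict returns +1 for inliers, -1 for outliers).
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
inliers = forest.fit_predict(df) == 1
df = df[inliers]
print(df)
```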
In that plot we see all the outliers visualized: these are houses that are extremely big or extremely pricey, or the opposite, extremely small and extremely cheap, and they are removed. The outliers that were removed by the authors of the paper are also contained in this set; in other words, our automated way of removing outliers detected all the outliers that the statisticians behind the paper also discovered and removed, and I think we removed one or two more due to statistical properties. At the end, all we do is store the data set, which is now not only cleaned but also contains all the new features we generated.

We have come a long way: from the raw data set, to some first approximate cleaning, then a look at some correlations, and now, in this data set, a lot more manual work in many more steps. The new data set is 109 columns wide; we started with the initial 78 columns and are now at 109, and we discarded some more rows along the way. At the end of the day we transformed the data set a bit more, and we now have a feature set that we can actually use for prediction. As I said throughout this chapter, some of the raw features are simply not suitable for statistical models.

Now, in the final step, we open chapter four, which is called predictive models, and we will run some machine learning algorithms on the clean data set and also on the not-so-clean data set, just to show the contrast. One caveat: this is not about getting the best possible result on the Ames house price data set. If you want to see the best solutions and the best predictive models, go to, for example, the Kaggle competition page and look at some of the competitions there. The idea here is to contrast the cleaned data set with the not-so-clean data set, and also to showcase the subset of variables that we manually selected in chapter two by only looking at correlations, and then to take some learnings from that.

Okay, we import some things, do some housekeeping, and load the original data set, which we called data clean. This is the data set as it was after the first chapter: it is already a little bit cleaned, there is no missing data in it, for example, but there are no transformations, no Box-Cox transformation, no log transformation, no new features, nothing. We call this data set df1. Then we encode the ordinals. There is a helper in scikit-learn, the standard machine learning library in Python, that can encode the ordinals automatically; this basically automates the work we put in in chapter three. However, I will show you that scikit-learn does a much worse job than we did, because as humans we see that there is an underlying story, an intuition, behind the nominal values. That is why we dropped some of the columns, lumped several values into one uniform value, and did all the cleaning in chapter three. The automated way in scikit-learn cannot do that; it can only mechanically turn the ordinal and nominal characteristics into flat numeric codes or zero-or-one dummies.
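The audio does not make clear which scikit-learn helper the notebook uses for this, so the following is a sketch under the assumption that it is `OrdinalEncoder` (a one-hot encoding would work similarly); the column names are made up. Note how the codes are assigned alphabetically, without any of the domain intuition discussed above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the barely cleaned data set df1.
df1 = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250],
    "Kitchen Qual": ["Gd", "TA", "Ex"],
    "Foundation": ["PConc", "CBlock", "PConc"],
    "SalePrice": [208_500, 181_500, 223_500],
})

# Let scikit-learn turn every remaining text column into integer codes ...
text_cols = df1.select_dtypes(include="object").columns
df1[text_cols] = OrdinalEncoder().fit_transform(df1[text_cols])

# ... and split the numeric-only frame into a feature matrix and a target vector.
X1 = df1.drop(columns="SalePrice").to_numpy()
y1 = df1["SalePrice"].to_numpy()
```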
That means we can now pass this data on to the scikit-learn models; however, the data set is not really good. This is just the minimal transformation needed to make it work mathematically, and as we will see, it does not give a really good result. The data set now contains only numbers, all the nominal text values are gone, and we store it in a matrix X1 and a target vector y1 for the price. The usual convention in machine learning is that a big feature matrix X is fitted to a target vector y, so we are no longer using a pandas DataFrame but real matrices and vectors from NumPy.

Now we look at the improved data. This loads the data from the end of chapter three, so the transformations are in there and the factor variables have been created. We do the same kind of split here as well and store it in X2, and because we also have a log transformation we now have y2 and y2_log, the price as a log value.

Then, for comparison, we do what I mentioned in chapter two: we keep only the variables that are strongly or weakly correlated with the sale price. As I showed you in chapter two with those correlation matrices, what we could do as humans is drop all the features that are only weakly correlated, or not correlated at all, with the sale price and not look at them in the prediction. That is basically what we do here: we only work with the strongly and weakly correlated features, only the features that have a correlation coefficient of at least 0.33 with the sale price. This is something we could have done as humans, but it turns out, as we will see, that it is something we should not do. The learning will be: always give the machine learning algorithms all the clean features you have, and do not try to sub-select them and be smarter than the algorithm, because once you have a very clean data set the algorithms are already very good at selecting from the feature set on their own. We do this and store it in the variables X3, y3, and y3_log. So now we have three data sets as matrices and vectors.

In the next step I create a helper function called cross_validation, which runs a cross-validation for any machine learning model I pass to it; by default it does a 10-fold cross-validation. In a nutshell, cross-validation means we take a data set and split it into ten equal parts. We take nine of them, so 90 percent of the data, fit the model, then predict on the 10 percent we did not use for training and calculate an error measure on that 10 percent. Then we do the same for another set of nine folds, and we repeat this ten times, until every 10 percent of the data has been used for evaluation once. This way we train the model ten times, make predictions ten times on data the model has not seen before, and average the error, which gives us an unbiased estimate of how the model would perform in the real world on unseen data, because the whole idea of the prediction is to predict the price of a house we have not seen before.
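A reduced, illustrative sketch of such a helper, not the notebook's exact function (which also reports bias, mean absolute error, and maximum deviation): 10-fold cross-validation reporting the averaged R² and RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

def cross_validation(model, X, y, n_folds=10):
    """Fit `model` on 9/10 of the data and score it on the held-out 1/10, ten times."""
    r2s, rmses = [], []
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        r2s.append(r2_score(y[test_idx], pred))
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return {"r2": np.mean(r2s), "rmse": np.mean(rmses)}

# Quick check on synthetic data standing in for X1 / y1.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)
print(cross_validation(LinearRegression(), X, y))
```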
Inside the helper we calculate different kinds of error measures, most notably the root mean squared error, which most of you should know, and the R²; then there are some others called the bias, the mean absolute error, the maximum deviation, and so on, but we will focus on the R² and the RMSE here. So we define the helper function; you will often see this in an analysis, where you define one or two helper functions once and then use them over and over again. We also have a Python dictionary called results where we store all the results so we can compare them at the end.

Now we run our first couple of models, starting with a simple linear regression, as we all learned in Stats 101. In scikit-learn this works as follows: we take the algorithm called LinearRegression, which we imported before, initialize it with the call operator, and store the model in a variable called lm. Inside the cross_validation function we pass lm in as the model argument, and somewhere in the cross-validation the code calls model.fit, the .fit method on the model, and then the .predict method. That is basically how scikit-learn works: fit the model, then predict on new data, and all of that is automated in the cross_validation function.

So let's create a new linear regression model and run it on our original data, the data we only barely cleaned without any feature generation, and run the ten-fold cross-validation. Looking at the results, one thing that already tells us something bad happened is that the R² is negative, basically negative infinity. The R² (or the adjusted R²) is usually between 0 and 1; only in rare circumstances can it actually be negative, and a hugely negative value means something went terribly wrong in the model. This already indicates that the linear model on the raw, least-clean data set is not a good model, and the root mean squared error is also very, very high, so we should not trust this model. It is the easiest benchmark we have, but it is a really bad benchmark.

Now let's use our improved data; this works with all the new features we generated in chapter three. We do that for two cases, once for the normal price and once for the log-scale price. Running the linear regression, we immediately get an R² of 0.92 and a much lower root mean squared error than above. That means all the time we put into cleaning the data set and generating new features was really worth it: we improved the prediction by a lot, and the bias is very low. A bias of roughly minus 89 dollars means that on average our model predicts a price that is about 90 dollars too low; in the original run the bias was basically plus infinity, so now we understand the error measures even better. Now let's run it on the log scale as well.
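Since a log-scaled target comes up here, a tiny self-contained illustration (made-up numbers, not the Ames data) of both the initialize/fit/predict pattern used inside the helper and the back-transformation of log-scale predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1_000.0], [1_500.0], [2_000.0], [2_500.0]])  # e.g., living area
y = np.array([100_000.0, 145_000.0, 210_000.0, 260_000.0])  # sale price

lm = LinearRegression()                      # initialize with the call operator
lm.fit(X, y)                                 # fit on the normal-scale price
print(lm.predict([[1_750.0]]))

lm_log = LinearRegression()
lm_log.fit(X, np.log(y))                     # fit on the log-scale price
print(np.exp(lm_log.predict([[1_750.0]])))   # back-transform predictions to dollars
```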
On the log scale the R² goes up and the root mean squared error goes a little bit lower. Interestingly, we also see that the bias is, in absolute terms, a little higher than before. In other words, a model trained on a log scale, where the prices we fit the model on are put on a log scale, ends up with a slightly higher bias, but the overall R² and root mean squared error improve. By giving up a little bias we get, on average, a much better model, so using a log scale for the prices in this housing data set seems to improve the situation. This does not have to be the case in general; log transformations are most often used for rates of return, most notably in finance, but here they help as well.

In a third scenario for the linear regression, let's now use the improved data but keep only the variables that have a strong or weak correlation with the sale price, so we basically drop all the columns that had an almost whitish color in the correlation heatmap, all the features that seemed to have no correlation with the sale price. When we run this, once on the normal scale and once on the log scale, the R² goes down and the root mean squared error goes up. In other words, by only giving the linear regression model the strongly and weakly correlated features and dropping all the seemingly uncorrelated features, we actually get worse prediction performance. That means that in this case we as humans cannot outsmart the computer: the linear regression model is already better than us at selecting features. And how does a linear regression model select features? Basically, through its beta terms: a beta value close to zero means the linear equation gives almost no weight to a certain feature, and that is how the linear model can effectively drop a feature. One more important point: the model here is just a simple linear regression, without interaction terms and the like, so it could be improved by making it a bit more complex; we will not do that here, and again, if you want to know how, check the Kaggle competition for how to win it.

Instead, we will use another linear model, the so-called lasso. The lasso is a linear regression model, similar to the next one we will look at, which I can already name: the ridge regression. Both the lasso and the ridge constrain the beta terms differently than ordinary linear regression does. Roughly speaking, what the lasso does is set a beta that is close to zero to a hard zero; in other words, for a beta to stay nonzero it has to be significantly different from zero. That is, in rough terms, what lasso and ridge do at the end of the day. The lasso has to be calibrated: it takes a parameter called alpha, and this has to be optimized. We use a so-called grid search for that, so we not only do the k-fold cross-validation but also optimize the alpha value.
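A hypothetical sketch of tuning the lasso's alpha with a grid search; the alpha grid, the scoring choice, and the synthetic data are illustrative and not the notebook's exact setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the improved Ames feature matrix and price vector.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

grid = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": [0.1, 1, 5, 10, 20, 50, 100]},  # candidate alphas
    cv=10,                                                # 10-fold CV per alpha
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)  # the alpha the grid search picks
```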
So first we go ahead and run the grid search to find the best possible alpha; at the end of the day this works similarly to cross-validation, we just repeat it for different alphas and choose the best one at the end. The alphas we use here, let me copy-paste them into their own cell, are these, and the grid search determined that an alpha of 20 gives the best result. Now we use that best alpha in the lasso model and do the cross-validation to get an unbiased estimate. We see that the unbiased estimate on the old, original data is now at least stable: the simple linear regression on this barely cleaned data set, which I call the original data set, resulted in a basically negative R² above, while the lasso now gives at least an okay R² of 0.81. But it is still bad, still much worse than the linear regression on the improved data set up there. What we learn from this is that the lasso, and also the ridge regression, make the linear regression model stable in a sense, but if we do not give the model a good data set with well-generated features, we still get bad results.

Now let's run the lasso on the improved data set, and also on the log scale. The lasso reaches an R² of 0.925, and if we go back up to the plain linear regression, that had an R² a little bit higher; in other words, the lasso is slightly worse for the improved data set than the plain linear model, and this is something we can only find out by trial and error. If we again go ahead with our manually chosen features, the R² and RMSE get even worse. So again the general rule: whenever we manually pre-select the features, we get worse results. We should always give the model all the data, all the columns we have, and then let the model make the selection in an automated way rather than doing it ourselves.

Let's look at the last linear model, the so-called ridge regression. Here we also have a grid search, so again we have to optimize a parameter called alpha. The ridge regression is also able to work with the uncleaned data set, but the resulting R² of 0.85 is still not so good. Let's use our improved data set again: with the improved data and the log scale we get a very good result, and if I manually pre-select some features I get a worse result again. So this is a general rule; as I said before, the learning is: do not try to be smarter than the machine learning algorithm.
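The ridge variant is tuned the same way; another hypothetical sketch under the same assumptions as the lasso example above (synthetic data, illustrative alpha grid).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

ridge_grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [1, 10, 100, 1000]},  # candidate alphas
    cv=10,
)
ridge_grid.fit(X, y)
print(ridge_grid.best_params_, ridge_grid.best_score_)
```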
Now let's look at a different family of algorithms. So far we have looked at three models that are linear; let's now look at a tree-based model, most notably the random forest. I can already tell you that I like the random forest very much, because it is a very flexible model in terms of the kinds of patterns it can learn, and it does not require you to clean the data set to an absolute maximum. For example, instead of one-or-zero dummy variables, a random forest could in principle still work with a yes-or-no variable. Of course I give it the data set with the dummy variables here, but if you need a quick and dirty approach to get a first indication of how good the prediction could be, you can use a random forest; you should always include one, and you do not have to generate all the features as dummies first for it to work.

So let's look at it. The random forest has one downside, and we can already see it when I run it: it takes some time. Here we run the 10-fold cross-validation, and the random forest regression creates 500 or so randomized trees within the same forest; each tree makes a prediction, the collective is then put together into one prediction at the overall level, and training all these individual trees takes time. Now that it has finished, we see that the random forest on the dirty data set, the barely cleaned data set from chapter one, already reaches an R² of almost 0.9. This is something the linear models could not do; they only got up to roughly 0.85 in the best case on the original data, but the random forest is already able to pick up some of the structure. The learning from this is: either you spend a lot of time generating features manually, or you rely on a somewhat more sophisticated algorithm like the random forest. You could not use the plain linear model here, unfortunately, but the random forest spares you some of the manual feature generation work.

Now let's let the random forest run on the normal scale and the log scale for our improved data set; again, this takes some time. One way to speed it up would be to use fewer trees; I think with 50 to 100 trees in the forest we would already get a very similar result. Usually, when you work on a big data set, you start with a forest that does not have many trees, say 100, then increase the number of trees and run the same model over and over again until the additional benefit disappears; then you stop growing the forest and use that number of trees for all the other models you build. So this is a parameter you have to optimize manually as well.

Now let's look at the result: we get a better result with our improved data set, and here the normal scale is a little better than the log scale, though we can actually neglect that difference, it is not a big deal. The linear model, however, was even better. So the story this tells us is that even though the random forest spares us from doing all the feature generation, doing the feature generation manually, to the maximum, together with a linear model still wins, because the house prices seem to be better explainable with linear models. To get the best result, we have to put in the manual work; there is no way around it.

Now let's run the last two cells, just to be complete: this runs the random forest with the manually pre-selected features from chapter two, using only the strongly and weakly correlated features, and we see the output here as well. The result is not as good as giving it all the features, and of course the random forest is also very good at selecting features; it does so implicitly.
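A hypothetical sketch of such a random forest run, using 100 trees instead of ~500 to keep it fast, as suggested above; the data is synthetic, standing in for the Ames matrices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)

# 100 randomized trees, trained in parallel; each tree votes and the forest averages.
forest = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
scores = cross_val_score(forest, X, y, cv=10, scoring="r2")
print(scores.mean())
```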
So again, the learning is: if you have data in a clean format, just give it to the machine learning model and have the model select the features.

Now let's look at the overall results from another angle and compare the two most common error measures. First the root mean squared error, across all the runs. Looking at the original, uncleaned data set and only the root mean squared error, we see that the random forest is the best model; the raw linear model is not stable and basically does not work, and only with a ridge or lasso regression can we make it work. So if we do not want to process our data, if we do not want to put in the manual cleaning and feature generation work, a random forest may be the best model in terms of root mean squared error. If we use our improved data set, the one we spent so much time cleaning, we see that we can get a much better result with a linear model rather than a random forest; however, we had to spend all that time cleaning, so this is basically the improvement we buy with the manual work. The logarithmic transformation gives an even better result, so in this situation the best approach is probably to put in the manual work, use a logarithmic transformation of the price, and then use some sort of linear model. Our pre-selected feature set, again, lands somewhere in between; so once we have a clean data set, let's not try to be smarter than the machine learning algorithm.

Finally, let's compare all of this with the R² measure. With R² we get the same ordering: for the original, unclean data the random forest would be best and explains roughly 90 percent of the variation in the sale price; for the improved data set with the logarithmic transformation the ridge regression is the best model; and the manually pre-selected data set is somewhere in the middle. So that is the big learning.
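One possible way, not necessarily what the notebook does, to put the entries collected in the results dictionary side by side is to turn it into a DataFrame; the numbers below are placeholders, not the actual measurements.

```python
import pandas as pd

results = {  # placeholder values for illustration only
    "linear, original": {"r2": -1.0, "rmse": 999_999},
    "linear, improved": {"r2": 0.92, "rmse": 25_000},
    "ridge, improved, log": {"r2": 0.93, "rmse": 23_000},
    "random forest, original": {"r2": 0.89, "rmse": 30_000},
}

# Rows = model runs, columns = error measures, sorted by RMSE.
print(pd.DataFrame(results).T.sort_values("rmse"))
```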
Let me summarize what we looked at in this case study. First we saw how to open a data set and look at it at a very high level; remember that I mentally sorted all the individual features into the big groups continuous, discrete, nominal, and ordinal. This is something you should always do, because variables of these four types require different treatment later on. We did some very rough cleaning by getting rid of rows that were obviously missing data and of columns that were barely filled in, and the visualization was a nice help there. Then we realized that this was not enough: in chapter three, where we did all the manual work, we saw that some of the fields were not technically empty but contained "not available", so they were effectively empty even though a first inspection suggested otherwise, and we had to spend real time to realize this and automate it.

The correlations I do not want to dismiss: even though they were not helpful in choosing a better selection of features for the prediction, they are still worthwhile to compute in the beginning to get a rough idea of which features could work. If we are time-constrained in a real-world scenario and cannot go over all the features and put in manual work on each one, then it may be a worthwhile idea to start with the features that have a strong correlation with the sale price, put in the manual work we did on those features first, and see how far we get. If we are happy with our predictions, we keep it; only if we are not happy with the prediction results do we gradually include more of the features that are seemingly unrelated, or uncorrelated, to the sale price. In this light we can also explain some of the work we did in chapter three, in the feature generation part: for features that are seemingly uncorrelated with the sale price, we discovered patterns the computer would not discover, by lumping different categories together into one, by creating one-or-zero variables out of nominal variables, or by deriving new variables from existing ones. That is manual work that a computer, these days, cannot yet do.

Finally, we ended up with what is basically the easiest part of machine learning, which is just running the models; this is the part that requires the least amount of manual work. We learned that it does not pay off to try to be smarter and do manual pre-selection; automated feature selection is the way to go. And we saw that in this case it is clearly worthwhile to use a linear model, and a linear model requires clean data, so to get the best possible result you have to do the manual work; there is no way around it.

What we did not do in this case study is use any deep learning methods, and that makes sense: to use neural networks and deep learning you need a much bigger data set, with the number of samples at least in the tens of thousands, maybe hundreds of thousands, and we only had roughly 2,300 rows here, so deep learning does not make any sense for this sample size. If I had more time and wanted better results, I would probably try some other machine learning models; in particular, I would stick to linear models and try to get some interaction terms in, say a quadratic linear regression, something like that. That is where I would spend more time to see if I could get better results, but we are not doing that here.

Okay, so this is the case study. I hope you liked it. Again, here is the link, github.com/webartifex/ames-housing, and this video will also be made available. If you have any comments, I would appreciate receiving them; for example, you could open an issue and raise a question like "why did you do it this way, couldn't it be done better?", or, if you come up with a better solution to this problem, you could clone my project, make some improvements, and open a pull request to merge your improvements into the project. That would be a nice contribution. Other than that, I hope you liked it and learned a whole lot, and I will see you soon on the channel.
Info
Channel: Alexander Hess - Pythonista
Views: 2,041
Rating: 5 out of 5
Id: VSeGseoJsNA
Length: 103min 49sec (6229 seconds)
Published: Mon Jun 29 2020