Beginner Kaggle Data Science Project Walk-Through (Housing Prices) | With ChatGPT

Captions
Hello, it's been a while, but welcome to another Kaggle walkthrough of one of the more basic data sets. Today we're going to be analyzing the Housing Prices competition on Kaggle. This is a great way to learn how a data scientist like myself might approach a problem like this. It should feel very similar to the analysis I did on the Titanic data set, where I walk you through how I approach an analysis from starting out with a new data set all the way to submitting results. This one is a little more advanced and focuses on regression techniques rather than classification. Here I cover basic data exploration, creating a data pipeline, feature engineering, model building, parameter tuning, and ensembling. You can follow along by cloning the notebook I use on Kaggle with the link in the description below. If you find this useful, you can support my work by liking this video and upvoting the Kaggle notebook. I have quite a few other free resources and notebooks for learning on my Kaggle profile and on GitHub, so if you like this, definitely check those out as well.

What's special about this analysis is that I also talk about how I would use new tools like ChatGPT to assist me. Throughout the video I point out where I either used AI to help me on the project or where I could have used it to facilitate my work. Now let's jump in.

This is the project we're going to be analyzing today; I'll share a link to the data and everything in the video description below. Something I like to do before I start any analysis or Kaggle competition is read everything I can: the overview, the evaluation, some tutorials, and the data itself. I actually didn't go through many tutorials this time, because I thought it would be better for you all to see a fresh take on this, but after the fact I reviewed other work and made some revisions. If you're a more advanced data scientist, that's a really good way to learn: do your analysis first, then go back and see what could be improved. I did make one major change that really helped my performance by looking at other people's code, and of course I'll credit them when we get to that point in the video.

From the description: we can use Python or R, and we want to predict housing prices based on a bunch of different features. We're going to be evaluated by root mean squared error, so when we're doing this analysis we probably want to use RMSE as our measure of merit for evaluating all of our models. I also really like to look at the data itself. We're going to do essentially all of our analysis on the train set, then predict the test set and upload the results. There's also a data description that covers all of the features in more detail, which is really important, and a sample submission file that shows the format we should submit our results in. Something I noticed immediately about this data set is that there are a lot of different fields; we can look at the shape of the data once we start the analysis.
Something that is important to me as a data scientist is using my time efficiently. If I wanted this analysis to take a really long time and, in theory, be the best possible analysis I could do, I would go through each of these features individually and think about them carefully. But a lot of the time, as data scientists, we're under time constraints, so we might not have the time to go through all of the features and make the best judgment about each one. So I looked at a couple that I thought were particularly interesting and used my exploration to inform which variables I might be able to manipulate. I also don't have a lot of subject-area expertise in real estate and housing; I probably know about as much as the average person who is looking to buy a home but hasn't purchased one yet. Those are things that I, as someone analyzing this data, would want to understand better. If I were doing this seriously for work, I might read a bunch of articles to understand what a half bath versus a full bath is, or what people in the industry say about garages versus enclosed porches, or whether a pool is worth it given the maintenance. We'll also see that this data set is relatively small; I think there are only around 1,400 to 1,500 properties, so maybe I would try to pull in additional outside data to improve the quality of the analysis.

The data description is really useful too: for each of the categorical features it explains what the different codes mean. This is important because when we read in codes like 20, 30, or 40, the programs we're using interpret them as numeric data rather than categorical, so we need to adjust them to be categorical so they have the correct meaning for our models. I would go through and briefly explore all of these things; just reading through and having a high-level understanding of what types of data points are in the data and what each column might mean is pretty important. Since we're talking about how I would use AI tools: when I started this analysis, I wrote a prompt in ChatGPT saying, essentially, "I'm going to read in a bunch of columns and details about those columns; keep that in memory in case we want to analyze this data." So now the GPT-4 session I'm working with has some understanding of the analysis I'm doing, and it can help me make more informed decisions or streamline what I'm working on.

All right, let's jump into the notebook. Before I get started, I always like to give myself a task list: I write down, usually in comments, what I expect to do in the analysis. In this one I wrote down seven things: basic data cleaning and feature exploration, a more traditional exploratory data analysis, some basic data engineering, model experimentation and parameter tuning, feature engineering, ensembling, and then submission.
Within each of those, I'll write comments saying, in effect, "I need to do these five things to feel like I've completed this task." I create that structure before I start, so that when I actually sit down to do the analysis I have a task list and can just knock out those items, rather than getting to a point, thinking about what I have to do, and then starting. That saves me a lot of time and helps me structure my analyses.

In this video I also want to emphasize why watching is useful rather than just reading the notebook. I talk a lot about how I approach the problem, like creating the task list, and about how I consider using AI tools like ChatGPT to work on and solve a problem like this. This has changed how I work: the way I code now and the way I'm approaching this problem is very different from before these tools were available, and I think they're super helpful both for getting work done and creating value through data, and for learning. So I'm going to integrate some of that into this video and point out where I used ChatGPT in this analysis. I'm also going to talk about why I made certain design decisions. Data science is open-ended: a hundred data scientists could do this analysis a hundred different ways, and some of the choices I made might not be right or wrong, just different from other people's, so I like to acknowledge that ambiguity. Finally, at the end I'll talk about ways you can improve on this analysis. My results were fine, somewhere around the 60th or 70th percentile, but I'll cover ways you could improve them if you take this, make some changes to the code, and iterate a little more, and I fully expect you to do that. That's a legitimate project. It might not be worth sharing in your portfolio, since this is a pretty standard project, but it can absolutely help you learn.

Let's start working through the code. The first thing we do is import the relevant packages: numpy and pandas for data manipulation, plotly graph objects and plotly express for data visualization, scipy for some normalization work I'll talk about later, and IPython display to make the output more readable. I absolutely used ChatGPT for the IPython display code; I think that's a perfect use case, and I'll touch on it more in a second. Some people like to put all the packages they use up front and load them at once, which is totally fine, but for the sake of teaching I think it makes more sense to keep most packages next to the code that uses them, so it's less confusing what we're reading in. That's just the convention I'll use here; you don't have to do that in your own work, and it's probably better to put everything at the top, especially for creating requirements files. Now let's go into some of the more traditional code.
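As a minimal sketch of the import block described above (the exact aliases are an assumption, but they follow common convention):

```python
# Core packages for data manipulation
import numpy as np
import pandas as pd

# Plotly for visualization (graph objects and the higher-level express API)
import plotly.graph_objects as go
import plotly.express as px

# SciPy for the normality / skew checks discussed later
from scipy import stats

# IPython display helpers for rendering scrollable HTML tables
from IPython.display import display, HTML
```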
In our data exploration we're going to import the data and look at some summary statistics. We already looked through the columns and some of the details, but we want a better idea of the shape of the data, whether there are outliers, whether there are null values, those types of things. We're also going to explore the null values and ask whether some columns should be excluded because they're almost entirely null, or whether it would be useful to include them.

The first thing we do is read in the data. To get the path on Kaggle there are a couple of ways: the easiest is to open the data file in the sidebar, copy the path, and paste it in, but you can also run the snippet at the top of each new Kaggle notebook that prints the system paths to the input files. I'm only going to run the first few cells here, because actually training the models takes quite a bit of time. The next bit of code came directly from ChatGPT: it gives us scrollable tables rather than output that gets truncated or runs all the way down the page. I might as well show you what that means; it's not particularly important for the analysis, but you can see how we can scroll through the table, whereas if we just printed the summary stats it would run down the whole notebook and be less convenient to view. I thought that was a useful design decision and, again, a perfect use case for ChatGPT. It isn't code I'm going to write all the time, it's particularly useful for teaching rather than for my actual work, and I don't want to occupy my brain with it, so I had ChatGPT write it. It's useful, but it isn't an integral part of the analysis by any stretch.

What we're doing here is taking the numerical features and transposing them, so the rows become the describe statistics (count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile, and max) laid out against each numerical column. The code selects all numerical features, transposes the output of the describe function, and passes it to the display helper so we can scroll through and view it that way. Then we can start exploring the individual data points: what is the average lot area, how big is its standard deviation (it's really big here), and what does that mean for us? If I were doing this for work I'd want to understand the majority of these columns, but for this analysis that would be pretty time consuming; there are a lot of different columns.
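A minimal sketch of this step, continuing from the imports above. The input paths assume the competition's default Kaggle layout, and `scrollable` is a hypothetical helper along the lines of the ChatGPT-generated display code described:

```python
# Read the competition data (paths assume the standard Kaggle input directory)
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

def scrollable(df, height=300):
    """Render a DataFrame inside a fixed-height, scrollable HTML div."""
    display(HTML(f'<div style="height:{height}px; overflow:auto;">{df.to_html()}</div>'))

# Summary statistics for the numeric columns, transposed so each feature is a row
numeric_summary = train.select_dtypes(include=np.number).describe().T
scrollable(numeric_summary)

print(train.shape)  # (rows, columns)
```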
If we look at the shape of this data, we can see there are 1,460 rows and 81 columns, which means 81 different features. That's a lot: you really could dig into each one, and it might take an hour to fully understand each. If you're trying to understand this from a work perspective or develop subject-area expertise, it could be worth spending that time; for a project like this it's probably not worth going that deep, but it is useful to think about how we approach these kinds of problems. Something we might also notice is that YearBuilt, for example, is a continuous variable, but do we necessarily want it to be continuous? The same goes for YrSold. There's so much fluctuation in the housing market, especially around the 2008 housing bubble, that you might actually want each year to be a category rather than a continuous value, because a continuous variable implies some ordered relationship, whereas each individual year might behave almost independently based on market factors. Those are decisions we want to think about when analyzing and looking through this data.

Next we run the same summary for the categorical variables, so we can see all of the categories we have. As we saw before, some fields that should be categories, like the subclass, are currently encoded as numeric, so we'll take note and adjust them in the feature engineering step.

Next we look at null values. We can see the null counts in the data set, and also the percentage of missing values: for example, about 93% of the rows have a null in the Alley column. What does that tell us? Do we want to remove Alley altogether, or maybe turn it into a categorical that is one if there is an alley and zero if there isn't? That might give us more useful information about the data set itself. After that, and this isn't strictly necessary, I looked at a sample of the rows that contain null values. If you have fewer nulls, or columns that aren't almost always null like Alley, this can be a useful way to explore those specific missing values: we select the rows that have at least one null and pass that subset to the scrollable display function we created.

I should also go back and explain some of this code I glossed over. For selecting data types we can choose numeric, like we did above with np.number, or object, which covers the other types in the data frame, usually categoricals. The .T transposes the data frame, so instead of the statistics being laid out in the rows as in my example above, they switch over to the columns. That's something I actually learned from ChatGPT; I didn't know that .T did that.
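A short sketch of the missing-value exploration described above, reusing the hypothetical `scrollable` helper from the previous snippet:

```python
# Count and percentage of missing values per column, sorted descending
null_counts = train.isnull().sum().sort_values(ascending=False)
null_pct = (null_counts / len(train) * 100).round(1)
missing = pd.DataFrame({"missing_count": null_counts, "missing_pct": null_pct})
scrollable(missing[missing["missing_count"] > 0])

# Categorical (object-dtype) summary, transposed the same way as the numeric one
categorical_summary = train.select_dtypes(include="object").describe().T
scrollable(categorical_summary)

# Peek at a sample of rows that contain at least one null value
rows_with_nulls = train[train.isnull().any(axis=1)]
scrollable(rows_with_nulls.sample(10, random_state=0))
```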
It's a little embarrassing, but we learn as we go, and I'm glad I know it now rather than not at all. Next I just printed a list of all the columns; I mostly use that to copy and paste column names so I don't have to worry about misspelling them.

Now, something important: we need to explore the dependent variable in our data set, ask whether it should be normalized, and then go about normalizing it. This is something I added after the fact. I didn't think about it during my first pass, but when I went through other analyses that scored significantly better than mine on the competition, this change made a lot of sense. I'll link the notebook I got it from; it's really good and scores quite high on the leaderboard, so it's worth going through other people's work after the fact. When you're learning, I think it's fine to do things on your own first before consulting other work, but if I'm doing data science professionally, I'm looking at lots of resources and other people's code to get the best results possible. I also thought it was useful for you to see my own approach first, since you might not always have a similar notebook to lean on, but I want to be clear that I added this in after the fact.

What we're doing here is creating two graphs: a histogram and a QQ plot, which show the distribution and the skew of the sale price. As we can see, there's a ton of right skew in the sale price distribution, and in the QQ plot, if there were no skew, the purple dots would lie right on the green line. That suggests we should probably apply a log transform to the target so it fits a more normal distribution, which we'll do a little later.

This plotting code looks really confusing, but honestly it started with me writing one line: using plotly graph objects, taking sale price as x, setting the number of bins, and calling it a histogram. I put that into ChatGPT, said to give it a dark theme, add a trend line, and make the colors green and purple, iterated a couple of times, and it produced these two visualizations. I think this is a really good use case for ChatGPT. There are a lot of plotting libraries out there: matplotlib, seaborn, plotly like we're using here, even ggplot for Python now, and I want to be able to use all of them without having to learn everything about each individual tool. Using ChatGPT or similar AI tools to create visuals is very effective and saves a ton of time, and the nice thing about visualizations is that you can look at the result and confirm it matches what you expect, so there's some validation built in. As you advance you probably should learn one of these packages well so you can debug more effectively.
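A bare-bones version of the two target plots described above, before any ChatGPT styling (SalePrice is the target column from the competition's data description; the styling is illustrative):

```python
# Histogram of the raw target: heavy right skew shows up as a long tail
fig = go.Figure(go.Histogram(x=train["SalePrice"], nbinsx=50))
fig.update_layout(template="plotly_dark", title="SalePrice distribution")
fig.show()

# QQ plot against a normal distribution; points bending away from the fitted
# line indicate skew
(osm, osr), (slope, intercept, r) = stats.probplot(train["SalePrice"], dist="norm")
qq = go.Figure()
qq.add_trace(go.Scatter(x=osm, y=osr, mode="markers", name="quantiles"))
qq.add_trace(go.Scatter(x=osm, y=slope * osm + intercept, mode="lines", name="normal fit"))
qq.update_layout(template="plotly_dark", title="SalePrice QQ plot")
qq.show()
```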
But in the learning process, being able to create powerful visualizations like this without having to know that much can accelerate how you learn and help you create impact quickly. That's different from how I would have taught this five months ago, before these tools existed, but I think this evolution is really useful and practical.

Next, let's decide what questions we want to ask of the data. This is an exercise I do for every project: I sit down and say, I'm trying to figure something out about price, and with all this information at hand, what would I like to know and understand better? After I had read in the data, I actually asked ChatGPT what questions we might want to answer; it came up with about half of these and I added a couple of my own. How does the distribution of dwelling types relate to sale prices? Does zoning impact sale price? Does the street and alley access type affect sale price? What's the average sale price by property type? Is there a correlation between property age and sale price? Is there a correlation between living area and sale price? And does price change year to year? I created that list so we can go through and answer each question. I built these charts the same way as the earlier visualizations: I wrote the very basic code, essentially the figure call with the data going into the bar chart, and let ChatGPT improve the quality of the graph, updating the theme, adding labels, and so on.

We can see that single-family homes are the most common, TwnhsE, which is townhouse end units, is the second most common, duplexes are third, then regular townhouses (not end units), and 2FmCon, a two-family conversion originally built as a one-family dwelling, is generally the least common. Then we wanted to know how price related to that. Something really interesting: single-family homes and townhouse end units are priced pretty comparably, but between regular townhouses and end units there's a huge premium on the end units. So if I were interested in investing in real estate, maybe when a new development goes in all the townhouses are the same price, but that end unit might be worth more on the resale market. That tells me something useful.

Next we wanted to see how zoning impacted sale price. Again, I wrote the basic code: we group by zoning and use plotly express to create a graph of the average sale price by zone. I figured I'd use plotly express instead of graph objects to show a second tool, and again I had ChatGPT handle the theming and, for example, the dollar-sign formatting, which would have been a pain to do by hand but really improved the quality of the visualization. So let's look at what the different zoning codes actually are.
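A minimal sketch of the zoning chart described above, assuming the MSZoning and SalePrice column names from the competition's data description:

```python
# Average sale price by zoning class, as a Plotly Express bar chart
zoning_price = (train.groupby("MSZoning")["SalePrice"]
                     .mean()
                     .reset_index()
                     .sort_values("SalePrice", ascending=False))

fig = px.bar(zoning_price, x="MSZoning", y="SalePrice",
             title="Average sale price by zoning")
fig.update_layout(template="plotly_dark", yaxis_tickformat="$,.0f")
fig.show()
```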
We have C, which is commercial; FV, which is floating village residential (I don't know exactly what that is); RH, residential high density; RL, residential low density; and RM, residential medium density. It looks like there wasn't enough data for some of the other zoning types. FV, floating village residential, had the highest average price. We should probably also look at the distribution of these, how many data points each one has, but this is pretty interesting: from an investing or buying perspective, zoning is probably going to be relevant in our model.

Moving on, we wanted to know whether street and alley access type affect sale price. We group by street and look at the average sale price, do the same thing by alley type, and create two graphs. I used the same process I described before for all of these charts. I could walk through what every element of every graph does, but in a future where we have AI assistants, I don't know that it's worth it; I think we'll be doing a lot more of this iterative coding, looking at the output and giving feedback. I'd rather you be equipped to use any of the visualization tools, even through an AI interface. If there's real interest in me going into the details of the visualization libraries, let me know in the comments, but for now I'll keep moving. So we have the average sale price by street: as you'd imagine, paved streets are worth more, no surprise there, and the same goes for alleys.

Next we have the average sale price by property shape. I did myself a favor this time and labeled what the codes mean so I don't have to keep going back and forth: IR1 is slightly irregular, IR2 is moderately irregular, and IR3 is completely irregular. Interestingly, the moderately irregular lots generally fetch more on the open market. Then there's land contour, which is the flatness of the property: banked means a quick and significant rise from street grade to the building, hillside means a significant slope from side to side, low is a depression, and level is flat. Hillside properties seem to have the highest value, then low. I guess that's not super surprising, since you'd want a view, but it's interesting that the data supports it, and if I were a realtor I'd imagine I'd find this fairly useful.

Let's keep going. The next one was pretty obvious, but I was interested, and we create a new variable here that's a little different: property age, which is just the year sold minus the year built, since we want how old the house was when it sold. We use a scatter plot rather than a bar chart so we can see all of the data, and there's generally a downward trend between age and price: the older houses generally sold for less than the newer houses.
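A sketch of the property-age and living-area views described here (column names follow the competition's data description; the OLS trend line needs the statsmodels package, which is an assumption about the environment):

```python
# Derived feature: age of the house at the time of sale, plotted against price
ages = train.assign(PropertyAge=train["YrSold"] - train["YearBuilt"])
fig = px.scatter(ages, x="PropertyAge", y="SalePrice", opacity=0.5,
                 title="Sale price vs. property age")
fig.update_layout(template="plotly_dark")
fig.show()

# Above-ground living area vs. price, with an OLS trend line (requires statsmodels)
fig = px.scatter(train, x="GrLivArea", y="SalePrice", trendline="ols",
                 title="Sale price vs. living area")
fig.update_layout(template="plotly_dark")
fig.show()
```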
The idea of an antique effect, where we'd see a big spike in price after a certain age, isn't present here. We also wanted to look at something as simple as living area against price, which is a much clearer relationship; maybe I'll have ChatGPT go back and add a trend line, which is something I can do easily with these tools but would probably take me 30 minutes to figure out on my own if I hadn't done it before. To me that's an interesting one: living area is probably going to be meaningful in this data set. The last thing I wanted was a slightly different graph: a box plot of price over the years. The years in the data set are 2006 through 2010, and I believe 2008 was when the housing bubble burst, or maybe it was really this whole period. There's massive variation in 2007 pricing, and prices in 2008, 2009, and 2010 decreased quite a bit. That tells us something about treating the year sold as a continuous variable: it might not be as good an indicator as treating each year as an independent category.

The next thing we want to do is actually clean up this data and make it usable in a model. We did some things like creating the property age variable, and we'll formalize that a bit later, but we also want to potentially normalize the data depending on the models we use, impute the null values, possibly scale the data, and eventually one-hot encode it; we're definitely going to want that last one. In my ML Process course we go into all the different ways to scale, impute, and encode; if you're interested you can check out the course, and all the code and resources for both the ML Process course and the Algorithms course we recently released are completely free on my GitHub. Obviously I think the value is there if you purchase the courses, since we explain everything, but there's a lot of free material that can give you really solid education and insight into this kind of thing.

What we're doing here is creating transformers and pipelines. A pipeline lets you chain as many steps as you want: the data goes into the imputer, then into the scaler, and then gets output. It's a really clean way to chain functions together, to create what's known as a data cleaning pipeline, and you could even put your model at the end of the pipeline and run everything together rather than in separate steps. I like creating pipelines for my data cleaning because of the continuity they give you between the training data and the test data.
When we start with our training data, we put it through the pipeline and it outputs the data we use to train our models; when we run the test data, we put it through the exact same pipeline, which essentially guarantees it will work in our model unless we miss something. That's a really good practice for continuity, for making your code scalable and repeatable, and it eliminates a lot of issues.

Here we create a numerical transformer, which acts on all of the numeric data in the data set: we impute the numeric nulls with the average and then scale the data. The categorical transformer works only on our categoricals: we impute null values with a constant, so a null is treated as its own category, and then we one-hot encode, which means creating dummy variables. Say a feature has three categories: hot, cold, and room temperature. One-hot encoding gives us columns for hot and cold, and if both are zero we know the value was room temperature; with a separate category for missing values, a row that's zero in all of those columns is accounted for in the model as a null.

Next we split the columns into categorical and numeric lists and drop our dependent variable. Then we create the preprocessor, which is a ColumnTransformer. A ColumnTransformer lets you choose which columns each transformer applies to: the numerical transformer only adjusts the numeric columns, the categorical transformer only acts on the categorical columns, and the passthrough option means any remaining columns go in and come out unchanged rather than being dropped. We then add one more pipeline step, which technically isn't necessary. Usually you would just add your trained model to the end of the pipeline, run it, and get a result, so everything is packaged into one object; in theory that's how you would productionize it, maybe with fancier tools. I like to preprocess the data first, test all my models, and then maybe at the end put everything into the pipeline.

Now we create our X and y values: we take the data set, drop the sale price, and take the log of the sale price, our dependent or target variable. Remember, that's because we found it was skewed; the log should normalize the dependent variable, and from doing this problem I can say it gives better results. We then take X and fit-transform it with the preprocessor we built, so our data is now completely cleaned through imputing, scaling, and one-hot encoding.
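A minimal sketch of the preprocessing pipeline described above, continuing from the earlier snippets (step names and imputation choices follow the description; the dense one-hot output is an assumption so PCA can be added later, and the `sparse_output` argument needs scikit-learn 1.2 or newer):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numeric branch: mean-impute, then scale
numerical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: constant-impute (nulls become their own category), then one-hot encode
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

numeric_cols = train.drop(columns=["SalePrice"]).select_dtypes(include=np.number).columns
categorical_cols = train.select_dtypes(include="object").columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="passthrough",
)

X = train.drop(columns=["SalePrice"])
y = np.log(train["SalePrice"])            # log-transform the skewed target

X_preprocessed = preprocessor.fit_transform(X)
```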
Later we're going to do some more data cleaning, manipulation, and feature engineering, and in theory I could go back and make adjustments to this pipeline, but for the sake of teaching I'm going to copy the pipeline and iterate on it separately so you can see where the changes are made.

Next we start our model training. We have linear regression, random forest, and XGBoost, and a little later we'll also do an MLP, which is essentially a neural net, but run on the CPU; we're not doing anything advanced with TensorFlow or Keras, and with a data set this small I think one of the more advanced models is probably overkill anyway, but we'll see how it compares to the other results. For the data preprocessing step I tried using ChatGPT and didn't have a whole lot of success; it gave me a rough framework, but to get the exact steps I wanted for my data I had to be pretty hands-on. I'll link the scikit-learn documentation on pipelines and column transformers if you really want to dive into that. Admittedly this isn't in my ML Process course yet, but I'm working on that segment and hope to have it in there within a week or two of this video coming out, because after doing this project I realized how important these are and that I haven't been using them as much as I should.

I thought it would be useful to compare a linear model against random forest and XGBoost; tree-based models are generally more flexible and have fewer constraints. I feel like I'm plugging my courses a lot, but in the ML Algorithms course we explain all of these models in detail: when linear regression is a good idea, why random forest or XGBoost makes sense for a problem, and how to train each of them. In the GitHub I also have fairly exhaustive parameter tuning for all of these, so you can copy and paste the parameters if you want to use these models. Here I created parameter grids for each model, used three-fold cross-validation, and ran a grid search, which tries all the permutations of those parameters for each model and returns the best results. I'm not going to run it now because it takes a while, but XGBoost produced the best results with these parameters. Realistically, I also used ChatGPT to create this code: I told it which models I was using, asked it to create the parameter grids, train them, and report the best scores and parameters. I also think this part of data science, the optimizing, model fitting, and tuning, is something that will go out of vogue over time, not become obsolete, but it's exactly what AI and AutoML tools are good at. This is less artistry and more pure science, so I feel comfortable letting a machine do the majority of it; it saves me a lot of time.
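A sketch of the grid-search setup described above, continuing from the preprocessing snippet. The parameter grids here are illustrative placeholders, not the ones from the video, and XGBoost is assumed to be installed in the environment:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed, y, test_size=0.2, random_state=42)

# Illustrative parameter grids; an empty grid just fits the model once per fold
models = {
    "linear": (LinearRegression(), {}),
    "rf": (RandomForestRegressor(random_state=42),
           {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}),
    "xgb": (XGBRegressor(random_state=42),
            {"n_estimators": [300, 600], "learning_rate": [0.05, 0.1], "max_depth": [3, 5]}),
}

best = {}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, cv=3,
                          scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_
    print(name, -search.best_score_, search.best_params_)
```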
Feature engineering, on the other hand, which we'll talk about later, is something I think is really difficult for machines to do, and it's where I'd encourage people to focus their time. Even data visualization, although you're not necessarily coding the visuals yourself, is a skill worth building, even if it's through telling an AI system what you want, because there's still tremendous power in those visualizations. Parameter tuning feels more utilitarian; there's some artistry in choosing the parameters, but I think that will fade over time as good rules of thumb emerge.

Next we train an MLP regressor. We copy in the data; this model needs scaled inputs, which I made sure of, though we didn't do any extra scaling because it was already scaled in the pipeline. I just reused the training and test splits I had; in theory I could have used fresh variables, but I was feeling a little lazy since the code was already written. We set the random state so we get similar results every time. We were running into some issues with alpha and the learning rate, so I extended the maximum number of iterations to 10,000, set an early stopping parameter so that if the results don't improve for three straight iterations it stops, and initialized the learning rate at a chosen value. Again, the Algorithms course goes into all the parameters, but essentially these are four different architectures I tried: one hidden layer of 10 nodes, two 10-node layers, three 10-node layers, and a single layer of 25 nodes. We try two activation functions, relu and tanh (relu is the most commonly used now), we use the adam optimizer, we set alpha, the regularization parameter, to a few different values, and we try three different learning rates. Then it's the same grid search: fit and look at the best score. In this round it didn't perform that well; I found out later that my alpha values weren't optimal, and when I expanded them we got much better results, but that's something I only found by experimenting.

Something you might have noticed is that the linear regression results were unbelievably bad. That's because we were feeding in so much data and a lot of the variables were highly correlated. I didn't do a correlation plot because there were so many variables it would have been messy, but for example, if you have a large yard you probably also have a lot of house square footage, and you probably have more bathrooms. Because of that multicollinearity we ran into some pretty big issues with linear regression being useful here. So something I wanted to try, to see if it would clean things up, was principal component analysis. PCA is a data manipulation technique that produces a chosen number of new features and tries to decorrelate the data.
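A sketch of the MLP grid search described above, continuing from the earlier snippets. The grid values are illustrative stand-ins for the ranges discussed (small architectures, relu and tanh activations, a few alpha and learning-rate values):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

mlp = MLPRegressor(random_state=42, max_iter=10_000,
                   early_stopping=True, n_iter_no_change=3)

mlp_grid = {
    "hidden_layer_sizes": [(10,), (10, 10), (10, 10, 10), (25,)],
    "activation": ["relu", "tanh"],
    "solver": ["adam"],
    "alpha": [0.0001, 0.001, 0.01],
    "learning_rate_init": [0.001, 0.01, 0.1],
}

mlp_search = GridSearchCV(mlp, mlp_grid, cv=3,
                          scoring="neg_root_mean_squared_error")
mlp_search.fit(X_train, y_train)
print(-mlp_search.best_score_, mlp_search.best_params_)
```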
What we do here is look at how many components it would take to retain roughly 95% of the variance, keep that many, and then feed the linear regression variables that are no longer correlated the way they were before. Admittedly, my explanation and understanding of PCA are probably not the best they could be; I don't use it that often, but for this use case it was practical. Just to hedge myself, I'll put some links about how PCA actually works in the description, and maybe even tag one at the top of the screen. So we add a PCA step to the pipeline: we have the preprocessor and we just add PCA after it, then transform the data to get X with PCA applied.

We then run the exact same models, so I won't re-explain all of that, and see whether the results are any different. The linear regression was way better: around 0.16 now instead of the terrible score before. The random forest and XGBoost were both just a little bit worse, but we fixed the multicollinearity problem for linear regression. When we run the MLP regressor this time, I knew a little better and fixed the alpha values, and we ended up with slightly better results than the roughly 0.23 we saw before.

Next I wanted to see, based on the training we did, how we'd perform when predicting on our y_test. This y_test is not to be confused with the competition's test set, which is only used when we're submitting; the y_test here came from our training data, which we split into additional train and test sets, so it's essentially an extra validation set for evaluating how well the model training went. In theory we could get slightly better results by not splitting, since we'd have more data to train on and we're working with a small data set; that's something I, or you, could improve on, but I thought it was useful as a status check on how we were doing. As we can see, XGBoost without PCA was actually a little better than XGBoost with PCA, while linear regression was dramatically better when it was decorrelated like that. Looking at the MLPs, our little neural network was a little better with PCA. So we're starting to understand something about our data: maybe PCA is better for some models and not for others, and if you wanted to use PCA for only some of the models, you could build a pipeline to do exactly that.

Let's keep going down to where I start the feature engineering. Earlier we talked about how some of the fields were not encoded correctly; there was a column that was numeric when it should have been categorical.
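A minimal sketch of adding PCA after the preprocessor, continuing from the pipeline above (passing a float to `n_components` keeps enough components to explain that fraction of the variance; PCA needs the dense one-hot output set up earlier):

```python
from sklearn.decomposition import PCA

# Preprocess, then project onto the components explaining ~95% of the variance
pca_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("pca", PCA(n_components=0.95)),
])

X_pca = pca_pipeline.fit_transform(X)
print(X_pca.shape)  # number of retained components
```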
So we can go through and start correcting some of that data. I wanted to explore some of these different data points, and there were a couple I thought were particularly useful to transform. Again, I asked ChatGPT for help with some of this and it didn't give me great results, so I went through and built these custom variables myself. The first one we already saw: property age. Then total square footage, which is the basement square footage plus the first and second floor square footage; total baths; whether the house has been remodeled; whether it has a second floor, since we can derive that from other columns; and whether it has a garage, creating another categorical from a numeric. We also convert year sold to a category, month sold to a category, year built to a category, and MSSubClass to a category, since, as you recall, it was numeric before.

The way we do this is with a FunctionTransformer: it takes the function we wrote, which receives the data frame and returns it with all of these new features added. If you want to add more features, which I highly recommend you do, you just put them in this function, add their names to the new categorical or new numeric column lists, and they're automatically integrated into your pipeline. That's one of the beautiful things about creating pipelines: you only have to adjust things in one place, rather than changing code in every feature engineering step. So the categorical variables we're adding are has-remodeled, has-second-floor, and has-garage, and the other new features are property age, total square footage, total baths, and the year sold, month sold, year built, and MSSubClass categories. Because I was curious, I looked up MoSold: it's the month sold, so seasonality in when a house sells might be useful; that one was pretty obvious and I probably should have known it.

Now we essentially rebuild the pipeline. I could have put all of this in the earlier data processing step, but I thought it would be clearer to keep it separate. We append the new columns to the categorical and numeric lists and add the feature engineering transformer before the preprocessor, so if we create another feature engineering step later we don't have to go back and rewrite the preprocessor code; we can just slot it in and it runs in sequence. Again, I cannot stress enough how useful these pipelines are, and I think it's really worth investing time to get better at them and practice them. You could ask ChatGPT to put something into a pipeline, but making sure pipelines work reliably is a pretty useful skill to develop yourself. With that said, we go ahead and train the same models again and see if our feature engineering gets us slightly better results.
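A sketch of the feature-engineering transformer described above, continuing from the earlier snippets. Column names follow the competition's data description, but the exact derivations (for example, using YearRemodAdd to flag remodels) are assumptions of mine, not necessarily the video's formulas:

```python
from sklearn.preprocessing import FunctionTransformer

def add_features(df):
    """Return a copy of the frame with the derived features described in the walkthrough."""
    df = df.copy()
    df["PropertyAge"] = df["YrSold"] - df["YearBuilt"]
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    df["TotalBaths"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                        + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df["HasRemodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)
    df["HasSecondFloor"] = (df["2ndFlrSF"] > 0).astype(int)
    df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
    # Numeric codes that are really categories get string copies
    for col in ["YrSold", "MoSold", "YearBuilt", "MSSubClass"]:
        df[col + "_cat"] = df[col].astype(str)
    return df

feature_engineering = FunctionTransformer(add_features)

# Append the new columns to the column lists, then chain the feature-engineering
# step in front of a rebuilt preprocessor
new_numeric = ["PropertyAge", "TotalSF", "TotalBaths"]
new_categorical = ["HasRemodeled", "HasSecondFloor", "HasGarage",
                   "YrSold_cat", "MoSold_cat", "YearBuilt_cat", "MSSubClass_cat"]

fe_preprocessor = ColumnTransformer([
    ("num", numerical_transformer, list(numeric_cols) + new_numeric),
    ("cat", categorical_transformer, list(categorical_cols) + new_categorical),
])

fe_pipeline = Pipeline([
    ("features", feature_engineering),
    ("preprocess", fe_preprocessor),
])

X_fe = fe_pipeline.fit_transform(X)
```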
I really should have kept a running list of the scores to compare against, but let's see: roughly 0.163, 0.152, and 0.135, against about 0.164 before, so the linear regression is a little worse, the random forest did a little better, and XGBoost was actually a little worse. We're getting varying results, and again I encourage you to experiment with additional features of your own; that will probably help quite a bit. Next we run the same thing for our MLP regressor, the neural net. I changed some of the activation functions here because I was curious whether we could improve, so this isn't an apples-to-apples comparison with the earlier runs: I added a sigmoid activation function, extended the alpha grid all the way up to 100, and changed the initial learning rate to 0.1 instead of 0.001. If I recall correctly, this did a lot better than our previous iterations; I just ran into a lot of errors and long run times because some of the parameter combinations weren't optimal, but we got by far our best MLP results here. I think it was the alpha and maybe the initial learning rate; the solver was also different, SGD, stochastic gradient descent, which isn't exactly cutting edge compared with the adam optimizer, but maybe because the data set is so small, stochastic gradient descent worked significantly better. Who knows. Then we check our results again and see that we did a lot better with our MLP.

Now we want to start getting ready to submit. We read in the competition test set, and we saved ourselves a ton of time by having built the pipeline: we just run the test set through it, and now the data is preprocessed and ready to go into our models, so we can run predictions on the test set and see how we do. We prepare a few different data sets for submission. For the XGBoost submission, we make a prediction on the preprocessed data with the feature-engineered XGBoost model and take the exponent of the result. That converts it from our log-scale prediction back to the original price scale, undoing the normalization we applied to the sale price at the start, so the prediction is usable. Then we get the data set into the required format: it needs an Id column and the sale price prediction, just like the two columns in the sample submission. We write that to a CSV; if we wanted this to be our actual submission we would name it submission.csv. I just wanted to save these results, and I submitted them all independently to see how each did. We do the same thing for random forest, the same for the MLP, and then the same for an ensemble model, where we simply average all of the predictions; that's the most basic ensemble you can do, and we'll see how those results perform.
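A sketch of the submission and averaging steps described above, continuing from the feature-engineering pipeline. It assumes the models are refit on the feature-engineered matrix, as in the video, and the hyperparameters shown are illustrative:

```python
# Run the competition test set through the same fitted pipeline
X_test_fe = fe_pipeline.transform(test)

models_fe = {
    "xgb": XGBRegressor(n_estimators=600, learning_rate=0.05, random_state=42),
    "rf": RandomForestRegressor(n_estimators=300, random_state=42),
}

preds = {}
for name, model in models_fe.items():
    model.fit(X_fe, y)                                  # refit on feature-engineered data
    preds[name] = np.exp(model.predict(X_test_fe))      # undo the log transform
    pd.DataFrame({"Id": test["Id"], "SalePrice": preds[name]}).to_csv(
        f"submission_{name}.csv", index=False)

# Simplest ensemble: average the price-scale predictions of the individual models
avg_preds = np.mean(list(preds.values()), axis=0)
pd.DataFrame({"Id": test["Id"], "SalePrice": avg_preds}).to_csv(
    "submission_average.csv", index=False)
```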
And then the last thing we do is another type of ensembling: stacking. We take the results of our three models and run them through another predictive model to see how that performs. Here we set up three candidate meta-models, another neural net, another linear regression, and another XGBoost, and run the predictions of the three base models we chose through each of them to see which produces the best results. I didn't write this code; I had ChatGPT write it, because I wanted it all done at once, but I thought this was the approach that made the most sense. In the ML Process course I go through the ensembling techniques that are relevant; there are really only the two, averaging and stacking, though there might be one more I'm not thinking of right now. The XGBoost meta-model produced the best stacking results by far, so after all the training we take the best model from the stacking regressor, produce our predictions, put them into the submission data set, and submit the results.

Let me find my submissions real quick. I submitted quite a bit, and it turned out my best result was the plain MLP prediction, without the additional ensembling; that happens sometimes. Looking at the leaderboard, that put me around 876th out of maybe four or five thousand, so not the best out there, but pretty good.

Let's talk about some ways you can improve on these results. I only used four kinds of models: linear regression, random forest, XGBoost, and a neural net. You could try different modeling approaches, maybe support vector machines, or LightGBM, or some of these other models. You could do more feature engineering, explore different ensembling techniques, or normalize some of the other skewed features beyond the dependent variable. There are a lot of tools at your disposal; heck, you could even ask ChatGPT how you could improve on these things. I think that's a really good learning resource and something I've come to rely on much more heavily in how I code and how I do projects like these. Hopefully this was useful and helps you understand how you might approach a problem like this from a data scientist's perspective using ChatGPT, and how you can continue to expand on this analysis. Thank you so much for watching, and good luck on your data science journey.
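For reference, a minimal sketch of the stacking ensemble described above, continuing from the earlier snippets. The video compares a neural net, linear regression, and XGBoost as meta-models and finds XGBoost best; the base-model settings here are illustrative:

```python
from sklearn.ensemble import StackingRegressor

# Out-of-fold predictions from the base models become inputs to the meta-model
stack = StackingRegressor(
    estimators=[
        ("xgb", XGBRegressor(random_state=42)),
        ("rf", RandomForestRegressor(random_state=42)),
        ("mlp", MLPRegressor(random_state=42, max_iter=10_000)),
    ],
    final_estimator=XGBRegressor(random_state=42),
    cv=3,
)

stack.fit(X_fe, y)
stack_preds = np.exp(stack.predict(X_test_fe))   # back to the price scale
pd.DataFrame({"Id": test["Id"], "SalePrice": stack_preds}).to_csv(
    "submission_stacked.csv", index=False)
```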
Info
Channel: Ken Jee
Views: 41,863
Keywords: Data Science, Ken Jee, Machine Learning, data scientist, data science 2023, data science project, data science project walkthrough, kaggle compeition, kaggle project, data science basic project, kaggle.com, kaggle basics, kaggle submission, data science kaggle, kaggle beginners, data science beginners, kaggle analysis, svc, kaggle, housing prices
Id: NQQ3DRdXAXE
Length: 61min 1sec (3661 seconds)
Published: Tue May 16 2023