Predicting AirBnB prices using Mito and Scikit-learn

Video Statistics and Information

Captions
Hey, I'm Aaron from Mito. I'm going to walk you through an example of how to use Mito in conjunction with some machine learning packages to do an interesting analysis, so let's jump into it.

This is the notebook that we're going to create together. We're going to use Mito to look at the initial data set. We're going to be looking at an Airbnb data set, and we're going to try to predict the prices based off some of the information that we have about each home. We're then going to do some cleaning of the data in Mito, and then we're going to ultimately prepare the data set to go into the machine learning pipeline, so we're going to do some of that in pandas. We're then going to validate the transformations that we've made in Mito again and do a little bit more cleaning, and then finally we'll put the data and the models to the test and see how we did. So without further ado, let's jump into it.

Like always, the first thing I do is import a Mitosheet and then import my data so I can start to get a sense of what the data looks like. So we have quite a large number of rows and 17 columns. Let's go through and figure out which columns we're going to want to use, which ones we have to clean up, and which ones we want to just get rid of. So I'm going to start to do that. I don't think we're going to want the name, so we'll just delete that, and the host ID is kind of the same thing, and I don't want the host name.

Okay, great. So now I've cleaned up this data set. I've removed a few columns, and just so I can continue using this in the rest of my analysis, I'm going to run this code, and now I have this new altered data frame in my notebook.

The next step that we're going to want to do is transform some of these string columns into numeric values. The algorithm that we're going to use to actually do our machine learning requires that we do all of this on indicators, so I'm going to
start to write some code to do that.

Now that we've converted all of our categorical variables to numeric variables, I'm just going to validate that it all worked correctly by looking at the data again in the Mitosheet. Cool. So here we can see the neighborhood group; these are the NaN values that we filled in, and we can see them here. The neighborhood is filled in with values instead of the strings, and all of our columns look like they are numeric.

So the last thing I want to do before doing the machine learning part of this is just to make sure that we don't have any columns that have a huge amount of NaN values, because that's going to throw off our algorithm. So let's just go in and validate that by looking at the summary stats for each column. Okay, so the reviews per month column looks like it has 48,000 NaN values, missing values. If we look at the values tab, we can see, I think if we scroll down to the bottom, we have a lot of unique values, but 21 of them are missing. So let's create a new column again and use our fill NaN formula, and let's just assume that if they're missing a value in the reviews per month column, then they have zero reviews. Let's call this reviews per month cleaned. Now I'm just going to make sure that the other columns look good. [Music]

Cool, so I think our data is ready to go. We're going to now get some of the boilerplate code from scikit-learn and put our data through the wringer. We found this tutorial code on creating a linear regression using scikit-learn on Medium, from the people over at Becoming Human AI. So let's just copy this code and paste it into our analysis here.

Cool. So how this code is going to work is it's training a linear regression model, and it's breaking up our data set into three different splits of that data. It's going to use the training data to fit the model, and then we're going to test the model using this testing data set. We're going to do that three times, and each
time we're going to get a score of how well our model performed. Then we're going to print that score out when we're done, and we can assess the accuracy of the model and figure out how we want to improve it going forward. For this first model, we're just going to look at, let's say, the reviews per month and see if that is a good predictor, and we're going to try to predict the price. I think with that, everything is ready to go. Let's just make sure these are named correctly, and let's give it a run.

Okay, so as we can see, these are the R-squared values of the three tests that we ran. They are quite atrocious. So far we've only used the reviews per month cleaned variable that we created. Let's try just throwing in the rest of the variables that we've set up, all the ones that we have in this Mitosheet here, and see how that performs. [Music]

Okay, cool. So we've added all of our fields to the X variable here, which we're going to use to make our predictions. But since these values are all quite different (we can see here some of them are in the range of a couple hundred, and a lot of them are in the low single digits), we want to standardize these variables to make sure that our algorithm treats them all kind of equally. We can get some more code to do that. I'm going to hop over to the website again to pull that code, and we'll be back in a sec.

Okay, so we're back again. We have all of our variables set, and we've added in these couple of lines here to standardize the training features. What we're doing is using the preprocessing package that we imported from sklearn here, and we are going to use a StandardScaler to scale these values to be more in the range of zero or so, negative one to one. We're going to do that for our training data, then fit the model again, then do the same thing for our testing data, and then again we're going to score our results, and
fingers crossed this will be better than the previous ones. Well, they are better: 0.027 as compared to 0.006. It's quite a big jump, but unfortunately these are still quite horrible results.

That's all we have to go over this week. Unfortunately, the model didn't turn out to be as great a predictor as we would have liked, so if you have ideas about how we can improve the model, leave a comment and we'll try to implement the most promising ones next week. I hope that this video, although it didn't turn out to be the best Airbnb predictor, did show you how you can use Mito as part of a larger Python machine learning workflow. You can use Mito where it's most helpful, to visualize and make quick transformations to your data, and then you can continue using the generated code and the altered data frames that you've created to interact with the rest of the Python ecosystem. Have a great one.
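The notebook code itself is not shown in the captions, so what follows is only a rough sketch of the steps described above: filling missing reviews per month with zero, turning string columns into numbers, and scoring a linear regression over three splits with standardized features. It runs on a small synthetic data frame, not the Airbnb data, and the column names, the category-code encoding, the shuffled `KFold` split, and the helper `three_fold_r2` are all assumptions made for illustration. It also fits the scaler on the training split and reuses it on the test split, which may differ from the tutorial code used in the video.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the listings; every column name and value here is
# an assumption for illustration, not the data used in the video.
rng = np.random.default_rng(0)
n = 300
groups = np.array(["Brooklyn", "Manhattan", "Queens"])
room_types = np.array(["Entire home/apt", "Private room", "Shared room"])
df = pd.DataFrame({
    "neighbourhood_group": groups[rng.integers(0, 3, size=n)],
    "room_type": room_types[rng.integers(0, 3, size=n)],
    "minimum_nights": rng.integers(1, 30, size=n),
    "reviews_per_month": rng.uniform(0.0, 5.0, size=n),
})
df["price"] = 60.0 + 25.0 * rng.random(n) + 40.0 * (df["room_type"] == "Entire home/apt")
# Blank out roughly a fifth of the review rates to mimic missing values.
df.loc[rng.random(n) < 0.2, "reviews_per_month"] = np.nan

# Assume a listing with no recorded rate has zero reviews per month.
df["reviews_per_month_cleaned"] = df["reviews_per_month"].fillna(0)

# Replace each string column with integer category codes (one possible encoding).
for column in ["neighbourhood_group", "room_type"]:
    df[column] = df[column].astype("category").cat.codes


def three_fold_r2(features, target="price"):
    """Hypothetical helper: R-squared of a linear regression on three splits."""
    X = df[features].to_numpy(dtype=float)
    y = df[target].to_numpy(dtype=float)
    scores = []
    for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
        # Fit the scaler on the training rows only, then apply it to both splits.
        scaler = preprocessing.StandardScaler().fit(X[train_idx])
        model = LinearRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
        scores.append(model.score(scaler.transform(X[test_idx]), y[test_idx]))
    return scores


single_feature_scores = three_fold_r2(["reviews_per_month_cleaned"])
all_feature_scores = three_fold_r2(
    ["neighbourhood_group", "room_type", "minimum_nights", "reviews_per_month_cleaned"]
)
print(single_feature_scores)
print(all_feature_scores)
```

Each list holds one R-squared value per split, mirroring the three scores printed in the video. Note that `StandardScaler` centers each feature at zero with unit variance rather than bounding it to a fixed range, so scaled values can fall outside negative one to one.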
Info
Channel: Mito
Views: 105
Rating: 5 out of 5
Id: SCcKSqXPM0s
Length: 9min 9sec (549 seconds)
Published: Tue Sep 14 2021