Price Prediction with Python and Power BI

Captions
Hi, I'm going to walk you through a diamond price prediction model I created with a couple of different machine learning regression models, using a data set of 54,000 diamonds. We'll walk through the code so you can see how the prediction works and which models we chose to use. I piped the results over to Power BI, which is what you're looking at right now, and customized it so we can explore various thresholds using a parameter, with a color option that shows which values fall within the threshold we're looking at.

Let's get started by taking a look at the code in my Jupyter notebook. The first thing I do is load the dependencies: pandas, our data manipulation library, and numpy, which lets us do linear algebra. I've also loaded a random forest regressor, a linear regressor, a lasso regressor, and a k-neighbors regressor; those are all the models I'm going to try, to see which one gives me the best result. I'll use train_test_split to split the data into a training set and a test set, plus some preprocessing libraries like StandardScaler and OneHotEncoder, although I think we'll use pandas get_dummies, which does the same thing as the encoder. From sklearn's metrics I'm using mean_squared_error, and I'll use numpy to turn that into root mean squared error as my measure of model success. Finally, I bring in seaborn and matplotlib for visualization.

On the next line I read in my data set as a CSV into a variable called df, and we can take a quick look at it using the head function. I used the data frame's info method to check whether I had any null values and what the data types were: we have four categorical (string) features, and the rest are numerical. Running df.describe gives the statistical summary metrics for all of the numerical columns. I also printed df.columns; the only reason I do that is so I can copy and paste the column names later when I isolate them, instead of typing them again. Then I drop the "Unnamed: 0" field that sometimes appears when you read in a CSV, passing axis=1 because I want to drop a column, and a quick check of the head shows that the column is gone.

The first thing I wanted to do is use seaborn to see the correlation between the metrics. Because this is a diamond data set, we have cut, color, and clarity, which are what they call the big Cs, the depth of the diamond, the table, which is the flat top part of the diamond, the price, which we're going to use as our target variable, and x, y, and z. The x, y, and z columns are the diamond's dimensions, so they're obviously going to be highly correlated, and we can see that here: I ran df.corr() and put the result in a heat map, where the dimension columns show correlations around 0.95. So I want to prepare the data set and do some feature engineering, because we want to eliminate some of those variables with really high correlation, since they're unneeded.
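Here is a minimal sketch of the loading and inspection steps described above. The file name diamonds.csv is an assumption, and the stray column name "Unnamed: 0" is taken from the narration rather than confirmed on screen:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("diamonds.csv")    # file name assumed
df.info()                           # dtypes and null counts
print(df.describe())                # summary stats for the numeric columns
df = df.drop("Unnamed: 0", axis=1)  # stray index column from the CSV export

# x, y, and z are physical dimensions, so they correlate very highly
# with one another; the heat map makes that visible.
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```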
First I perform a dropna, though I don't think we have any NA values in there, so I think we're fine. Here is where I change the categorical variables into numeric variables, which is necessary for scikit-learn to work; everything has to be numerical. Instead of using OneHotEncoder, I use the get_dummies function, which does the same thing, and I ran that across the whole data frame. We also did a little feature engineering: instead of using x, y, and z, I made a new column called symmetry by dividing x by y, and added that column back in. I saved the result as a new variable called df_trans, since all the categorical variables are now numerical. Then I built my X data set by dropping price, x, y, and z: price because it's our target variable, and x, y, and z because the new engineered symmetry feature replaces them. I isolated y, our target variable, by selecting price with bracket notation, and I saved the X columns in something called features; I don't remember why I did that, but all it does is save the column names.

Now that the whole data set is transformed, I wanted to look at correlation again, so I used another heat map. This is the total data set, not X and y, so we still have the x, y, and z variables; you can see all the highly correlated data there, along with the new symmetry variable. Then I scale the data, because the metrics are all on different scales and we want them on a similar scale. I did that with StandardScaler, fitting and transforming by passing in the X variable. Now that I have my X and y, I did a train/test split to get the training set and the test set.

Now we can start predicting with our models. The first thing I did was create a data frame to hold all the results. Then I got what we call the null prediction, which is just the mean of the target variable y; that's our baseline. The first model I use is the k-neighbors regressor with 7 as the number of neighbors: I fit it on the training data, made a prediction, and saved it. We did the same exact thing for random forest, for a simple linear regression, and for lasso: bring in the model, fit it on the training data, then predict. I saved all of this into the data frame I built at the top. For my measure of success I'm using root mean squared error, which is in the units of the actual y variable, so if a diamond's price is six thousand dollars, we can see how far away from that we are. Looking at the RMSE for these models, k-neighbors and random forest did the best job, well ahead of the null baseline. Although k-neighbors did about the same as random forest, I'll just use the k-neighbors model.
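Here is a sketch of the feature engineering and model comparison under the same assumptions. The column names (cut, color, clarity, depth, table, price, x, y, z) follow the standard diamonds data set, and details like the zero-dimension guard and random_state are mine, not the video's:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

df = pd.read_csv("diamonds.csv").drop("Unnamed: 0", axis=1)  # as above
df = df.dropna()
df = df[(df["x"] > 0) & (df["y"] > 0)]  # my guard: a few rows have zero dims
df["symmetry"] = df["x"] / df["y"]      # engineered feature replacing x, y, z

df_trans = pd.get_dummies(df)           # one-hot encode cut, color, clarity
X = df_trans.drop(["price", "x", "y", "z"], axis=1)
y = df_trans["price"]
features = X.columns                    # keep the column names around

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)

# Null baseline: predict the mean of the training target for every row.
rmse = {"null (mean)": np.sqrt(mean_squared_error(
    y_test, np.full(len(y_test), y_train.mean())))}

models = {
    "knn": KNeighborsRegressor(n_neighbors=7),
    "random forest": RandomForestRegressor(),
    "linear": LinearRegression(),
    "lasso": Lasso(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse[name] = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE, in dollars

print(pd.Series(rmse).sort_values())
```

Because RMSE is in the target's own units, the numbers read directly as dollars of typical prediction error, which is what makes it a convenient measure of success here.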
Then I wanted to visualize what I want this model to look like and what I want to communicate to the end user, so I created a scatter plot using subplots: a scatter of the predictions against the actual values. You can definitely see there's a linear relationship, and the line here, which I changed to red, is just our test data, so we can see what the model looks like. We're going to recreate this in Power BI, but give it a little more interaction.

Here is what I was able to create in Power BI, but first let's see how I put the code we wrote into it to make the predictions. Go over to the query editor via Transform Data. I had to modify the Python code to work here, so I have the diamond data set, and I'll show you the script I put in. It's exactly what we did in the notebook, with one key change: you have to switch everything over to dataset instead of using df as your data frame variable, because that's the variable Power BI uses. I brought in the most important pieces: we decided on the k-neighbors model, so we import that, along with train_test_split and StandardScaler. I dropped the columns to get my X variable again, ran get_dummies on it, isolated my target variable by selecting price, used StandardScaler, performed the train/test split, and fit the model on the training data. Then, instead of predicting just the test set, I ran the prediction on my whole data set, because I know from the Jupyter notebook that the model is good. So I scored all of the diamonds and created a new column called prediction; if I scroll down, you can see the new prediction column, with a prediction for every row. I also created an index, just to be able to identify which diamond is which, and then I closed the query editor.
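Continuing from the previous sketch, the prediction-versus-actual scatter might look like the following; drawing the red line as the identity diagonal is my choice, since the exact line from the video isn't specified:

```python
import matplotlib.pyplot as plt

# Predicted vs. actual prices for the chosen KNN model; points on the
# red diagonal would be perfect predictions.
fig, ax = plt.subplots()
ax.scatter(y_test, models["knn"].predict(X_test), alpha=0.3)
lims = [y_test.min(), y_test.max()]
ax.plot(lims, lims, color="red")
ax.set_xlabel("actual price")
ax.set_ylabel("predicted price")
plt.show()
```

And here is a sketch of how the script might look once adapted for the Power BI query editor. Power BI's Run Python script step exposes the incoming table as a data frame named dataset and loads any data frame left in scope back as a table; everything else repeats the earlier assumptions:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `dataset` is supplied by Power BI; it holds the diamond table.
df_trans = pd.get_dummies(dataset)
X = df_trans.drop(["price", "x", "y", "z"], axis=1)
y = df_trans["price"]

X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn = KNeighborsRegressor(n_neighbors=7).fit(X_train, y_train)

# Score every row, not just the test set, so each diamond gets a prediction.
dataset["prediction"] = knn.predict(X_scaled)
```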
Back in the report we have our predictions and our price (it doesn't matter which way you order the two). I did create the threshold you see here, which allows me to change the error band, so let me walk you through how I did that; it definitely works well if you slice by cut or color, so you can see how everything is affected. Looking at our fields, here's the diamond data set. The first piece is a parameter, made with a what-if parameter, which creates a new table for you and lets you put a parameter in it: mine defaults to zero, goes all the way up to 5,000, and moves in increments of 100. That slider is connected to a measure called error threshold, which is just the selected value of the parameter that was created. The prediction error shown here is just the average of prediction minus price, and the way I did that was with AVERAGEX, which goes down the whole table and averages prediction minus price. Next, we brought in the categorical columns and created filters; what you see is the total sample.

The color change is created using a difference variable. In the visualization I made a new column called difference, which is just the predictions minus the actuals, so I could see the difference in the tooltips, and I created the index column so we have that level of detail. The color change itself is done with conditional formatting driven by a measure (see the DAX sketch below): if the difference, which is just prediction minus actual, is greater than our parameter or less than the negative of the parameter, give me yellow, so we have symmetry around zero. With the parameter at 500, for example, it's looking at what's above 500 in difference and what's below negative 500, and that gives us that column of information. All I had to do then was go over to the visual, click into Data colors, use the fx function, and bring in the field value, which is the color-change DAX measure you just saw. That allows us to dynamically highlight the points based on the parameter value.

I'll definitely put this in a GitHub repository so you can download it, along with the code. If you have any questions, please don't hesitate to ask. Please subscribe, and I hope that added some value. Thank you.
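Here is a hedged DAX sketch of the measures described above. The table name diamonds, the measure names, and the gray fallback color are all assumptions based on the narration, not taken from the video:

```dax
-- Generated by the what-if parameter: a one-column table plus this measure.
Error Threshold Value = SELECTEDVALUE ( 'Error Threshold'[Error Threshold], 0 )

-- Average prediction error across the whole table.
Prediction Error = AVERAGEX ( diamonds, diamonds[prediction] - diamonds[price] )

-- Used as the conditional-formatting field value under Data colors -> fx.
Color Change =
VAR Diff =
    SELECTEDVALUE ( diamonds[prediction] ) - SELECTEDVALUE ( diamonds[price] )
RETURN
    IF (
        Diff > [Error Threshold Value]
            || Diff < - [Error Threshold Value],
        "Yellow",
        "Gray"
    )
```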
Info
Channel: Absent Data
Views: 26,902
Keywords: machine learning, prediction, python, power bi, power bi machine learning, python and power bi, power bi tutorial, prediction power bi, python tutorial, pandas tutorial
Id: IP76UJ4nZ70
Length: 19min 10sec (1150 seconds)
Published: Sat Oct 24 2020