Using XGBoost for Time Series Forecasting in Python | XGBoost for Stock Price Prediction Tutorial

Video Statistics and Information

Captions
Hi there, welcome back. In this video I want to show you how we can use XGBoost for time series prediction.

To understand XGBoost, we first need to understand what boosting is. Boosting is a technique for building ensemble models: it fits a series of models, fitting each successive model to minimize the errors of the previous ones. There are different variants of boosting, one of them being gradient boosting, and that is exactly what XGBoost is: an optimized, distributed gradient boosting library designed to be efficient and scalable, implementing machine learning algorithms under the gradient boosting framework. Seen from the perspective of the algorithms it uses, XGBoost is an ensemble of decision trees in which each new tree corrects the errors of the trees already in the model, and trees are added until no further improvement can be made.

Although XGBoost is usually used to fit tabular data, there is no problem fitting time series data. The only thing we need to do is evaluate the model a little differently: we use walk-forward validation instead of k-fold, because k-fold can give biased results on time series. If you're not familiar with walk-forward validation, don't worry, I'll explain everything further along.

Before I start, I just want to thank you for being here, for watching this video, and for subscribing to our channel, because we really want to help you grow in the data science and machine learning space. All of the videos I make are practical in nature, so that you can have an applied edge over people who focus only on theory, because in real life we need applied data science rather than theoretical aspects that cannot be put into practice. So again, thank you for visiting this channel; now let's see what we can do about this XGBoost problem.

The first thing we need to do is install XGBoost, which is very easy with pip or pipenv. If we look at the installation guide in the documentation, that is exactly what we find: pip install xgboost. For our particular problem we also need scikit-learn, because we're going to use the scikit-learn API for our regressor, so make sure you have scikit-learn installed as well.

Let's first load the libraries and the data. For this problem we're using Microsoft hourly data for the past year. If we check the head of the data, we see the open, high, low, close, volume, average, and bar count columns, but we're only going to use the close, so this is a univariate time series problem. Checking the head again, we now have only the close prices.

Next we need to transform this univariate problem into a supervised learning problem. We do that by setting the target to be the next hour's close price: we create a target column by shifting the close prices by minus one. After running this we drop the nulls, because the shift leaves a null at the end of the series. Let's look at the head now.
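A minimal sketch of the preparation steps so far. The file name msft_hourly.csv and the read_csv options are assumptions for illustration; keeping only the close column and building the target with shift(-1) is what the video describes.

```python
# pip install xgboost scikit-learn

import pandas as pd

# Assumed file name and layout: one year of hourly Microsoft bars with
# open/high/low/close/volume/average/barCount columns.
df = pd.read_csv("msft_hourly.csv", index_col=0, parse_dates=True)

# Keep only the close price -> univariate series.
df = df[["close"]].copy()

# Supervised target: the next hour's close price.
df["target"] = df["close"].shift(-1)

# The last row has no next hour, so drop the resulting null.
df = df.dropna()
print(df.head())
```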
You can see that for the close of 132 the target is 134.75: we're using 132 to try to predict the next hour's close price. The same applies to the next row, where 134.75 is used to try to predict 133.88.

Now we need to split the dataset into a train set and a test set, which we can do very easily with a small helper method. We take the values, specify a cutoff point as the percentage of the data we want to hold out for the test set, and return the train set up to that cutoff point and the test set from the cutoff point to the end. Splitting our data frame with a 20% test set, the original data frame has roughly 1,700 rows, the train set around 1,400, and only 354 hours end up in the test set. So that is our train/test split, which is very easy to create.

To make a prediction we're going to use the XGBRegressor class, XGBoost's implementation of the scikit-learn API. In the documentation for the Python package, under the Scikit-Learn API section, we find the XGBRegressor class and every parameter we can pass to it. There are a lot of parameters, and I'm not going to walk you through them because it's not relevant for this particular video, but it is important for you to go through them and understand what each one does, so that later, when you apply XGBoost to your own problem, you will know whether the default parameters are suited to it.

Going back, we take the train set and a test input row as inputs, fit a model, and make a prediction. First we get X and y: X holds our feature and y holds the target. Then we import XGBRegressor from xgboost and create the model using the reg:squarederror objective, which is regression with squared loss. To see where these objectives come from, check the API reference: go to the XGBoost parameters page, then to the learning task parameters, and there you have all of the objectives you can use. The default is already regression with squared loss; I only specified it explicitly because I wanted to show you where to find these learning task parameters for your own problem. For example, if you wanted to fit a logistic regression you would use binary:logistic, and for a multi-class classification problem you could use multi:softmax or multi:softprob. It's important to understand where to find these things in the documentation, because you will face a variety of problems, and the whole point is to show you how easy it is to find everything if you know where to look.

Now we can fit the model: we create the XGBRegressor with the squared-error objective and 1,000 estimators and call fit.
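A sketch of the split and the first fit, assuming df holds the two columns [close, target] from the snippet above. The helper below follows the cutoff-by-percentage split the narration describes (it is not scikit-learn's train_test_split):

```python
import numpy as np
from xgboost import XGBRegressor

def train_test_split(data, perc):
    """Split a 2-D array at a cutoff index; the last `perc` fraction is the test set."""
    cutoff = int(len(data) * (1 - perc))
    return data[:cutoff], data[cutoff:]

data = df.values                       # columns: [close, target]
train, test = train_test_split(data, perc=0.2)

# Everything but the last column is a feature; the last column is the target.
X_train, y_train = train[:, :-1], train[:, -1]

model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
model.fit(X_train, y_train)

# One-step prediction for the first row of the test set.
test_x, test_y = test[0, :-1], test[0, -1]
pred = model.predict(np.array([test_x]))[0]
print(f"predicted {pred:.2f} vs actual {test_y:.2f}")
```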
Fitting prints all of the parameters that are set by default, so we can see, for example, that we're using a learning rate of 0.3. Let's try to predict one step into the future: the first item in the test set has a feature value of 184, and we want to predict 183. Using 184 with the model we just fit, our prediction is 184.65, which is nowhere near the actual value of 183.74. We didn't even get the direction right, and the magnitude is way off as well. But of course we only applied this to one sample, and we still have the rest of the test records to predict, so later on we'll see what the root mean squared error is and whether the model actually performs well or not.

All we need to do now is create a predict method so we can predict one sample at a time. Everything you've seen so far is simply wrapped into a method: we take the train set, split it into X and y, fit the model, and predict the single value that is given as a parameter. Checking the result of this method, we get the same prediction as before, because we just predicted the same value.

Now that we have our xgb_predict method, we need to create a validate method implementing the walk-forward validation technique, and this is where things get interesting. Since we are making a one-step-ahead prediction, in this case an hourly one, we predict the first item in the test set, then the second, and so on. Walk-forward validation works like this: we use the train set to predict one step into the future, then we add that record to the train set, retrain the model, and predict the second item in the test set; we add that to the train set, predict the third, and so on until we have gone through all of the records in the test set. That is why it's called walk-forward validation: we only ever predict one step at a time, retraining on a history that grows as we walk through the test set. We evaluate the model with the root mean squared error.

The validate method itself is very simple. First we import mean_squared_error from scikit-learn. We split the train and test sets by the percentage given to the validate method, and we initialize the history, which starts out as just the train set. Then, over the whole range of the test set, we take the test X and test y, predict one step, and append the prediction to a predictions list; we then append the actual test row to the history, so that the train set is always updated with the new record. We do this for the full length of the test set, because, as I said earlier, we want to predict only one step into the future, retrain the model with the new record, predict the next step, and so on until we have gone through every record in the test set.
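A sketch of the walk-forward loop just described, reusing the train_test_split helper and the two-column array from the previous snippet; xgb_predict and validate are the method names the narration uses:

```python
from sklearn.metrics import mean_squared_error

def xgb_predict(history, val):
    """Fit on the full history, then predict a single step ahead."""
    history = np.array(history)
    X, y = history[:, :-1], history[:, -1]
    model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
    model.fit(X, y)
    return model.predict(np.array([val]))[0]

def validate(data, perc):
    train, test = train_test_split(data, perc)
    history = list(train)              # the walk-forward history starts as the train set
    predictions = []
    for i in range(len(test)):
        test_x = test[i, :-1]
        predictions.append(xgb_predict(history, test_x))
        history.append(test[i])        # add the true observation before the next step
    # squared=False gives the *root* mean squared error
    # (newer scikit-learn versions expose root_mean_squared_error instead).
    error = mean_squared_error(test[:, -1], predictions, squared=False)
    return error, test[:, -1], predictions

rmse, y_true, y_pred = validate(data, 0.2)  # hold out 20% for the test set
print("RMSE:", rmse)
```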
Once the loop is finished, we can compute the error: we pass the predictions and the original values to mean_squared_error with squared=False, because we want the root mean squared error, and we return the error, the original test values, and the predictions.

Let's run the validate method, printing the root mean squared error at the end; it will take a couple of minutes. I passed in 0.2, so we're using 20% of the data for the test set, and as soon as it finishes the root mean squared error will be printed. The root mean squared error doesn't tell you much in absolute terms unless you compare it to the scale of the values you passed into the model, so let's see exactly what value we get and then decide whether it's relatively good.

It finished: the runtime is 3 minutes and 49 seconds, and the root mean squared error is 1.7. Bearing in mind that our variable is on the order of 130 to 180, that root mean squared error is relatively small. Of course, you will have to try out different models and optimize this root mean squared error, minimizing it as much as you can; that is how you find the model that best fits your data. By running just one model and getting one root mean squared error you won't learn much, because you need to compare root mean squared errors across models and see which model minimizes the error. Maybe the parameters we used for our XGBRegressor aren't the best; maybe the learning rate should be smaller, or bigger, I don't know. That's why it's up to you to figure out the best parameters for XGBoost or for any other model or ensemble you want to use. This is the tricky part with time series: it is very hard to predict the future, especially in the stock market, as we're trying to do here; these problems are not simple enough to be solved with a single model.

As a conclusion, my advice is not to use XGBoost, or any other machine learning model, fitted only on close prices, or only on open, high, low, and close prices. You need an edge over the market, so you will need information from other sources to add to your models; if you just try to predict from the close price, or from the open, high, low, and close, you're not going to be very successful, because the markets are far too complex for that. So definitely do not run this model in a real-world scenario in the stock market; do your research and try to get an edge over the other participants. The markets are way too complex to be fitted with one variable, or with a couple of variables that everybody knows; you need some sort of alternative data to feed into your models to try to extract that alpha. With this in mind, I really hope this video helped you, and I'll see you in the next one.
Info
Channel: DecisionForest
Views: 12,184
Rating: 4.8390093 out of 5
Keywords: time series forecasting, data science, machine learning python, machine learning basics, time series analysis python, time series machine learning, time series forecasting machine learning, time series forecasting models, machine learning projects, xgboost machine learning, xgboost, xgboost python, xgboost tutorial, xgboost regression python, stock price prediction python, stock price prediction xgboost, xgboost time series forecasting python, xgboost prediction python
Id: 4rikgkt4IcU
Length: 16min 39sec (999 seconds)
Published: Tue Oct 27 2020