Multi-Variate Time Series Forecasting (VAR Model) | Complete Python Tutorial

Captions
Hello everyone, my name is Nachiketa, and welcome back to another video. In this video I'm going to guide you through a complete project: implementing vector autoregression (VAR) on time series data. I've made several videos on time series forecasting, and there were several requests for an implementation of a vector autoregression model. As usual, I've gone to great lengths to make sure the code I provide is as simple as possible, and I'm sure that by the end of this video you won't have any doubts left. I'm going to guide you through the complete process: getting the data, pre-processing it, the checks you need to run first, and then forecasting your time series into the future with a vector autoregression model.

With that, let's get started. This is the code I'm going to be using. If you want to understand vector autoregression in depth, the intuition and the mathematics behind it, I've already made a video on that; in this video I'm going to focus on the implementation. We start by importing all the necessary libraries, and I'll explain which libraries we need as we move through the code.

If you don't know what vector autoregression is, here is the gist. This is what the equation looks like. You use vector autoregression when there are two or more time series that are interdependent. Normally, in autoregression, we want to make predictions into the future and we use the previous values, the previous time lags, of a particular time series to do it. In vector autoregression we assume there are two time series that are related to each other. So when we predict a time series, say y1, we use the previous lag of y1, which is y1 at t minus 1, as well as the previous lag of the other time series, y2. Similarly, you can look at the equation for y2, which depends on both the previous lag of y2 and the previous lag of y1. This is the fundamental concept. I've explained it in more detail in my theoretical video; here I'm just going to focus on how to actually implement it.

Let's start with the data set. I'll be providing this entire code as well. I get the data set from this GitHub link, and it contains economic information about a particular country, like the real gross national product, the potential gross national product, and so on. The first step is of course to read the data set. We do that here using the read_csv function: we provide the URL of the data and set parse_dates to the date column, which makes sure pandas understands that the date column holds dates and does not treat it as a string. After that we set the index column to the date. Here I'm printing the shape and the first five rows so that you can have a look. This is how my data set looks: there are eight time series and 123 rows. You don't need to know what each of them represents; the basic idea is that you have some time series and you want to make a forecast. rgnp stands for real gross national product, pgnp stands for potential gross national product, and ulc is unit labor cost. The rest of the information you can find at the URL of the data set.

The first step is always to plot the data set to get a feel for what the data actually looks like, and for that I'm using the matplotlib library. Since I have multiple time series, I create subplots with four rows and two columns, so that I get four times two, which is eight plots in a single figure. I enumerate through all the columns one by one, which you can see over here, and each column is plotted with its own name, which I set over here. The rest is just there to make the plot look better.

Here is what the plots look like. You can use vector autoregression to forecast all of them together, but for the sake of simplicity I'm going to use only two time series. Just by visual inspection, it looked like rgnp and ulc might have some sort of relationship; they seem to increase in the same pattern. So for now I'm only going to use vector autoregression for these two series, but you can extend the concept to more time series as well.

Before starting, it's important to check whether your data is stationary. When I train the model I'm not going to give it stationary data, and I'll explain why later, but in general this is what you should do in every time series problem. We use the augmented Dickey-Fuller test, and we have to tell it which time series to check. I'm not going to go into the details of this test; I'm just going to explain how to interpret it. I run the test on the rgnp column of the macro data, and similarly on the ulc column, and I print two values: the ADF statistic and the p-value. The only thing you need to look at is the p-value. If the p-value is less than 0.05, your data is stationary; if it's greater than that, it's not. I can see the p-values are 0.98 and 0.99, which means my data is not stationary and needs to be converted to stationary. What works in most cases is simple differencing. I had commented that out here, so I'll remove the comment. This is first-order differencing: from each value we subtract its previous value. I do the same thing for the ulc column. When I print this, I can see that both p-values are less than 0.05, which means my data set is now stationary. If that doesn't work for you, you might have to difference multiple times before the data becomes stationary, but one round works in most scenarios.

Before proceeding further with vector autoregression, it's important to check whether the time series you're working with are actually related to each other, because that is the foundation of vector autoregression. You can check that using the grangercausalitytests function in the statsmodels library. If I go to the original documentation, you can see you just have to call the function and provide the data set, and it tells you whether the time series in the second column causes the time series in the first column. For example, if I give it the two columns ulc and rgnp, it will tell me whether rgnp causes ulc. I can also specify how many lags I want to check. For now I'm only checking four lags; if that doesn't work for you, you can extend it to more lags and see if that works. Then I do it the other way around: does ulc cause rgnp?

When I print the results, this is what I get. Again, there's a lot to understand about these tests, but if you simply want to interpret them, look at the p-values. The p-values should be less than 0.05. If they are, there is evidence that rgnp Granger-causes ulc. If I look at the other result, does ulc cause rgnp, I see that the p-value is not less than 0.05 for the first lag at least, but I do get values less than 0.05 at the second lag, the third lag and the fourth lag. So I have to take a lag greater than one if I want to do forecasting for this; that's clear right off the bat.

Before actually feeding the time series into the model and fitting it, we have to split the data set into training and testing sets. First I extract only the two columns I'm working with, ulc (unit labor cost) and rgnp (real gross national product). The shape of the data is 123 rows, so I have 123 observations to work with. I split the data set like this: the last 12 values go into the testing part and the rest go into the training part. If I run this code and print the shape of the test data, you can see there are 12 rows.

Let's proceed with the fitting. We still have to determine how many lags we want to consider. For that I use the VAR class, which I imported from statsmodels. I provide it the differenced data, and I don't fit the model here, because I'm not going to use this class to fit my model. I'm only using it because it provides a very useful function called select_order, where you simply provide the maximum number of lags you want to consider. In this case I've given it 20. When I print the summary, it runs the analysis automatically and reports criteria such as the AIC, BIC, FPE and HQIC scores. You don't have to understand these in detail; just understand that for a good model all of these should be as small as possible. Normally, looking only at the AIC score is enough. What this class does really well is highlight where the minimum was found. For AIC, FPE and HQIC the minimum value was found at lag 4, so that's what I'm going to use to build my model: the past four lags. The equation I showed you considered only one lag, only y1 at t minus 1 and y2 at t minus 1. Now we're going four lags back, so in the equation you will also see t minus 2, t minus 3 and t minus 4. You don't have to work that out by hand to build the model, because the predefined classes handle everything for us.

To fit the model I'm going to use the VARMAX class, and the reason I use this particular class is that it makes forecasting very easy. I simply provide the training data. Note that I'm not providing the differenced data; I'm providing the non-stationary data, and I'll tell you why. I specify the order as (4, 0). This is a VARMAX model, which means we have to specify the order for both the autoregression part and the moving average part. But we're building only a VAR model, so the moving average part does not exist here. The first order, for the autoregression part, is 4, and since we're not using the moving average part, the other order, also called q, is 0. So this is a plain VAR model. The VARMAX class has a parameter called enforce_stationarity, and I set it to True. When I went to the original documentation, it said that if you keep it as True, the AR parameters are automatically transformed to enforce stationarity. That is why I did not provide the differenced data to it.

Now I simply fit the model using var_model.fit and print the summary. The summary tells you about the fitted equations, and you can see what the AIC, BIC and HQIC scores are. This is the main part; this is what tells you what the equation is. You can see that the equation for the time series ulc uses all of these parameters. L1 means lag one, so ulc depends on the previous lag of ulc and the previous lag of rgnp, and you have all the values from L1 up to L4 with their respective coefficients. Similarly, you have the equation for rgnp, with all four lags of ulc and rgnp and their corresponding coefficients. So now you have the equations for your time series model as well.

Making a forecast is really simple because of the VARMAX class. I take the fitted model and call the function get_prediction, and I have to provide where I want to start making predictions and where I want to end. The start is the length of the training data set, so the predictions start from wherever the training data set ends. For now I've specified the number of forecasts as 12, so it's going to forecast 12 steps into the future. If you want to predict further into the future, simply change this number; if you change it to 36 or 48, the number of predictions increases. The end is where I specify the last forecast: the length of the training data set plus n_forecast minus 1. Just running this line gets you the predictions, and predict.predicted_mean gives me the mean of the predictions. You can look at all the predictions over here. I have renamed the columns to ulc_predicted and rgnp_predicted, and you can see we have predictions for the next 12 time steps, up to the end of my data set. If I want predictions further into the future, I can simply vary this parameter. Let's say I set it to 24 and run the same code: I can see the predictions now go up to 1992. For now I'm not going to do that, but feel free to use it. Instead of the starting and ending positions, I can also specify a starting date and an ending date. I'm going to comment that out, leave it here and put it in the GitHub repo, and you can explore it there.

So my predictions are done. Now comes the part where I actually plot them and see how my model did. I create a new pandas data frame where I concatenate the original testing data set and the predictions, store it in a new data frame called test_vs_pred, and plot it with a figure size of (12, 5). This is how my predictions look. The two lines at the top are rgnp and rgnp_predicted, and the bottom two lines are ulc and ulc_predicted. The model seems to be doing pretty well. If you want to put a number on how well it is doing, you can compute the mean squared error, which is what I'm doing over here: I provide the predictions and the testing values and print the result. I'm printing two things, the root mean squared error and the mean value. The general logic is that if the mean value is, say, 100, then the root mean squared error should be much less than that. When I print it, I can see that for ulc the mean value was 178 and the root mean squared error was 54. For rgnp the mean value was 3900 and the error was 345, which is pretty good because it's less than 10% of the mean value.

So that was all for this video. If you liked it, do like the video and subscribe to the channel, and see you in the next video.
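The captions refer to an on-screen equation that did not survive in the transcript. As a rough reconstruction of what is described (each series depends on its own previous lag and on the previous lag of the other series), a two-variable VAR(1) can be written as below. The symbols c, a and ε are notation assumed here, not taken from the video.

```latex
\begin{aligned}
y_{1,t} &= c_1 + a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + \varepsilon_{1,t} \\
y_{2,t} &= c_2 + a_{21}\, y_{1,t-1} + a_{22}\, y_{2,t-1} + \varepsilon_{2,t}
\end{aligned}
```

With four lags, as chosen later in the video, each equation gains the corresponding terms in t-2, t-3 and t-4 for both series, each with its own coefficient.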
Info
Channel: Nachiketa Hebbar
Views: 52,912
Id: 4jv1NGlAc_0
Length: 15min 0sec (900 seconds)
Published: Sun Jul 11 2021