Time Series Analysis in Python Tutorial - V1

Video Statistics and Information

Captions
What's up, data friends, it's Janice here, and welcome to my channel. This video is part one of the time series models series of videos that I will be running. In this video I'm going to go through what time series models are, explain how they work, explain the machine learning process, how to import and install your packages and libraries, and what problem we are trying to solve. Then we're going to load some raw data into Python and go through the data pre-processing phase. Then I'm going to show you how to run ARIMA, which is autoregressive integrated moving average, and how to tune your ARIMA model by finding the best parameters so you can improve the accuracy, and then I'm going to show you how to run auto ARIMA. After that I'm going to show you how to run Prophet, which is another time series model, provided by Facebook, and then I'm going to show you how to run two or more models together. So this example is how to run both ARIMA and auto ARIMA together. In the next example I'm going to show you how to add both Prophet and regression, so now, in the same loop, we're going to have four models in one piece of code. At the end I'm going to show you how to run it on every single country, which is about 230 countries in our data set, so this is all the output from all the countries. Then I'm going to show you how to combine the actual data with the forecast, and we're going to export the data both to a CSV and to an SQL database.

Afterwards I'm going to show you how to take your results and predictions and build this Power BI dashboard from scratch. Here we can see all the countries: there are actuals, there are validation predictions, and there are future predictions all the way to 2023. If we select a country, for example the United States of America, we can see the four models and their predictions, and we can see the mean square error. So if we were to choose the best model, it would be the one with the
least mean square error, so that would be the regression one. So if I were to do a prediction for the United States, I would choose this number over here for 2023. I have all four models here, and we should expect the GDP per capita of the United States to be about 67,000 for 2023. We can do this for any country, or for all countries together. If we check India, for example, we can see there is a bit of difference between the models, but we can choose the one with the least mean square error, which is the ARIMA one, so that's the green one, this one over here. So we should expect an increase from 7,000 to 10,000 GDP per capita for India in 2023.

Right, and before we start this tutorial: if you're new to my channel and you're passionate about data science, please consider liking this video, subscribing to my channel and enabling notifications for my future videos.

Right, so let's start. What is a time series model? A time series is a series of data points listed in time order. If we go over here just to show you an example, a time series is when we have a series of dates, so you can see here it starts from 1990 and goes all the way to 2008, and we have some values. Then we take this raw data, we feed it into a model, and we get a value prediction for each date, or we can forecast going forward. So if I go back again, time series forecasting is the use of a model to predict future values based on previously observed values. The process is similar to the machine learning process, but the inputs are limited to two: as we have shown here, we only have date and value, and we predict going forward.

So if we go through the machine learning process quickly: we start with the problem formulation, we gather the raw data, then we do our data pre-processing, and then we have to split the raw data. This is where we take the split: as you can see in this example, we have the mobile sales per month and year, so we split the data somewhere around here. Then we select and train
our model based on this data, so this is x train, all the way until here. Then we do the model evaluation: we predict, which is the red dotted line, and then, based on a metric (we're going to use mean square error), we evaluate our model. Then we do parameter tuning, so we adjust some of the parameters to minimize the mean square error as much as possible and improve our model. After we are happy with our model, we run our final model: we use all the data from here to September 2021 to train the model and then run a forecast on the future, which is going to be this dotted line over here.

Right, moving on, we have our packages and libraries. We're going to be using all the main ones, so os (operating system), NumPy, pandas, Matplotlib and Seaborn. We're going to be using, from statsmodels, the ARIMA one, and from sklearn, the linear regression. These libraries are for the auto ARIMA, this is for Prophet, and there are some additional libraries for datetime. If you're missing any of these libraries or packages, you can run pip install and then the name of the library, and that's going to install it on your PC.

Right, moving on is the problem formulation phase. Here we need to ask the question of what we are trying to solve, and the answer is that we want to forecast the GDP per capita per country for the next 7 to 10 years. That's what our time series models are going to forecast.

Moving on, we need to load our raw data. Here I load the raw data from an SQL server, but for you I'm going to provide a CSV file that you can load the raw data from; I'm going to have the links in the video description below. To load the raw data, all you have to do is say raw_data equals pd.read_csv, then in here you need to paste the path where you have downloaded the raw data, and at the end add the name of the file and .csv. Also use encoding as latin-1, and this is going to
load the raw data from the CSV file into raw_data over here. Then what I do here is print the shape of our data, so our shape is 270,000 observations and eight columns, and I also print the head of our data frame, so you can see all the columns and the first five rows of our data set.

Right, next we need to go through the data pre-processing phase. The first thing I always do is check for null values: by adding our data frame here and then saying .isnull().sum(), we see the number of rows that have null values. By the way, I'm not using this raw data, I am using the SQL raw data, but both of them are the same; one comes from SQL and the other from the CSV, and it is exactly the same data frame. Anyway, as you can see, we have about 2066 null values in Value, so we need to investigate this. However, what I did is go straight for the delete, because I know that these null values do not come from the GDP per capita indicator that we are going to use.

Just to show you how to investigate the nulls, we can write this query: we take the Value column and say .isnull(), and this returns all the rows that have a null value in Value. However, I only want to investigate the indicators, so if I add the indicator in here and then say .unique(), we can see the unique indicators that have null values. The GDP per capita, which is this one over here, GDP per capita, PPP (current international dollar), doesn't have any null values, and this is why I drop the null values straight away. As you can see, I call dropna on the subset of Value, and then I'm also removing the last two columns, which are full of null values and which I don't need. Then I check again for null values, and I don't have any.

The next thing I do is a standard piece of code that I always run to investigate all the elements within each feature, and what this code does
is that it goes through every single column of our data frame and prints out one of two things: either the number of distinct values, so 35 distinct values, or the actual elements within the column. If I change this number-of-values threshold to, let's say, more than 36, so the columns pass the threshold, it's going to print all the demo IDs and all the indicators. So if I run it again, you can see that now it says the number of values for feature demo in is 35, and then it prints all 35, so 1, 2, 3, 4, all the way to 35 distinct demo IDs. Then it does the same for the indicator, so it prints all 35 indicators up until here. For the location, country, time and value, it just prints the number of distinct values in those columns.

Right, moving on, the next thing I do is select everything from my data set where the indicator equals GDP per capita, PPP (current international dollar), as this is the only thing I want to forecast going forward. Additionally, I also exclude a few countries from our data set. The reason I'm excluding them is that they have some problematic data that I identified later on in this analysis, but for ease of this tutorial I'm just deleting them here. The way to delete them is to run the same code, where you say .isin, but if you add this symbol at the beginning of your data frame, just before you run the isin, it's actually going to exclude them, and then I'm passing the result back into this data set. If we run this, we get the same columns as we have above here, but now we only have one indicator: before we had multiple indicators, now we only have indicator equal to GDP per capita.

The next thing we want to do is limit the data to the columns we need. We only need country, which is this one, time and value, so I'm passing these three columns back into our data frame, and you can see we limited the data. The next thing we want to do is to
change the time to a date, so change this column to be an actual date, as most time series models require a date structure. To do this I'm taking the time, making it a string and then adding the .01 and then the .01, as you can see over here. Then I'm taking this column as it is, so now it's a string, remember, passing it into pd.to_datetime, and passing the result back into Time, so now Time is actually a date; it's actually a datetime.

The next step is to split our raw data into x train and x valid. Our x train is going to have all the data before 2012, and our x valid, which we're going to use for validation and evaluation, is going to have data after 2011. Just to visualize this, because we do have the output, if I go over here: this is just India, but I'm doing this for all countries. I am splitting all this GDP up until here, so imagine I'm putting a line here, and I'm selecting all of this as training data, and then from here, 2011 all the way to 2017, the green line, I'm going to use as a validation data set, which is exactly what I'm doing with this code over here. If we print the two shapes, we can see that our training set has about 3966 observations and our validation data set has about 1 32 observations, and both of them have three columns: country, time and value.

Right, moving on, we are going to run ARIMA, which is the autoregressive integrated moving average time series forecasting model, and we are only going to run it on one country for now, just to simplify things. A few notes on ARIMA: ARIMA uses a number of lag observations of the time series to forecast future observations, and it takes as input parameters the p, which is the number of lag observations in the model, also known as the lag order, and this is the AR, so the autoregressive; the d, which is the degree of differencing, which is the integrated; and the q, which is the
size of the moving average window, which stands for MA, the moving average. If you want to learn more about ARIMA, you can follow this link and read more about the math of ARIMA and how everything is calculated. I'm not going to get into it, because this tutorial is more focused on applying it than explaining it.

Right, so the first thing I want to do is create two new subsets of data that have data for only one country, and I'm going to use Australia as an example. So I create this au; if I copy this quickly and run it, you can see that we only have data for Australia in this data frame. I also create au2, which is just the time and the value from this data frame. I repeat the same process for our validation data set, because I'm going to use these two to validate our results. Then what I do over here is set the time as an index; as you can see, time is now an index. You have to do this to run time series models. The last thing I do is create an index for seven years going forward. As you can see over here, the index I'm creating is a range from the au2 index minus one, which means it takes the last value, it uses frequency AS, which is an annual frequency, and it uses seven periods. So it starts from here and adds six more years, as you can see, from 2011 all the way to 2017.
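The index step described above can be sketched roughly as follows. This is a minimal illustration, not the code shown in the video: the column names, the made-up values and the name forecast_index are assumptions, and freq="YS" is used as the year-start alias (the video mentions AS, which newer pandas versions deprecate in favour of YS).

```python
import pandas as pd

# Hypothetical stand-in for the single-country training subset; values are made up.
au = pd.DataFrame({
    "Country": ["Australia"] * 5,
    "Time": pd.to_datetime(
        ["2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01", "2011-01-01"]
    ),
    "Value": [38000.0, 39000.0, 40500.0, 41000.0, 43000.0],
})

# Keep only time and value, then move the time column into the index.
au2 = au[["Time", "Value"]].set_index("Time")

# Seven annual dates starting from the last date in the training index.
forecast_index = pd.date_range(start=au2.index[-1], periods=7, freq="YS")
print(forecast_index)
```

Starting the range at the last training date mirrors what the transcript describes, so the seven periods run from that year through six further years.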
The reason I've done this is that I'm going to use this index as a forecasting input.

Right, the next step is to run ARIMA with just some random numbers. I call the model ARIMA and into the model I fit au2, which is the au2 created over here. I give a random order of 2, 0, 0: the 2 is the autoregressive part, from here, the first 0 is the degree of differencing, so that's my I, and the last 0 is the moving average we have from here. Then I fit my model with disp equals minus one. Then I create some predictions: I call model_arima_fit.forecast, I want to forecast seven data points, and I want to take the first row. Then I'm passing the index I've created into my forecast. Just to run it bit by bit to show you: if I copy this and run it, my forecast is just seven numbers without an index, and if I run this, I pass the index I've created, of those seven dates, into my forecast. So if I run this now, you can see there is an index on my forecast.

The last thing I do here is plot my predictions. I initialize an empty plot with this line of code, then I pass a line plot, which is this blue line over here: for x I'm using time, for y I'm using value, and for data I'm using au, which is the first one, not the indexed one, and then I'm setting the title to AU. Then I'm creating two additional plots, which are two additional lines. The red one is the forecast, because that one is the forecast's .plot call, so red is the forecast and blue is the actual. As you can see, we haven't done a very good job, as our prediction is plateauing while the actual is still going up. The last thing we print is labelled the root mean square error, but it's actually the mean square error, not the root mean square error, which takes as inputs the actuals, so this
comes from the validation set over here, and then the values. It compares them with the forecast values, it calculates the difference, it takes the mean and then it squares the mean, but I'm not squaring the mean, it's just the mean now. So what this is telling us is that the predictions from our current model are off by 3130 on average.

The next thing we have to do is try to improve our model, and to do this we have to try multiple different combinations in this order over here. One way of doing it is this piece of code I have down here. What I've done is create a range for p, d and q from 0 to 5, and then I pass the product of this range, 0 to 5, into a list. Just to show you what this is, if I copy it and paste it down here, it is all possible combinations from zero to five. You can see I keep scrolling, and it's not actually five because it starts from zero, but it's five different numbers and all their possible combinations from the product. So what I do here is say: for parameter in pdq, I try to fit an ARIMA again on the same data set, passing the first order, which is going to be this order, then I fit the model and then I call the AIC. The AIC is the Akaike information criterion, which is an estimator of prediction error and thereby of the relative quality of statistical models for a given data set. What it means is that it's just a metric that evaluates how good your fit is, and it works like the mean square error, so the smaller the number, the better it is.

What this code does, if I run it, is go through every possible combination and print this AIC, and what we have to do now is choose the combination with the minimum AIC, which is this one, 0, 2 and 3; as you can see, it is the minimum. You can also see that we don't have all possible combinations, and this is because ARIMA is not going to work on all possible combinations. For example, 4, 4, 4 is not here, and if I put 4, 4, 4 at the top, it's going to
throw me an error. There you go: you can see d more than two is not supported, based on the data set we currently have. So what we need to do now is try this combination, the 0, 2, 3, which as you can see gives us the lowest value. If I go to the top and put in 0, 2 and then 3, and run this, you can see we have a much lower mean square error, and our prediction is much closer to the actuals.

The challenge with this method is that we have only done one country, which is Australia, and different countries are going to have different optimal parameters. So imagine repeating the same process we've done here 189 times, which is the number of countries we have: it's not very efficient. This is where auto ARIMA comes into play. If I scroll down here, this is the second option, where we have the auto ARIMA model. Running auto ARIMA is very similar to running ARIMA: you call the model, you fit the same data in, and you say whether there is seasonality or not; in our case, because it's GDP per capita, there is no seasonality. Then you have to set the m. Now the m is a bit challenging, because if you read about it, the m parameter relates to the number of observations per seasonal cycle, and it is one that must be known beforehand. Because we don't have any seasonal cycles for GDP, as far as I know, I'm going to try 7, and I'm also going to try 10 and 15; remember, we use years. So if I run this with 7 (by the way, everything else is exactly the same as the previous code, which we have just explained), you can see that 7 is not giving us a better model than the one we have identified. Then if we try 5, maybe: 5 is still not as good. If we try 10, still not the best. So I'm just going to leave it at 7.

My main point here, the reason I have both ARIMA and auto ARIMA, is that you cannot be certain that auto ARIMA is going to pick the best parameters. My conclusion is that you have to run some auto ones, and when I say auto I mean manual
ones, find the best parameters, and at the same time run the auto ARIMA, just to compare the two, which is something I'm going to show you later on when we combine all the countries and the models together.

Right, data friends, I'm going to stop this tutorial here and finish the rest of this series in the next video. I hope you've enjoyed this video and gained enough value out of it. If you feel like you did, please click the like button, subscribe to my channel and enable notifications for my future videos, and if you have any questions, please let me know in the comments below. Otherwise, thank you very much for watching, and I'm going to see you in the next video.
Info
Channel: Data 360 YP
Views: 12,317
Keywords: Time series models in python, Arima in python, Auto arima in python, Prophet in python, Regression in python, Arima example in python, Prophet example in python, How to run time series models in python, How to run arima in python, How to run prophet in python, While loop with multiple time series models in python, Time series machine learning in python, time series forecasting, plot time series python, python time series, time series forecasting python, facebook prophet, arima
Id: axjgEgBgIY0
Length: 25min 13sec (1513 seconds)
Published: Mon Dec 21 2020