Sales Forecasting with Prophet (Data Science Deep-Dive Project Part 1)

Video Statistics and Information

Captions
Hey everyone, welcome to the 30 Days of Data series. My goal throughout this entire series is to get you to understand what it's like to think like a data scientist: give you resources to truly learn about data processes, the math behind models, which models to choose in certain business scenarios and use cases, and give you actionable code to start projects with. This isn't about just reading the code I'm going to give you; it's about trying things out yourself. Today we're going to talk about forecasting, an entire forecasting pipeline, and this is what's going to start our series. The one request I have for you is to please engage: ask me questions, request videos for the 30-day series, and ask specific questions about this forecasting pipeline, because this is just part one of three. I really want to engage with you and get you to engage with data science in general. Don't forget to subscribe if you haven't already: there are going to be roughly 30 technical videos, some about feature correlation and the math behind it, a full segmentation pipeline, a regression series that will let you delve into identifying the right regression model for your business use cases, and so on. Notebook links are in the description, so let's get started.

I just realized I forgot to introduce myself to anyone who might not know me. I'm Priya, and this is Mango. I'm currently a senior data scientist at Uber, where I work on Uber advertising, and I've been a data scientist since I graduated with a degree in astrophysics from UChicago.

Forecasting is essentially time series modeling. Sales or demand forecasting is common for pretty much all businesses: daily sales on Amazon, daily sales at Walmart, the average daily temperature in Chicago. It's really about that kind of data: a datetime stamp and a corresponding value per day. Those are usually the minimum requirements for demand or time series forecasting. Today we're going to use a model called Prophet.

Let's get into the code. We're going to work on demand forecasting using data I found on Kaggle: sales data for stores in Ecuador, with daily data for the various categories of items they sell. The first thing I do is import everything I need, so you can just copy all of this from my Jupyter notebook into yours. I also have two functions I like having on hand at all times and copy from project to project. One is a very simple function that tells me whether data is missing; I use it everywhere, because you always need to check for nulls in your data cleaning process. The other, called mape, uses NumPy to calculate the mean absolute percentage error. You'll hear about this often when it comes to evaluation metrics for machine learning models, because you want to know how accurate your model is when you back-test it. Now let's read the data into the notebook. Looking at the minimum and maximum dates, we have all of the possible data from the beginning of 2013 to 2017.
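The two helper functions described here could be sketched roughly like this (the exact implementations in the notebook may differ; the function names are the ones mentioned in the video):

```python
import numpy as np
import pandas as pd

def missing_data_report(df: pd.DataFrame) -> pd.DataFrame:
    """Report the count and percentage of nulls per column, a quick data-cleaning check."""
    total = df.isnull().sum()
    percent = 100 * total / len(df)
    return pd.DataFrame({"missing_count": total, "missing_pct": percent})

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, returned in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)
```

Note that MAPE divides by the true values, so it is undefined when any true value is zero, which is worth remembering given the January 1st zeros discussed later.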
Generally for time series you want at least a year's worth of data, but one year really isn't enough to capture nuances in trend growth between one year's sales and the next, so two to three years is what you want to target. Time series is the model where garbage in, garbage out is the motto: no matter how many years of data you have, you need to make sure your time series data is accurate. Let's aggregate the entire dataset down to the category, which here is called "family", and the daily sales, so each row represents one day. That's what we want, because we want the highest-volume metric: we're aggregating across all stores nationally, and in my experience high-volume metrics are what give you good data. So many different things can affect a single store's sales volume that the higher the volume, the better the model can detect the actual values through the noise. I'm not really talking about a couple hundred dollars a day in sales, or even a thousand; of course you can create a model on that, but if you want the highest level of accuracy you want as much data as possible, and aggregating gives us more data across all stores nationally.

So we created an aggregate dataset, and I'm plotting it here (I added in a picture from when I plotted this previously). There are so many different columns, and for a time series model you want to handle them one by one. For example, automotive gets its own time series model, with the best hyperparameters tuned just for automotive, because automotive might sell on a different seasonality than books: books might increase in sales at the start of the school year, and different categories will have different spikes. In this pipeline we'll go through one category first, to show you how to create one time series model; part two will show you how to build a full pipeline that optimizes and automates every single category in one go, giving you all the results at once.

Looking at the plot, you can see that before 2014 the volume is much lower. This isn't the clearest visual, so let's audit category by category. For automotive, there are obviously some data discrepancies: at the beginning of each year, presumably January 1st, the sales are zero. This might just be an inventory issue with the way their catalog works; keep it in mind for later. Automotive shows a general linear upward trend; baby care looks really irregular by comparison; beauty again shows a reasonable linear trend; books has literally no data at all; and so on. Within the same dataset you can see how different the forecasting models will have to be based on these differences. Frozen foods is very linear, with similar spikes in similar parts of each year: a really good dataset for seasonality, with an inherent yearly cycle and spikes at the end of every year. Now that we understand our data, and since you can tell the data before 2015 is not great, we're going to use data from August 15, 2015 to August 15, 2017.
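The aggregation step described here could look something like the sketch below. The column names (`date`, `store_nbr`, `family`, `sales`) follow the Kaggle store-sales dataset the video uses, but the toy rows are purely illustrative:

```python
import pandas as pd

# Toy stand-in for the Kaggle store-sales data: date, store number, category, sales.
raw = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"]),
    "store_nbr": [1, 2, 1, 2],
    "family": ["AUTOMOTIVE"] * 4,
    "sales": [10.0, 5.0, 12.0, 6.0],
})

# Aggregate across all stores nationally: one row per day per category.
agg = raw.groupby(["date", "family"], as_index=False)["sales"].sum()

# Pivot so each category becomes a column of daily national sales.
daily = agg.pivot(index="date", columns="family", values="sales")

# Restrict to the two-year window used in the video.
daily = daily.loc["2015-08-15":"2017-08-15"]
```

Pivoting to one column per family makes it easy to audit the categories one by one, as described above.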
Now that we've done a bit of data exploration and understood where the missing values and zeros are, let's look at general missing data and nulls using the function we defined earlier in the notebook. There's no missing data, so we can go ahead. We'll come back to this later, but note that there are many instances, if not all, where sales drop to zero on January 1st; we'll clean that up later, but keep it in the back of your mind.

Now that we have all of the forecasting data, let's group the categories by volume. Volume is really important in time series, because the higher the volume, the better you can forecast past the noise. We'll break the categories into low, mid, and high volume based on percentiles: 0 to 33, 33 to 66, and 66 to 100. All of this is set up so we have lists of columns broken out by volume, and it's all automated. You can use this kind of breakdown to visualize your data on the same scale, because the scales here are very different: some of these features are in the two-hundred-thousand-dollars-a-day region, which is obviously great for forecasting, while some are down around a thousand or two thousand dollars a day. Plotting by volume group, the low-volume metrics still show seasonality, with similar spikes happening yearly (remember we only have two years of data). The high-volume metrics have y-axis scales in the hundreds of thousands, so these will generally be good for us to forecast, assuming we trust the data.

For the rest of the pipeline, let's build a really basic time series model. Prophet is the model we'll use throughout this entire forecasting three-parter, and I'm obsessed with it; I've used it before, it's fantastic, and it makes everything easy to understand. Today we'll go through this pipeline for produce. The next video will automate this entire pipeline so it addresses all of the categories across all the volumes in one single automated pipeline, including hyperparameter tuning and optimization, so all you have to do is put in the data in the right format. The third video will be all about Facebook Prophet, delving into the math behind the model and why I personally love it.

Let's test things out with this high-volume produce metric and try to predict the next 30 days. All we do is set the feature to produce: the date and the daily sales data for produce over two years. That's essentially the format the model wants. Then we add in holiday data. This is an Ecuador dataset, so you can actually pull out all of Ecuador's holidays; I added a link to the holiday library with the codes for the different countries' holidays. What I do here is read out all of the possible holidays in Ecuador and add a sliding window from two days before each holiday through one day after, because that period around the holiday can induce seasonality. We just want to give that information to the model so it knows that if something weird happens that day, or there's a spike, it can attribute it to seasonality.

Let's chat a little about the model before we go straight into it. I'm going to start the forecast on July 1, 2017. Remember our dataset ends on August 15, 2017, so there are about 45 days of data after that start date. I want to do a back-test to see how accurate our model is going to be: train the model up to the forecast start date, predict the next 30 days, and since we actually have the next 30 days of data, compare them to see what the accuracy is. The metric we'll use is mean absolute percentage error.

Here I want to go just a bit into how time series modeling and Prophet actually work. The goal of this entire pipeline is to give you a start-to-end example: the first 30 minutes of a very basic project for me would be building out a pipeline like this just to understand and test the data. Prophet is essentially a regression model that takes into account several components of the data: a trend factor, a holiday component, a seasonality component, and an error term at the end of the equation. Time series essentially depends on something called Fourier series. The way I think about Fourier sums is something I learned in physics class, so it's definitely funny that all of time series is based on it: any trend line you see, no matter how many squiggles it has, is made up of a sum of sine and cosine functions. Prophet determines all these coefficients: when you see trend factor, holiday component, and seasonality component, it's really doing partial Fourier sums in the background, trying to determine the exact weights for those coefficients to recreate the trend you see from historical data. We could always do a full video on that; I'll link some resources on Fourier transforms, Fourier sums, and Fourier orders, which are the back end of what lets you create those time series periods and oscillations. Think sine and cosine coefficients.

I've linked a fantastic article here called "Time Series From Scratch" that shows you how to identify different types of time series, because there are a lot of parameters in the Prophet model that you need to set by looking at your data: whether it's an additive or a multiplicative model, and what type of trend it has. You do have to give the model that information. Just looking at that article, it's very obvious that sales data in general, at least all of the data I have ever worked with across the forecasting projects I've built, has been multiplicative, because over time demand increases but it never just skyrockets; there's always a limit to that growth, and it happens gradually. Keep that in the back of your mind; we'll go through the different Prophet parameters in the third video. For now, let's not configure anything, just throw the data into the model and see what happens.

For a Prophet model you always have to rename your columns: you need a date column and a value column (sales, or whatever you're forecasting), renamed to ds and y, because that's how the model works on the back end. I also always make sure the value is numeric and the date is a datetime object. The training set here is everything before the forecast date, because we're back-testing: we forecast the next 30 days and compare them to what actually happened to create an accuracy score. Reading in the model is literally about ten characters, and there are so many parameters that could be tuned or left alone; I really, really recommend going through the Prophet documentation, it's absolutely fantastic. For right now we just read the model in, fit the training dataset, and make a prediction with the number of prediction days set to 30. Then we have our predicted data frame; it's just as easy as that. You can see we have predictions from July 1st all the way to July 30th. The variables you could potentially add to this model include growth, changepoint_prior_scale and changepoint range, the seasonality settings (or just a true/false on whether there is seasonality; with sales data there's always seasonality), the seasonality mode (in our case multiplicative), and holidays. We didn't add any of them, so this is the most basic prediction we can get, but we'll consider these variables in the next video.

Now that we have the predicted data frame, let's merge it with the true data and compare day by day. You can see that honestly it's pretty spot on: for the next 30 days, it's not far off at all. If you just need a general directional metric, you don't even need to configure this model; it detects all of these things on the back end. The mean absolute percentage error is pretty low, at six percent. Again, we didn't do cross-validation yet. Cross-validation repeats this exact test, cutting the model off at a date and predicting the next 30 days, where we have the true data to compare against to get the MAPE. 100 minus the MAPE is essentially the accuracy score, so in this case we have 93 percent accuracy, and you can assume that future 30-day predictions will land within roughly 93 to 107 percent of the actual values; that's the general range you'd provide for your sales forecasting. But we only did this with one cutoff at July 1st. You really have to cut it off at various times and predict 30 days out, over and over across a large time frame, so you can normalize, because sometimes a cutoff lands in a place where you don't have enough data to predict the future, or something's off. Cross-validation is a way to assess the effectiveness of the model normalized over all the potential downfalls: the time period cutoffs, the dates of those periods, and so on. This already looks like a decent model and we could just end it here, but we're going to go through cross-validation anyway.

I'll add in some Facebook documentation here so you can look at it as I speak. For cross-validation, you pass in your model, already fit to your training data, and you start with an initial period: the amount of data the model knows. Say the model is trained up until August 15, 2016. With initial set to 365 days, the training start date is August 15, 2015 and the end date is August 15, 2016: one year's worth of data. The start date always stays the same, and the end date increases by 30 days at a time, because that's the period: it keeps taking 30 extra days of data, adding it to the training set, predicting 30 days out, and computing a MAPE, repeating as many times as it can given our cross-validation settings. After all that, you get the general MAPE across all the possible conditions, broken down by horizon: the next three days, next four days, next five days, and so on. These are all metrics that help you evaluate the model. I generally use MAPE, because mean absolute percentage error is a very simple percentage comparing the true value to the predicted value, and you can easily explain it to stakeholders: we're within 100 minus the MAPE, so 91 percent accuracy if the MAPE is nine percent, and so on. Going down the horizons, the MAPEs look good, but when you get to 30 days the MAPE is 3, which is 300 percent error. That looks off, right? Obviously something is a little wrong in the model, and we'll see what that is soon, but I wanted to give you the general idea of cross-validation: understanding the performance of the model once you normalize over all these cutoffs and factors that could influence the MAPE. Our single back-test MAPE of six percent seemed in line with all the other MAPEs, but there was an instance where the model error was very, very off, and we want to understand why.

The performance metrics give you the overall MAPE; if you want a more granular view of your daily data, you can use the formula I have here and check the MAPE for every single day. I sorted with ascending=False, so you see values like 44, 43, and 37 percent at the top, while in general, looking at df_cv, the day-level MAPEs are around five to six percent. So there were situations where the daily MAPE was around 44 percent, and that's what made the 30-day horizon look off in the evaluation. What's really, really interesting here is that when we sort at the day level from least accurate onwards, almost every single one of these cases has something in common besides them all being underpredictions (meaning we're severely under-forecasting in these cases): look at the dates. They're all around the same time period, and that time period is what's inflating our error. Anything predicted around December or January looks off. Go back and just plot produce: around January 1st the data looks weird. That's obviously a systematic error, not an inherent seasonal trend; it's not true that sales genuinely go to zero every year at that time. It's a data issue, and we should normalize it out. That's why there have been issues in the time series model specifically around that one date. It's really interesting to test all of these things out, think critically, and go much deeper than just running code: go one step beyond to see what's actually going wrong in the model. If you had just looked at the 93 percent accuracy and said this is great, you would have missed that there's an issue in December and January we might have to fix because of the systematic error.

The last thing I'll hit on in this video, before we come back in the next video and create a pipeline that fixes all of these things automagically, is hyperparameter tuning. It's fantastic with Facebook Prophet because they did everything on the back end: they give you recommendations for what to test out, and they tell you which parameters can be changed and which are generally inherent to the model given the data. The two to tune are changepoint_prior_scale and seasonality_prior_scale, which essentially control the weight given to the historical data. The default for changepoint_prior_scale is 0.05; increasing it makes the trend more flexible. But sometimes trends are rigid: liquor sales are always going to increase for St. Patrick's Day and Valentine's Day, and champagne sales are always going to increase for New Year's. Some things are inherent to the category of data, so you don't want to make the trend too flexible, because it might miss those real trends and overfit to noise elsewhere. It's really interesting; there's a lot to think about. These two parameters are the ones Prophet says can be hypertuned, and the documentation gives you exact lines of code to find the best tuning results: you run it, it tests every single combination of the parameters you gave it, and you get tuning results. You want the lowest RMSE, the metric they use to calculate accuracy here: lowest error, best model. Sorting by root mean square error, a changepoint_prior_scale of 0.1 with a seasonality_prior_scale of 10 seems to be best.

And you can actually automate this. Think about it: you don't want to copy and paste these parameters, re-run the model, and so on for every single category. To get the changepoint and seasonality prior scales out of the tuning data frame programmatically, note that it's sorted with the lowest error first, but the index will vary based on what you ran, so reset the index; then index 0 always has the lowest error. Drop the rmse column, locate row 0, and you get changepoint_prior_scale and seasonality_prior_scale in a dictionary. If you name that dictionary params, then m = Prophet(changepoint_prior_scale=params["changepoint_prior_scale"], seasonality_prior_scale=params["seasonality_prior_scale"]) sets up the model in a way that can run in a loop, fully automated, so you don't have to do anything yourself. With the best combination, the MAPE is 4.9 percent.

Wow, I have never filmed something so long that my camera died. But now we see that the MAPE went from 6.3 percent to 4.9 percent; it is lower now. Come back for the next video, where we use all of these learnings to create a full forecasting model for every single category and optimize it with all of these different parameters. The third video will be all about Facebook Prophet and its parameters, so you'll learn more about what you're optimizing. Thanks so much for watching, don't forget to subscribe if you made it this far, and I'll see you next time.
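The parameter-extraction step described above could be sketched like this. The column names follow Prophet's hyperparameter-tuning example in its documentation, but the RMSE values below are made up for illustration:

```python
import pandas as pd

# Assumed shape of the tuning output: one row per parameter combination,
# with the cross-validated RMSE for each (values here are illustrative).
tuning_results = pd.DataFrame({
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 10.0, 1.0],
    "rmse": [5200.0, 4800.0, 4100.0, 4600.0],
})

# Sort so the lowest error comes first, reset the index so row 0 is always
# the best combination, drop the metric column, and pull row 0 as a dict.
best = (tuning_results
        .sort_values("rmse")
        .reset_index(drop=True)
        .drop(columns="rmse")
        .loc[0]
        .to_dict())
```

The resulting dictionary can then be unpacked straight into the model, e.g. `Prophet(**best)`, which is what makes the per-category loop in part two possible.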
Info
Channel: The Almost Astrophysicist
Views: 14,772
Keywords: sales forecasting, python sales forecasting, python fbprophet, facebook prophet, prophet forecasting, timeseries, end to end data science project, end to end forecasting pipeline, data science project, data science tutorial, data science machine learning tutorial, sales forecasting tutorial, forecasting with prophet, data science forecasting
Id: hht0iKzviWE
Length: 27min 43sec (1663 seconds)
Published: Mon Jul 24 2023