Implementing and Training Predictive Customer Lifetime Value Models in Python

Captions
My colleague Ben van Dijk and I are both data scientists at DataScience.com, and today we're going to give you an introduction to customer lifetime value modeling in Python. Before we start, I'd like to say a few words about our company. At DataScience.com we have created an end-to-end, comprehensive platform that provides data scientists with a collaborative and scalable environment, allowing them to do their work more efficiently. We let data scientists use the open-source tools they're familiar with and love, while allowing them to move their work off the platform and into production as quickly as possible. On our platform, your data science projects are backed by the version control provider of your choice, whether it's GitHub, GitLab, or Bitbucket, and by scalable cloud infrastructure. I welcome all of you to stop by our booth later for a demo of the platform.

To get back to customer lifetime value, here is a brief overview of our presentation. We're going to start with a little bit of background on CLV and why you should care about customer lifetime value models. Then we'll give you a quick overview of the different CLV contexts; it's important to understand the definitions of these contexts because they help you determine what kind of model you need for your business setting. Then we'll go into a deep dive on a particular model, the Pareto/NBD probabilistic model. This is a Bayesian model, and we're going to tell you why you should use it and what kind of insights you can extract from it. Finally, we'll have a little lab where we actually implement the model in a Jupyter notebook using PyMC. You can follow along with the notebook at the link here; it's a public GitHub repo that contains all of the information about the model and how you can train it on your own data set.

So what is customer lifetime value? Customer lifetime value is essentially the total profit over the entire relationship with a customer. When you think about profit, you think about costs, the costs you have to take into account to attract, serve, and retain a customer, and revenue. The revenue is essentially the number of transactions a customer is going to make times the value of each of those transactions in dollars. There are additional components to that value that are harder to quantify, like network or word-of-mouth effects: a customer can bring their friends into your business, and that's actually extremely valuable. In our presentation we will focus exclusively on the revenue part of this equation; we will not address the cost part, simply because modeling revenue is in general much more difficult than modeling costs. If you're familiar with the customer lifetime value literature, these are known as customer equity models. So we can really summarize CLV by saying it's a metric that tries to capture how healthy the relationship is between a customer and a business.

Before we go into the details, can I get a show of hands of who has implemented a CLV model before? Okay, about 30 people, that's great. So why would you care about CLV? Well, CLV is used for many different purposes. You can use customer lifetime value to segment your customers and find the most profitable ones.
You can also identify the traits and features associated with these very valuable customers, so you can try to find more of these highly valuable customers in the future. It's also really useful for determining how to allocate your resources among the different customers you have, and it gives you a baseline for how much you should pay to acquire new customers.

Now let's discuss the business contexts associated with CLV. The first dimension we're going to look at, and actually the most important one in determining what kind of CLV model you're going to use, is whether your business is in a contractual or a non-contractual setting. In a contractual setting, the death of a customer can be observed. These are your typical membership-like services: if someone does not renew their membership, you know they have left your business. These situations are often modeled using survival-based approaches. The non-contractual setting, which is the one that's really common among online retailers, is one in which the death of a customer is unobserved. In this case the lifetime of a customer will be a latent, unobserved parameter, and typically the customer lifetime distribution is modeled via exponential models. These are the Amazons of the world.

The second dimension that's really useful to look at is discrete versus continuous purchase opportunities. In the discrete case, purchases can only occur at fixed frequencies; imagine a magazine subscription, where you get the magazine once a month. A continuous purchase opportunity is one where a purchase can happen at any given time in the relationship. If you lay it out in a table, there are different business examples that fall into each of the four quadrants. Continuous purchases in a non-contractual setting: this is your online retailer, movie rentals, medical appointments, hotel stays. Continuous purchases in a contractual setting: Costco is a good example, because Costco knows when you leave the business (you don't renew your membership), but you can purchase at any given time while you hold that membership. Discrete purchases in a non-contractual setting: prescription refills are a good example, as are charity fund drives, which tend to happen at a fixed frequency, once a year for example, and event attendance is another good one. And in the discrete-purchase, contractual setting you have your canonical magazine and newspaper subscriptions, fitness clubs, and a lot of insurance policies also fall into that category.

I should mention, before we go on to talk about the model, that in this presentation we will be focusing only on the continuous, non-contractual quadrant, highlighted in red here. This is typically one of the hardest quadrants to model, and mostly probabilistic models have been used to model lifetime value for this particular business setting; we'll introduce the Pareto/NBD model later in the presentation. The CLV equation itself is very simple, and it applies to all four quadrants: it's really the total number of purchases each customer is going to make times the value of each of those transactions, at the customer level. Traditionally these two parts of the model have been attacked separately; they are typically two independent models.
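In symbols, and ignoring the cost side as the speakers do, the revenue-only CLV for customer $i$ decomposes into the two pieces that are traditionally modeled independently:

$$
\mathbb{E}[\mathrm{CLV}_i] \;=\; \underbrace{\mathbb{E}[N_i]}_{\text{number of future transactions}} \;\times\; \underbrace{\mathbb{E}[V_i]}_{\text{value per transaction}}
$$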
In our presentation we'll really focus on modeling the number of purchases, the total number of transactions a customer is going to make, because that is typically the hardest part to model. So let's jump into an introduction to the Pareto/NBD model. How many of you have heard of this model before? Okay, a few, that's good. This is a hierarchical Bayesian model, so I'm going to give you a brief introduction to the model and how you can train it on your data set.

Before talking about the model, let's figure out what kind of data you need to actually train it. This diagram illustrates it pretty well: you need a transactional data set, essentially a log of each transaction that every customer makes in your business. In the diagram you have different customers making transactions at different times; each transaction is represented by a vertical line, and the horizontal axis shows time going from left to right. In many businesses, customer lifetime value is modeled using very simple heuristics, and the classical example is just averaging past transactions and inferring that people are going to make the same number of purchases, or purchase at the same rate, in the future. In this particular case, if you used that heuristic you would say that the first customer is more valuable than the second one, because they've made more transactions in the past. But the second customer here tends to make purchases that are more spaced out in time, and is actually more likely to make purchases in the future than the first customer, who hasn't made any purchases in a long time. Probabilistic models are very good at capturing this heterogeneity in behavior among customers, and that's why they're preferable to simple heuristics. So essentially that's what you need, a transaction log, and the goal of the CLV model is to predict what's going to happen in the future for these customers.

The Pareto/NBD model uses essentially two dimensions to characterize the behavior of your customers, and these two parameters are latent, unobserved parameters that the model is going to constrain. The first one is the purchase rate, if you want, the number of purchases in a given amount of time, typically represented by the Greek letter lambda. The second is the lifetime, which again is not observed directly, represented here by the Greek letter mu. So really the purpose of the Pareto/NBD model is to come up with estimates of these two parameters at the individual level. The lifetime is represented by an exponential distribution, and mu is the slope of that exponential distribution at the customer level. On top of that, the Pareto/NBD model is a hierarchical Bayesian model, and the hierarchical part comes from the fact that we're going to use a prior distribution for each of these parameters, lambda and mu. These prior distributions are based on our belief of how the parameters are distributed in your customer population, your customer cohort. If you have a customer who has made very few purchases, it's going to be hard to constrain lambda and mu very accurately for that particular customer, but you'll be able to use your knowledge of the customer population to constrain them a little bit more. Whereas if you have a customer who has made a lot of purchases, the likelihood function will actually be driving the estimates of lambda and mu.
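Putting that structure into symbols (the gamma population priors are spelled out in the next paragraph), the hierarchy for customer $i$ is, roughly:

$$
\lambda_i \sim \mathrm{Gamma}(r, \alpha), \qquad \mu_i \sim \mathrm{Gamma}(s, \beta)
$$
$$
\tau_i \sim \mathrm{Exponential}(\mu_i) \;\; \text{(latent lifetime)}, \qquad
\text{purchases of customer } i \sim \text{Poisson process with rate } \lambda_i \text{ while } t < \tau_i .
$$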
To be a little more specific, the Pareto/NBD model makes assumptions about how the purchases are distributed for each customer. The assumption is that purchases follow a Poisson process, meaning the times of the purchases are somewhat random, and that the lifetime follows an exponential distribution. In this particular model, the priors on both lambda and mu follow gamma distributions. If you were to derive the likelihood function of this model, you would realize that you don't actually need all the transaction information for every single customer. What you need are essentially three quantities. The first is the recency, which is the time between the initial purchase made by a customer and their last purchase within a calibration window of time. Then there is the frequency; the frequency here is actually a misnomer, because it is the repeat frequency, the number of purchases the customer has made beyond the first one. In this example the customer has a repeat frequency of 8. The last quantity you need is the duration of your calibration period, the duration of the time period you're going to use to train your model, represented by capital T here. These quantities are known as the RFM data structure: recency, frequency, and, if you also include it, monetary value, which is where the M comes from; we're not going to use monetary value in this presentation. I'll let my colleague Ben show you how you can generate an RFM object very easily in a Jupyter notebook.

Thanks, Jer. In our example notebook, if you scroll down about a quarter of the way, you'll see the typical data set used in a lot of the literature on these models: the CDNOW data set. Does anyone remember the company CDNOW, or has anyone ever purchased a CD there? Okay, a couple, pre-Amazon; I think Amazon acquired them many years ago. This is a transactional data set; they sold CDs, so it's a continuous-time purchase pattern, since you can buy CDs whenever you want. The transaction log looks like this: you have a customer ID, a date, and a quantity purchased, and this is basically all you need to start working with these types of RFM models. If you have any kind of business where you're selling something, you probably have what you need to get started. Here we're using a Python library called lifetimes, contributed by Cameron Davidson-Pilon, who has written quite a bit about probabilistic programming using PyMC; this is another one of his projects. He provides utility functions for loading this data set, for constructing these benchmarks yourself, and for creating the RFM data set. Of course it's easy to do on your own, but this is what it looks like: we specify the customer ID column, a date when we want the training period to end, and the frequency bucket we want to group our transactions into. Since we're doing some kind of aggregation here, you want to select a frequency that won't obscure too many purchases: if someone purchases every week and you select a frequency of a month, you're going to group all those transactions within a month into one observation. So you want to select a granularity appropriate to your data.
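A minimal sketch of that step, assuming a transaction log like the one described (the column names `customer_id` and `date` and the cutoff date are illustrative, not taken from the notebook):

```python
import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

# transactions: one row per purchase, with a customer id and a date
transactions = pd.read_csv('transactions.csv', parse_dates=['date'])

# Collapse the log into the RFM summary: repeat frequency, recency,
# and the age T of each customer within the calibration window.
rfm = summary_data_from_transaction_data(
    transactions,
    customer_id_col='customer_id',
    datetime_col='date',
    observation_period_end='1997-09-30',  # end of the training period
    freq='W',                             # weekly buckets; match your purchase cadence
)
print(rfm.head())  # columns: frequency, recency, T
```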
That's creating the RFM object, and that's all the data you need. It's one of the reasons these models are really attractive: they don't require a particularly large quantity of data, and most people who operate this kind of business have this data readily available.

Thanks, Ben. Now let's talk a little bit about the training process. For those of you already familiar with machine learning models, this slide should be pretty intuitive. In practice, you're going to select a training period over which to train the model. We've run a lot of simulations internally, and we found that at least three times the typical inter-purchase time of your customers is the kind of duration you need to get good performance out of this model. Three times really is the minimum; we recommend five to ten times, which is definitely better. You train your model over that period, and then you validate it over a holdout period, and here we recommend at least half the size of your training period. What you definitely want to avoid is validating over a period of, say, a couple of weeks when your customers make purchases every few months on average; that holdout period is going to be pretty meaningless. So make sure your holdout period is of a size comparable to your training period. If you're satisfied with the performance of the model, you then take all of the data, from both the training and the validation periods, and forecast over a period that depends on the company's needs; it could be a year, or a few years, depending on the forecast your business actually requires.
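lifetimes also has a helper for exactly this train/holdout split; a sketch, reusing the same illustrative column names and dates as above:

```python
from lifetimes.utils import calibration_and_holdout_data

# Split the transaction log into a calibration (training) window and a
# holdout window used only to validate the model's predictions.
rfm_cal_holdout = calibration_and_holdout_data(
    transactions,
    customer_id_col='customer_id',
    datetime_col='date',
    calibration_period_end='1997-09-30',   # end of training window
    observation_period_end='1998-06-30',   # end of holdout window
    freq='W',
)
# Columns: frequency_cal, recency_cal, T_cal (training side),
# plus frequency_holdout and duration_holdout for validation.
print(rfm_cal_holdout.head())
```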
Now let's jump into the training of the Pareto/NBD model, because this section of the notebook has the actual specification of the model. You can find all of this in the papers we've linked, but I want to show the PyMC implementation. Other implementations out there typically use an analytical approximation, and we think this is a nice place to use Bayesian methods, in particular Markov chain Monte Carlo, because then we can approximate the full posterior of the model parameters and really gain an understanding of the uncertainty around our estimates, not just the average case. How many of you have used PyMC or heard of it? Okay, a few of you. It's an excellent way to get started if you're interested in building these kinds of hierarchical Bayesian models, which would really be very difficult to solve analytically; they're very easy to specify in this language, which is just Python, and you can build and compute things that would be essentially impossible otherwise.

To start off, because our likelihood function is not just an off-the-shelf distribution, we define a custom likelihood function. In many cases you'll be fine with whatever standard likelihood your model calls for; in a GLM, say, you'll have a normal likelihood. But in this case we implement the likelihood function from above as a custom continuous distribution. The two components are: we override the constructor with our two parameters, lambda and mu, and we implement the logp method, which represents the log-likelihood of the parameters given the data. For the data we follow the notation in the literature: x is the frequency, t_x (t underscore x) is the recency, and T is the length of the observation period.

Now let's get into the actual model. As mentioned previously, these are the parameters of the prior distributions, the two gamma distributions: for lambda, the purchase rate, we have r and alpha, and for mu, the lifetime parameter, we have s and beta. We give these non-informative prior distributions, and there's an interesting paper about how the half-Cauchy is convenient for that. Here we attach those priors to the individual-level parameters, and the key thing to note is that we've added the shape argument and set it equal to the number of customers in the data set, so we have one lambda parameter for every customer in the data set, and the same for mu. Then we reference that custom distribution, we start our sampling, and after a while it completes and we get some diagnostic plots.
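For reference, the individual-level Pareto/NBD likelihood (in Fader and Hardie's derivation) that the custom distribution encodes is

$$
L(\lambda, \mu \mid x, t_x, T) \;=\; \frac{\lambda^x \mu}{\lambda + \mu}\, e^{-(\lambda+\mu)\,t_x} \;+\; \frac{\lambda^{x+1}}{\lambda + \mu}\, e^{-(\lambda+\mu)\,T}.
$$

What follows is a compressed sketch of the model in PyMC3, not the notebook's exact code: for brevity it folds the custom distribution into a `pm.Potential` term instead of subclassing a continuous distribution, and the hyperprior scales are illustrative.

```python
import pymc3 as pm
import theano.tensor as tt

# RFM inputs, one entry per customer (from the lifetimes summary above)
x   = rfm['frequency'].values   # repeat purchase count
t_x = rfm['recency'].values     # time of last purchase since first purchase
T   = rfm['T'].values           # length of the observation period
N   = len(x)

with pm.Model() as pareto_nbd:
    # Weakly informative half-Cauchy hyperpriors on the gamma parameters
    r     = pm.HalfCauchy('r', beta=2)
    alpha = pm.HalfCauchy('alpha', beta=2)
    s     = pm.HalfCauchy('s', beta=2)
    beta  = pm.HalfCauchy('beta', beta=2)

    # One latent purchase rate and one dropout rate per customer
    lam = pm.Gamma('lam', alpha=r, beta=alpha, shape=N)
    mu  = pm.Gamma('mu',  alpha=s, beta=beta,  shape=N)

    # Pareto/NBD individual-level log-likelihood (log of the formula above),
    # written with logaddexp for numerical stability
    loglike = (x * tt.log(lam)
               - tt.log(lam + mu)
               + pm.math.logaddexp(tt.log(mu) - (lam + mu) * t_x,
                                   tt.log(lam) - (lam + mu) * T))
    pm.Potential('loglike', loglike.sum())

    trace = pm.sample(2000, tune=1000)
```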
We've also included a couple of functions that allow you to compute the key quantities of the model, including the probability that a customer is still alive at time T. Because this is a transactional model, we don't know for sure whether a customer is going to come back or not, so you can calculate the probability that they're still alive at some point in the future, conditioned on their previous purchasing behavior, as well as the cumulative expected purchases in some future period: how many more purchases are they going to make with you? That's obviously a really actionable quantity; you can use it to make predictions about future purchases.

Thank you, Ben. To get back to some of the actionable quantities you can extract from this model: what Ben showed you is how to train the model and extract the model parameters, but you may ask yourself why you should go through the exercise of finding all these parameters, and why training a model like this is useful for your business. There are a number of actionable quantities you can extract, and as Ben mentioned, one of the most important is the number of future purchases each of your customers is going to make. Here is one example: we have a customer who has made purchases in the past, we have trained the model between the initial purchase and now, and what we're trying to do is forecast the number of purchases this customer is going to make over a future period of time Δt. We do this for each pair of lambda and mu that we get in our MCMC chain; we get a pair of lambda and mu for every single customer at every iteration of the MCMC sampler. The quantity we're trying to evaluate can be decomposed into two expressions: the expected number of future purchases given the customer-level latent parameters (the values of lambda and mu) and the fact that the customer is alive at time T, times the probability of that customer being alive at T. This is represented by the expression here, where n is what we're looking for: the number of purchases made in a given amount of time Δt in the future, given the model parameters and the observables. These two components on the right can be derived analytically; I won't go into the details, it's actually pretty easy to derive, and the resulting expressions are very simple, so you can evaluate them for every single value of lambda and mu at every chain iteration. Let me also mention that the probability of being alive is very useful if you're trying to determine which of your customers are about to churn; it's a way to assess that as well. And now let's jump into the notebook a little bit and take a look at those actionable insights.

Sure. In the last section of the notebook, the model-checking section, we can use the prediction function we just talked about to do out-of-sample evaluation. lifetimes also provides a utility function for generating this: basically the RFM data with an additional calculation of what actually happened in the holdout period. Let's step through an example for a single customer; we'll choose the customer at the 150th index. Looking at their RFM record, we see that in the calibration period (the columns with the _cal suffix) they made two repeat purchases, and the most recent purchase occurred 17 weeks after their initial purchase, which gives their recency. Then we see that their frequency in the 39 weeks of our holdout period ended up being 3: two repeat purchases in the observation period and three in the subsequent period. To make a prediction for this customer, we extract the posterior distributions for lambda and mu from our trace, all 2,000 draws for that particular customer, and plug them, along with the 39-week window of our holdout period, into the predict function. For this customer we ended up predicting one additional purchase. So how did we do? We did okay; they actually made three purchases and we predicted one, but we did at least get that they were still alive, so we were on target there.

Now that we've stepped through one customer, let's take a look at the entire customer cohort. I wrote a rather crude loop over the entire array of posterior parameters, took the posterior mean predicted purchases for each customer, and placed that back in the data frame under "predicted". How did we do there? Actually pretty well: we observed 1,788 repeat purchases in the holdout period and we predicted 1,724, which is a pretty good result in aggregate. You can plot this as well; this is a standard diagnostic plot in a lot of the papers on this topic. We bin customers by the number of transactions in the calibration period and plot the average number of transactions in the holdout period, both observed and predicted, while the dotted line represents the simple heuristic of just extrapolating the previous purchase rate. You can see the model does a really good job of picking up the shape of purchasing and lifetime here. Most of the action is at the low end, since most customers don't make any repeat purchases, and the model does those really well. Out at the high end, many of these bins might contain only one or two customers, but even given the thinness of the data in terms of the number of customers, we've observed so much behavior from them that we can fit their purchase frequency very well in this data set.
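The two closed-form pieces the speaker alludes to are, for a customer with history $(x, t_x, T)$ and a posterior draw $(\lambda, \mu)$:

$$
P(\text{alive} \mid \lambda, \mu, x, t_x, T) \;=\; \frac{\lambda + \mu}{\lambda + \mu\, e^{(\lambda+\mu)(T - t_x)}},
\qquad
\mathbb{E}\!\left[N(T,\,T+\Delta t) \mid \text{alive}\right] \;=\; \frac{\lambda}{\mu}\left(1 - e^{-\mu\,\Delta t}\right).
$$

A hypothetical sketch of a predict function built on these, averaging over posterior draws (the array names are illustrative, not the notebook's):

```python
import numpy as np

def p_alive(lam, mu, t_x, T):
    """Probability the customer is still alive at the end of the
    calibration period T, given posterior draws of lam and mu."""
    return (lam + mu) / (lam + mu * np.exp((lam + mu) * (T - t_x)))

def expected_purchases(lam, mu, t_x, T, dt):
    """Expected number of purchases in (T, T + dt]: the conditional
    expectation given 'alive' times the probability of being alive."""
    return p_alive(lam, mu, t_x, T) * (lam / mu) * (1.0 - np.exp(-mu * dt))

# Example for customer i over a 39-week holdout, one value per MCMC draw:
# draws = expected_purchases(trace['lam'][:, i], trace['mu'][:, i],
#                            t_x[i], T[i], dt=39)
# draws.mean()  # posterior mean predicted purchases
```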
We can also calculate the RMSE across all individuals, which comes out to about 1.4. We have a bunch of links to further reading on lifetime value here, and in general these are some great books on Bayesian methods. I also wanted to point out that our platform customers at DataScience.com have access to a customer lifetime value library that Ben and I have been developing over the years, with a series of models that you can use in different business contexts and that we have validated over the years. So in summary, what we've learned today is what CLV models are and why they're useful, that there's really no one-size-fits-all (CLV models are applicable in very specific business contexts), and we've given you an introduction to the Pareto/NBD model, how to train it, and how to extract actionable quantities from it. Thank you. [Applause]

We have time for some Q&A, I think. Any questions?

Q: I wonder whether you would extend your model to take into account more information about the customers, for instance the types of things they're purchasing, or other features.

A: So the question is about how we can incorporate additional information, such as customer features. That's actually a great question. In one of the references we link to, there is another proposed model where, instead of the two separate gamma priors on the lifetime and purchase rate parameters, there is a multivariate normal prior that has a linear regression component, and that's where you can include those kinds of features. Out of that, based on the fit of the regression, you can get actionable insights into which specific attributes of a customer drive the estimates of those parameters. In this case the relationship between the features and lambda, the purchase rate, and mu is a simple linear regression, so you can extract coefficients and gauge the relative importance of each feature. People think of lambda in the CLV literature as the engagement parameter and mu as the loyalty parameter; how loyal a customer is, is a measure of mu in this case.
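A hypothetical sketch of that covariate extension, in the spirit of the referenced model but simplified to independent regressions on log-lambda and log-mu rather than a full multivariate normal prior (all names here are illustrative):

```python
import pymc3 as pm
import theano.tensor as tt

# X: (N, p) matrix of standardized customer features (assumed available);
# x, t_x, T as in the earlier model block.
N, p = X.shape

with pm.Model() as pareto_nbd_covariates:
    b_lam = pm.Normal('b_lam', mu=0.0, sd=1.0, shape=p)   # coefficients for log-lambda
    b_mu  = pm.Normal('b_mu',  mu=0.0, sd=1.0, shape=p)   # coefficients for log-mu
    sd_lam = pm.HalfCauchy('sd_lam', beta=1)
    sd_mu  = pm.HalfCauchy('sd_mu',  beta=1)

    # Customer-level log-rates centered on a linear function of features
    log_lam = pm.Normal('log_lam', mu=tt.dot(X, b_lam), sd=sd_lam, shape=N)
    log_mu  = pm.Normal('log_mu',  mu=tt.dot(X, b_mu),  sd=sd_mu,  shape=N)
    lam, mu = tt.exp(log_lam), tt.exp(log_mu)

    # Same Pareto/NBD log-likelihood as in the base model
    loglike = (x * tt.log(lam)
               - tt.log(lam + mu)
               + pm.math.logaddexp(tt.log(mu) - (lam + mu) * t_x,
                                   tt.log(lam) - (lam + mu) * T))
    pm.Potential('loglike', loglike.sum())
```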
Q: Have you thought about two issues: one is seasonality, and the other is general growth?

A: That's another really good question, and we've actually come across this in our work. Seasonality, and the general growth trajectory, are definitely important. Regarding seasonality, one way to help with that is to make sure your training period covers a full seasonal cycle. If that's not possible, for whatever reason, whether you don't have the data available or your company just hasn't been around that long, you could fit, say, a seasonal time-series model on top of this and then adjust your forecast using that. That's also what I would recommend for any kind of exogenous shock: say you spend a bunch more on marketing, which might change customer behavior in the future in a way that wouldn't be reflected in your training period; again, you could use a time-series model to adjust the predictions generated by this model. There are also published models that have addressed this particular issue of seasonality, where the values of lambda and mu can change over time: if you're in a period like the holiday season, for example, your purchase rate on average is going to go up, and these models can account for that. Generally, what we have observed is that in cases where your customers make only a handful of purchases or fewer, it's very hard to actually pick up that signal. There are also models out there like the Pareto/GGG, a variant of this model in which the Poisson assumption is relaxed and some regularity, or periodicity, in the purchases is allowed; so if people make purchases, for example, every week or every two weeks, the Pareto/GGG can account for those patterns.

Q: I've got two questions, one real quick: you use gamma priors; is that just because they're conjugate?

A: Yes. The Pareto/NBD model is almost 30 years old at this point, and I would add that the gamma prior is convenient for conjugacy.

Q: And the second one, going back to the two dimensions you mentioned, contractual versus non-contractual and continuous versus non-continuous: what if your business model is a combination of the two? For example, if you're running a game which has a monthly subscription but also in-app purchases, how would you model that?

A: That's a great question. I think in that case you really fall into the category where you're in a contractual setting and purchases can happen at any given time, so continuous purchases; that particular setting here. I don't want to go into too much detail, but we have a slide at the end of our slide deck, which you can access, where we list a series of models for every one of those four quadrants, so you can look at what has been done in the past. I think you would fall into this particular case, but there's a whole other dimension to your problem that we haven't talked about in this presentation: how do you assess monetary value? Customers will make transactions with different monetary values attached to them, and there are several models out there for that; probably the most well-known is the gamma-gamma model. We can talk about that later if you're interested. Any other questions? All right, thank you, guys. [Applause]
Info
Channel: PyData
Views: 54,084
Id: gx6oHqpRgpY
Length: 36min 26sec (2186 seconds)
Published: Mon Jul 24 2017