Marketing & Retail Analytics | Marketing & Retail Analytics Tutorial for Beginners | Great Learning

Captions
Most of the time, business entities sit on a huge pile of data about consumer choices and preferences, but lack the ability to translate that wealth of information into better decisions. Fundamentally, interpreting data at volume is a big challenge that marketers face. The Marketing and Retail Analytics course is designed to help marketers gauge marketing performance and its effectiveness in maximizing return on investment. This course teaches you the marketing techniques used to determine, quantitatively, which customers are the best ones, by examining factors such as recency of purchase, frequency of purchase, and how much a customer spends. You will leave this course with a concrete understanding of how to use marketing and retail analytics to predict outcomes and methodically allocate resources.

Before we get started, I want to introduce you to Great Learning Academy, a free initiative by Great Learning where you can access over 200 courses with more than a thousand hours of free content on trending, high-demand domains, absolutely free. Register now to complete the course and get your free certificate of completion; check out the link in the description of this video below. If you have not subscribed to our channel yet, hit the subscribe button and turn on the notification bell so that you don't miss any new updates or video releases from Great Learning. If you enjoy this video, show us some love and like it; knowledge increases by sharing, so share this video with your friends and colleagues, and comment with any queries or suggestions and I will respond.

Hello everyone, my name is Raghav Ramamurthy, and I'm going to be talking to you about marketing and retail analytics. Even though the course is named Marketing and Retail Analytics, and we will indeed talk a lot about the marketing and retail domain, the techniques and applications you learn can be applied across multiple domains; they are really domain-agnostic and industry-agnostic. Keep that in mind as you watch these videos: even though the examples are from marketing and retail, these cases and data sets can be applied across many domains. As long as you understand the concepts behind what we're doing and their business implementation, you will get the most value out of watching these videos.

Before I tell you a little about myself, let's talk about what marketing is. Pause for a moment and consider what marketing means in your own words. You are probably exposed to marketing concepts, strategies, and targeting at multiple points throughout your day, because every one of us is a consumer; if you're a human being, you're probably a consumer. Marketing touches every aspect of our day-to-day lives, directly or indirectly. So pause and consider all the different marketing strategies targeted at you. You might also be marketing yourself, and not necessarily as a product; it might be a service. You may have to market yourself to your boss, or to a
potential recruiter, or to a potential employer in an interview. So think about what marketing is in your own words. Multiple definitions are available; if you just google the term "marketing," you will get many hits. Primarily, marketing is selling the right product at the right place, at the right time, to the right people. That, in my words, is marketing.

A little bit about myself: I have an MBA in supply chain management, I worked in the US for about ten years, and I now run my own consulting practice. I specialize in data visualization, which is my passion, and I consult with companies on multiple aspects of analytics: marketing analytics, financial analytics, and more. I'm not restricted to a particular domain, but supply chain, as I said, is my background, and that's where I spent most of my nine-to-five career before I started my own consulting company. Teaching is my other passion; I teach on weekends and when I'm not traveling for consulting. About 70 percent of my time is spent consulting and about 30 percent teaching. My long-term goal is to flip that around and do more teaching and less consulting as I grow older; right now, consulting pays my bills, so I can't complain. Throughout this video series I will pull up some examples, talk about concepts, and show you some actual data I've worked with across multiple clients, so you can see how to apply those concepts in real-life, practical examples. That is the intent of this course: to see how you can practically turn your analytical knowledge into a business-oriented skill.

When you're looking for a job and trying to enter a new domain, which for most of you watching is analytics, there's always the classic question: do I need to specialize in a domain, or in a tool or technique? It's not an easy question to answer; it's probably one of the toughest decisions you'll face, because the choices are so many. Based on my experience (and this is just my personal view, based on what I've seen in the consulting world and how I interact with my clients and students), domain knowledge is good, but it mainly gives you foundational knowledge. This is specific to analytics: if you're trying to build a career in analytics, the concepts behind the terminology, the technology, and the statistical side of analytics are really critical, because once you understand the concepts, data is data, whether you apply it in a manufacturing setting, a service-based organization, finance, marketing, retail, anything. Data is data in its primary form. What really differentiates one domain from another is terminology: different industries have different terms
for what they measure. There are standard acronyms that will help you sound knowledgeable when you're trying to crack interviews, and there are metrics and KPIs that are specific to a particular domain or industry; they might even be specific to a particular company, which may call a metric by its own name. Those are the nuances to consider when acquiring domain-specific knowledge. In short: when deciding on a career, don't focus too much on the domain part. Yes, domain is important, but mostly at the basic level, where you understand the basic terminology, metrics, and acronyms. Beyond that, it comes down to how you apply your technical skills and your knowledge of analytical processes to practical, real-life applications. I'm not saying skip domain knowledge and learn only concepts; learn both, but you'll get a lot of practice as you start working. It's an iterative process: as you apply these techniques in a domain you're interested in, you build up that knowledge. Don't expect to read three or four books of domain knowledge and assume you're prepared for the rest of your life in that domain; that's not how it works. However, if you learn the concepts, tools, and techniques, which is what I'm going to cover in this series, that will equip you to apply them in whatever domain you work in.

Since marketing and retail might be a little new for some of you, let's get some quick terminology out of the way: commonly used terms you might come across when reading marketing analytics articles, when discussing marketing analytics, or even when talking to a potential recruiter (throwing out a few of these terms will make you come across as a confident person). I promise I do not have many text-heavy slides in my presentation; most of my slides are pictures, data, or a tool. This is one of the very few slides with a lot of words, so bear with me for a few minutes and then we'll jump directly into hands-on practice with data.

First up is SKU, the stock keeping unit: the single basic building block of inventory. One bottle of water is a single SKU; it is one unit of one specific product with specific dimensions. A 250 ml water bottle and a 500 ml water bottle have different SKUs, because the same product in different package sizes can carry multiple SKUs.

Next: FMCG and CPG. As you watch these videos, I recommend you pause for a minute before you hear me give each definition. Come up with a
definition in your own words and see whether it corroborates what I say. That way it's not strictly a one-way conversation. I know that's hard to do with a video recording, but try to pause before every new term pops up, write your best guess on a piece of paper, and then see whether it matches what I say. If you're not satisfied, google it, but that should be the last option; first try to think it through in your own words.

So, FMCG and CPG: what do they stand for, and what is the difference? FMCG stands for fast-moving consumer goods; CPG stands for consumer packaged goods. Are they the same or different? They are the same thing. FMCG is what we call it in India; CPG is what the same concept is called in the western part of the world, in the US and the UK. Other than the name, both mean exactly the same thing: high-volume, low-margin, high-turnover products that we consume day to day. Most of the items you see in grocery supermarkets are FMCG items.

Shrinkage: what is shrinkage? A couple of different thoughts come to mind when you throw out the word, but most commonly it refers to inventory: mismatches between what is on hand and what should be available according to the books. Take a pause and think about the possible causes of shrinkage. Say you're managing a retail warehouse and you're expected to tally the stock. The math should be straightforward: you know your starting balance, you know what you sold that day, you know what was returned, and if a new shipment came in, you know what arrived. Simple addition and subtraction should give you the expected on-hand quantity, and the difference from the physical count is shrinkage. Yet it's a common and very serious problem that the on-hand inventory does not tally with what is on the books. There are valid reasons: bookkeeping errors, manual typing errors, process errors. Those are accountable, and most retailers do budget for some percentage loss from shrinkage, because if the counts don't tally, the difference ultimately has to be written off, and that has a financial impact. The impact becomes much greater when shrinkage goes beyond the accepted level. And there is a common problem that many retailers face, called pilferage: people do steal stuff, and it's a very real threat to retailers. Humans are remarkably adaptive when it comes to working around processes (if we put even ten percent of that effort into working with the processes, our data problems would be solved, but that's another story altogether). So shrinkage is important, especially in the context of pilferage. A lot of analytics goes into establishing a baseline shrinkage level, beyond which an alert triggers you to dig deeper and find out, based on the data, why there is a mismatch between on-hand and book availability. If the data cannot explain it, you are probably staring at a pilferage scenario.
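The bookkeeping arithmetic described above is easy to sketch in code. Here is a minimal example (the quantities and the 2% tolerance are hypothetical, purely for illustration) that computes expected on-hand stock from the books and flags shrinkage beyond an accepted threshold:

```python
# Minimal shrinkage check: book math vs. physical count.
# All figures are hypothetical illustrations, not real data.

def expected_on_hand(opening, received, sold, returned):
    """Expected units on hand according to the books."""
    return opening + received + returned - sold

opening, received, sold, returned = 1000, 250, 400, 20
physical_count = 845  # what the stock-take actually found

book_on_hand = expected_on_hand(opening, received, sold, returned)  # 870
shrinkage_units = book_on_hand - physical_count                     # 25
shrinkage_rate = shrinkage_units / book_on_hand                     # ~2.9%

ACCEPTED_SHRINKAGE = 0.02  # retailer's budgeted tolerance (assumed)
if shrinkage_rate > ACCEPTED_SHRINKAGE:
    print(f"Investigate: shrinkage {shrinkage_rate:.1%} exceeds tolerance")
```

When the rate stays under the budgeted tolerance, it is written off as ordinary loss; when it exceeds the baseline and the data cannot explain the gap, that is the pilferage trigger described above.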
Next, RFID. Think about two things here: what is the definition of RFID, and what does it really mean in practice? RFID is an acronym for radio frequency identification. It's a chip-based technology, relatively mature in terms of how companies use it: it was new about a decade ago, but nowadays it's used very extensively in retail. Initially, RFID was used primarily to track inventory: where exactly are my items at a specific point in time, how have they moved through my system, how much do I have of a specific item, do I need to reorder, should I hold my existing orders, do I have a backlog? Those were the questions people answered with RFID. But its use has grown exponentially, and retailers are finding more creative and practical applications. A recent example: suppose you walk into a grocery or department store and your shopping cart is fitted with an RFID tag. Think about the possibilities. I can do trip segmentation: based on where you go throughout my store, I can track your movements and your routing, see where you go first and where you go next, or spot that you're one of those people who keeps circling the same aisle because you can't decide which product to buy. That's just the tip of the iceberg in terms of what you can do with RFID technology to your advantage as a retailer.

Relationship retailing: this is a concept a lot of companies are trying to embrace nowadays, because consumers are kings; actually, they're beyond kings, they're emperors now. Companies have started to place a lot of emphasis on customer loyalty and repeat purchases, so the term relationship retailing has gained a lot of significance recently. The idea is this: historically, if you went to a particular store and ordered something, you as a consumer were treated as a transaction. Nowadays, you are no longer treated as a transaction or as a first-time consumer. If I can track you with a loyalty program and entice you with personalized offers, I can create targeted messaging, specific promotions, and specific campaigns aimed at you as a loyal customer. Those are things I can do, from a retailer's perspective, to build my brand loyalty with you as a consumer. That's one part; the other idea is
the long-term interaction. When you come to me as a consumer for repeated transactions, I don't view you as a single transaction; I treat you as a relationship. I don't care if you buy just one rupee's worth of product now; as long as I have you in my system, I'm almost guaranteed that you will come back again if you're satisfied with the experience I provide, over and over, and throughout the lifetime of my engagement with you I'm going to reap a lot of rewards. Ultimately that's a win-win: you will probably be rewarded for your loyalty with specific promotions, price offers, and targeted deals. Most travel loyalty programs reward loyal customers this way; they engage in a lot of relationship retailing, and retailers have started to pick up the concept, taking their cue from the travel industry. Hotel chains are known to excel at relationship retailing, and one renowned example is the Marriott chain.

There's a story here. It's probably anecdotal, and I'm not sure about the exact details, but I've heard it mentioned in a couple of cases and a couple of professors have talked about it. A guest checked in at a Ritz-Carlton somewhere (I don't know in which part of the world), Ritz-Carlton being one of the premier brands of Marriott. He had a sore throat and didn't know what to do, so he called his mother, who told him to take a glass of hot milk with turmeric and he wouldn't need any other medicine. He thought nothing of it, ordered a warm glass of milk with turmeric from room service, drank it, and sure enough, after a night's sleep his sore throat was gone. He went about his work, checked out, and traveled back home. A couple of months later he returned to the same hotel, checked in, was greeted wonderfully, and once he was in his room he got a call from the front desk: "Sir, would you like your favorite drink of hot milk and turmeric?" He was taken aback. It was not that he was a big fan of milk and turmeric, but the fact that they remembered the choice he made on a previous visit, and the fact that they acted on it, made him a Ritz-Carlton customer for life. Think about this: a lot of companies collect data about consumers, about when you come and what you do, and they know the exact nature of your transactions at the minutest level. But how many companies take action on it? That's what differentiates a company from good to great: a good company collects all the data; a great company collects all the data and, more importantly, acts on it. That is an example of relationship retailing.
So when you have a consumer, think about the different ways you can entice them based on nothing but simple historical data. You're not doing anything extraordinary: all you're doing is collecting preferences, shopping habits, and transaction history from previous visits, and then tailoring a unique, personalized experience for that consumer. That is what relationship retailing is.

The next term is private label. Again, if you haven't heard the term, take a moment and take a guess. Private labeling is about taking the same physical product and packaging it differently. Say I'm a manufacturer with a fantastic product, so good that multiple brands want to own it. I can make the same product and, depending on who pays me, put their sticker and brand on it. For example, if I produce sugar, I might package it under the Reliance brand for Reliance, and I might also create a private-label, in-house store brand for Big Bazaar. The product is the same, but each is an in-house brand for a particular retailer. That's private label: the physical product is the same, but the branding differs, and it can be a renowned brand or a house brand belonging to the local store owner.

POS, point of sale: point-of-sale data is probably the single biggest volume of transactions collected in a retail setting. Think about when POS data gets collected. You go to a grocery store and buy your fruits, vegetables, milk, and cereal; at the point where you check out, whether you pay cash or swipe a credit card, the billing system captures each and every item you bought and the time you bought it. If the store has a loyalty rewards program, the counter staff probably ask for your phone number; that's them trying to track your repeat purchases. Or if you have a loyalty card that gets you discounts, that card effectively carries your demographics: where you live, where you work, your marital status, your age, your preferences, and so on. The point at which your data is captured at the checkout counter is POS, the point of sale.

Product life cycle: this one is important. The concept is that every product has a shelf life; for some products it's measured in days or weeks, for others in months or years, depending on the maturity of the market and the product. Every product goes through a life cycle with four primary stages. First is the introductory phase, where everyone is excited and you have the early adopters who want to try the latest iPhone or the latest technology available, no matter the cost. They'll be willing to wait in
queue for ten hours, or they'll burn the midnight oil with ten computers open so that the moment the online sale opens at 12:01 they're already submitting their pre-filled information. Those are your die-hard fans; that's the introductory phase with the early adopters. Then comes the growth phase, driven largely by word of mouth. If the product is really good, you can expect steady growth; not as explosive as the introductory buzz, but the curve is still pretty steep as sales grow over time. Then comes maturity, where you're nearing saturation: people are no longer excited to try your product, but they still buy it for specific needs or on word of mouth. The curve isn't steep anymore; it hasn't flattened yet, but it's approaching a flat state. The final phase is decline: you're not seeing double-digit growth anymore, not even single-digit growth; you're seeing fractional growth and falling sales, and that's a sign for the company to do something drastic.

The point I'm trying to make is that every product's life cycle is unique, and companies invest a lot in figuring out where their product sits in its life cycle, because they want to adopt different marketing and promotion strategies depending on that position. In the early stage, all you care about is word of mouth, the buzz, building energy around the new product. In the growth phase, it's about consolidating your early wins; your strategy can still be based on high volume and high margin, because that's where you can skim your profits. In the maturity stage, you need to start thinking about cutting costs: maybe you want to maintain margin, but you have to balance whether to sell more volume at a reduced price or maintain your brand image and extract as much as possible from the product until it reaches decline. Finally, in the decline phase, you've got to face reality, and timing is critical. Understanding is one thing; taking action is a completely different thing. You have to have the courage to accept that your product is declining and do one of two things: either revamp and relaunch it as a different product, or accept the decline and phase it out gradually so that you minimize your losses, reducing inventory and offering promotions and bundles along the way. A lot of companies fail to see this. A classic case is Jet Airways. I was traveling this morning for this recording on SpiceJet, and sitting at the terminal I
saw a lot of Jet Airways planes grounded, much like the Kingfisher Airlines story: multimillion-dollar planes, assets, just lying to waste, exposed to all possible weather conditions. One plane still had the Jet Airways logo with a new SpiceJet sticker over it; that's a recent example that struck me. The point is that Jet Airways failed to notice the decline of their brand, and by the time they realized it, it was too late; the product had completely declined. Like I said, it's easy to say in hindsight, but the hardest thing is to catch it, and even if you catch it, the hardest thing is to actually take action and arrest the slide. That was the product life cycle.

Another term you'll come across in marketing and retail is wallet share. We've all heard about market share: the percentage of the entire market volume a company hopes to capture. Wallet share comes from the opposite end of the spectrum; instead of looking through the lens of the company, it looks at share from the consumer's perspective: the total potential that I, as a consumer, am going to spend. Take an example: say I buy this mouse from Croma. I get tagged as an electronics consumer, but my wallet is not limited to this one purchase. If I spend 2,000 rupees on the mouse, that doesn't make me a 2,000-rupee consumer. I might also buy a keyboard, but not from Croma; I could shop around and buy it at Reliance Digital, or order it online from Flipkart. I might buy different pieces from different stores, and if my spending potential is good, I might buy a laptop from Amazon and my wallet grows substantially. So it's very important for retailers to understand what consumers are doing not just in their particular store, but what their predicted spending potential is across all stores, throughout their lifetime, for a specific category. That's the concept of wallet share. What retailers are basically trying to get at is predicting the consumer's net worth to them: how much is this consumer worth, and how much revenue can we capture from them? That's the underlying question retailers try to answer.

Another term is market basket. You've probably come across it somewhere; again, pause for a minute and write down, in your own words, what market basket means. The idea is straightforward: it's a prediction of what you will buy based on what you've bought in the past. Say I go to a grocery store and buy milk, cereal, honey, water, and eggs. If I buy milk, honey, and eggs, then in all probability I will also buy bread. That's how I behave; you've detected my consumption pattern. How is that useful? Think about the power
of this and what you can do with it. If you're a retailer and you see I've already put milk, eggs, and honey in my basket but not bread, and you've detected that I bought bread in the past, you could entice me with a coupon or an offer, or simply remind me by placing bread right next to the milk or the eggs; that simple visual cue will prompt me to buy it. There are many techniques retailers use to exploit market basket analysis, but conceptually it is this: if your total basket typically has ten items and you've put in six or seven of them, what is the chance that you'll buy the eighth item as well? Market basket is also the basis of recommender systems: think of movie recommendations on Netflix, or online shopping on Amazon or Flipkart, where you put something in your cart and before you reach checkout you get a message saying that customers who bought this mouse also bought this keyboard; would you like an offer to bundle both together, and here are some options other consumers have bought. Retailers employ and play with these offers to maximize their sales and profits, and maybe, in the process, give you a better deal as well. We'll talk more about market basket later, when we take an actual POS data set and see how to calculate the market basket and come up with valid recommendations from the basket transactions.
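To make the market basket idea concrete before we get to the real data set, here is a minimal sketch of the classic support/confidence calculation over a few toy baskets. The transactions and the chosen rule (milk, eggs implies bread) are invented purely for illustration; a real analysis would score every frequent item pair:

```python
# Toy market basket analysis: support and confidence for one rule.
# Baskets below are invented examples, not real POS data.
baskets = [
    {"milk", "eggs", "honey", "bread"},
    {"milk", "eggs", "bread"},
    {"milk", "cereal"},
    {"eggs", "bread", "water"},
    {"milk", "eggs", "honey"},
]

antecedent, consequent = {"milk", "eggs"}, "bread"

n = len(baskets)
has_antecedent = [b for b in baskets if antecedent <= b]
has_both = [b for b in has_antecedent if consequent in b]

support = len(has_both) / n                       # P(milk, eggs, bread)
confidence = len(has_both) / len(has_antecedent)  # P(bread | milk, eggs)

print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.40, confidence=0.67: two out of three milk-and-eggs
# shoppers also bought bread. If confidence is high, shelving bread
# next to milk/eggs (or couponing it) is a reasonable action to test.
```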
I said I had just one slide with a lot of words. I must confess I lied: I have two slides with a lot of words, but I promise no more than that. So let's continue with the terminology, get it out of the way, and then we'll be ready to deep-dive into the data set.

Mystery shopping: this is something pretty recent, at least in India; in some bigger cities we're starting to see the trend, but pan-India it's still picking up. It's very popular in western markets. If you don't know what mystery shopping is, and you love shopping, you'll love this. The idea works like this. Say you're a company and I'm a consumer. You pay me, say, a hundred dollars, and you give me a simple task: go shop in this particular department store. Sounds good; my next question is, what's the catch? You tell me there's no catch: here's a hundred dollars in credit, buy whatever you want, and the icing on the cake is that whatever you buy with that hundred dollars is yours to keep; you don't need to return anything. I'm skeptical; I ask again and again, what is the catch? Why give me free money to shop and let me keep all the goods? Then you tell me: you just need to do me a simple favor. Here it comes, I think, there's the catch. You hand me a script to follow as I shop: give this particular sales associate a hard time; try buying a product and then returning it, and gauge how responsive customer service was; walk multiple aisles and ask questions, and note whether the store assistants helped you find your products properly. Those are the things you track from a consumer's perspective and report back to the company in return; that's how they make your hundred dollars work for them. That's the idea behind mystery shopping. There are variations: sometimes there's a checklist, sometimes not, but it's done invisibly, in the background, so that in most cases the sales associate helping you does not know you're a mystery shopper. That's the point: it's a real way to gauge what's actually going on inside the store. There's a TV series similar in concept called Undercover Boss; if you haven't heard of it, look it up, I think it's available on Netflix or Amazon. It's mostly US-based: a CEO goes in disguise, spends a week on the floor across multiple locations throughout the country, and tries to find out what the problems are in their stores: is there a cultural problem, an employee attitude problem, and so on. Similar to mystery shopping, just done in a more elaborate way.

In fact, when I was working in the US, I signed up as a mystery shopper, partly just for the sake of it, and of course, who says no to free money? I soon realized I had bitten off more than I could chew. It ended up taking far more of my time than I anticipated, and by the time I wanted to walk out I was too committed to the process. I don't even remember what I bought, but it was a horrible experience, simply because I didn't have the patience to go through it. So, sorry to burst the bubble: as glamorous as it sounds, mystery shopping is a lot of hard work, especially for someone like me who hates shopping. Maybe those of you who
are watching and love shopping will enjoy it, but be prepared to work a lot. Let me give you a brief idea of what happened to me; it's not that straightforward. There's an online application. I signed up, and it asked basic demographic questions: location, occupation, income, and so on. They have some filtering criteria, five or six basic demographic characteristics. Apparently I qualified, because I was called into an online assessment: about fifty questions that took me close to a couple of hours, because they were all scenario-based, multiple-choice questions: if you spotted this, what would you do; if you were presented with these three choices, what would you buy; and so on. That should have been my first clue to walk out, but it wasn't. After I cleared that round I felt lucky: out of a thousand participants I was in the top 200, and I thought, maybe I'm special, it's almost like clearing interview rounds. My ego got the better of me, so I kept going, even though in hindsight losing two hours of my weekend to fifty-odd questions should have been reason enough to quit. I cleared that too, apparently, because I was next called for a focus-group interview, an audio call. From the 200 participants they narrowed it down to about 60, split into three or four focus-group sessions. That should have been another clue, and still it did not deter me from dropping out; I went through the process. They gave me a choice of timings and I chose a slot after work, nine to ten-thirty in the evening, blocking out my time for this mystery shopping call. I was on the phone with about fifteen other people, and the moderator was pretty harsh; he reminded me of a milder version of Arnab Goswami, if there is someone milder than Arnab Goswami. When you weren't talking it was, "Sean, we've not heard from you for the past two and a half minutes, do you have anything to say?" And I thought, I don't want to just blabber; with fifteen people all talking at the same time it was, well, hilarious now, though at the time I didn't know what I had gotten myself into. That should have been my fourth clue, and I still didn't walk out; I went through the whole process. Fortunately for me, I was not selected, which turned out to be a good thing. I later heard from a friend who was selected: they had to go through another two rounds, fill out questionnaires, and when they finally got the actual gift card to spend in the store, it took them almost an entire day to walk the store layout, answer specific questions on the checklist, and keep checking with their lead to make sure they'd turned in all the
paperwork before they could go ahead and actually shop. Bottom line, long story short: mystery shopping is not as glamorous as it sounds, but if you're in it for the long haul, with nothing else to do over your summer holidays or a couple of days off from work, go ahead and do it if it's offered to you. That was the end of mystery shopping for me, but it raised an important question: why does a company go through all this hassle, spending so much time, energy, and money? Why can't it get the same information from surveys? Because surveys are not that authentic. It's still worth it for companies to pay for this process, because they get far more actionable insights about a particular store. I thought that was interesting. Okay, we've spent enough time on mystery shopping; moving on.

Loss leader: think about the term and whether you've heard it anywhere. A loss leader is the idea of selling something at an extremely low price, so low that it grabs the attention of all consumers. Take an example: say you're buying this phone; it's a Redmi, a Note 6 or Note 7 Pro or something. I search for it online, and the top two choices from my Google results are Amazon and Flipkart. Flipkart charges me fifteen thousand rupees for the phone, everything included; I think to myself, what a great deal. Amazon charges me fourteen thousand four hundred rupees. Can this really be true, when the same exact product at a brick-and-mortar shop would cost me over 17,000?
Flipkart offers 15,000 and Amazon beats their price at 14,400 rupees. So what would you do? It's a no-brainer: same product, and the only difference is who's selling it, so I gravitate instantly to Amazon, get it for 14,400 rupees, and I'm extremely happy. But do I stop with the phone? No. I need a back cover case, because I want to protect my fragile phone; I'm clumsy, prone to dropping and scratching, so I need a screen protector too. Do I stop there? I need a car charger, an extra wall charger, maybe an extra set of headphones, maybe an SD card. Think about all these extra accessories. Since I'm buying everything from Amazon, the accessories push my bill from 14,400 to close to 18,000 rupees. That's still fine, but what I probably did not realize is that all those accessories I bought from Amazon were likely more expensive than if I had bought them from Flipkart. In this case, the phone was a loss leader for Amazon: they priced it low so that I would naturally gravitate to their site, knowing that I probably would not stop with the phone and would end up buying the accessories too. Next time you're comparison shopping, check a set of products across multiple retailers: you'll probably see one product priced slightly lower and the other products priced slightly higher at the same store. That's how retailers compensate themselves; a small trick I learned. That's the idea of a loss leader. It's nothing new, and there's nothing unethical about it; it's a very clever strategy, and it's your responsibility as a consumer to price-shop. You price-shopped for the phone, so what stops you from price-shopping for every accessory? It's entirely up to us to avoid the trap.

A concept very similar to loss leader is cross merchandising. Retailers know that when you buy a specific item, you often also end up buying something else, so they bundle associated products together to maximize revenue or improve the consumer experience. As you start working with these concepts, you'll see there's a fine thread connecting the dots: loss leader, cross merchandising, and the market basket we talked about earlier all come together once you start working with data.
Inventory turnover is a slightly different concept. It's more of a financial metric, but it's very relevant in retail establishments, and it's usually expressed as a number: 1, 2, 4, 5, and so on. If I say the inventory turnover for a specific product is five, that means I turn that inventory five times during the year. In other words, I buy the product, sell all of it, buy it again, sell all of it, and repeat that cycle five times in one year: that gives an inventory turnover of five. As a financial metric it measures a couple of things. It's a measure of how efficiently you're turning your product, so it has cash flow implications, and it also has popularity and stocking implications from a retailer's perspective. So this one metric touches multiple functions of analytics: inventory management, which is supply chain analytics; financial analytics; and marketing analytics, because it indicates how popular your product is and the rate at which it's selling. A pretty powerful concept. Obviously, you want inventory turnover to be as high as possible, and the intent, especially for supply chain folks, is to find the right optimum balance: stock just the right amount so that it turns over in exactly the time it takes to be consumed, so that you don't lock up a lot of cash and you're ready to reorder at exactly the right point before the next order arrives. Benchmarks differ across industries: I've worked in industries where an inventory turnover of one is considered fantastic, and in industries where a turnover of 20 is still low. The numbers are relative, depending on the industry, the market, and the competitive environment, so keep benchmarks in the back of your mind when interpreting turnover.
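A minimal sketch of the standard turnover calculation, using the common cost-of-goods-sold over average inventory form (all figures are hypothetical illustrations):

```python
# Inventory turnover: how many times stock is sold through per year.
# Standard formula: cost of goods sold / average inventory value.
# All numbers below are hypothetical illustrations.

cogs_for_year = 5_000_000       # cost of goods sold over the year
opening_inventory = 900_000     # inventory value at start of year
closing_inventory = 1_100_000   # inventory value at end of year

avg_inventory = (opening_inventory + closing_inventory) / 2
turnover = cogs_for_year / avg_inventory   # 5.0x here

days_of_supply = 365 / turnover            # how long average stock lasts
print(f"turnover={turnover:.1f}x per year, ~{days_of_supply:.0f} days of supply")
# Whether 5.0x is strong or weak depends entirely on the industry benchmark.
```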
Churn rate: this is one of the most important marketing metrics; in fact, it matters in any analytics that touches consumers or customers. No business that involves a consumer can survive without calculating churn rate; it's become a household name in analytics across domains, because whether you sell financial products, supply chain products, or banking services, you ultimately make money from one entity: the consumer, the end user who generates your revenue. If a business involves a consumer or customer, it has to involve a churn rate. So what is the churn rate? The simple definition: the rate at which my consumers leave me, or, looked at the other way, the length of time a consumer stays with me. The minute I lose a consumer from my website, my services, or my products, something bad is going on and I need to take care of it. It's a very important metric, and we'll look at multiple ways to calculate churn rates a little later. A lot of companies spend a lot of energy on churn, and they do a couple of things with it. One, they calculate churn rate as a historical measure and use it to predict the rate at which consumers are going to churn. More importantly, they try to go beyond prediction and be prescriptive about it: understand the factors that cause customers to churn, and hence identify the factors the company can control to stop or reduce churn. It's a pretty significant metric, and one of my favorites.
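A minimal sketch of the basic churn rate calculation for a single period (the counts are hypothetical; real analyses break this down by cohort and by month):

```python
# Basic churn rate for one period (e.g., one month).
# Counts are hypothetical illustrations.

customers_at_start = 2_000
customers_lost = 130          # customers who left during the period

churn_rate = customers_lost / customers_at_start
retention_rate = 1 - churn_rate

print(f"churn={churn_rate:.1%}, retention={retention_rate:.1%}")
# churn=6.5%, retention=93.5% for this period. Tracked month over
# month, a rising churn rate is the trigger to go prescriptive:
# find and fix the controllable factors driving customers away.
```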
The four Ps of marketing: if you've done any basic business course or reading, you've probably come across the four Ps. The concept was established a long while ago, in the early 1960s and '70s, but it's still very relevant to what we do now. The four Ps are product, price, promotion, and place. Think of them as the four pillars of marketing: no marketing strategy is complete unless you take care of all four. You have to have a product, apply the right promotion strategy, sell it in the right place, and price it correctly. Nowadays there are expanded versions; I've seen publications invoking nine Ps, fourteen Ps, sixteen Ps, and so on, but the concept is still basic: whatever variation you start with, it boils down to these four pillars. And when you're working with a service rather than a physical product, like Ola, for instance, just substitute product with service; you still have to deal with the same questions of price, place, and promotion.

CLV: I'll pause for a moment to see if you can guess the expansion. It stands for customer lifetime value. This is the metric that transforms transaction-based marketing into relationship-based marketing. If I'm a consumer who buys a product or a service, I'm evaluated not just on that single transaction; instead I'm assigned a score, and that score is a culmination of all possible future transactions I will make as a consumer throughout the lifetime of my association with you as a company, a product, or a service. Think about how powerful that is: you assign a monetary value to me based not on one transaction but on all the transactions I'm expected to conduct over my entire relationship with you. We'll talk later about the specific formulas that calculate customer lifetime value, but more important is what companies do with that number. CLV is typically measured as a monetary metric, in rupees or dollars or whatever currency you use. The idea is: what is the entire worth of you as a consumer? Based on that, I can adopt different marketing and pricing strategies depending on your association with me. CLV is a very powerful metric. And before I move on: one of the elements used to predict your lifetime value as a consumer is the churn rate. You're going to find churn rate inserted, in some fashion, into almost any analytics you do with consumers, and CLV is one of them. There's a direct relationship between churn rate and CLV, because churn prediction plays a very significant role in predicting customer lifetime value.
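One common textbook simplification (one of several CLV formulas, and not necessarily the exact one this course derives later) values a customer as margin per period times a retention multiplier. A minimal sketch with hypothetical numbers:

```python
# A common textbook CLV simplification (one of several formulas):
#   CLV = m * r / (1 + d - r)
# where m = margin contributed per period, r = retention rate
# (1 - churn), and d = the discount rate for future money.
# The numbers below are hypothetical illustrations.

def clv(margin_per_period, retention_rate, discount_rate):
    return margin_per_period * retention_rate / (1 + discount_rate - retention_rate)

m, r, d = 500.0, 0.935, 0.10   # e.g., Rs. 500/month margin, 93.5% retention
print(f"CLV ~ Rs. {clv(m, r, d):,.0f}")
# Note how sensitive CLV is to retention: the same customer at 80%
# retention is worth far less, which is why churn drives this metric.
```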
So there are a lot of applications. Once you get these analytical concepts, you can put them to fantastic use on any problem, any process issue you're trying to solve. Think of this as building a toolkit — a quiver of different arrows you can draw from to tackle the problem at hand. P(Alive) is one of those arrows, and so are CLV and churn rates.

Segmentation. Think about the word: it's a very commonly used term in marketing. There are multiple ways you can segment your consumers, your products, your services. Think of segmentation as a way to group things, based on certain characteristics or certain behaviors. There are a lot of applications, but ultimately the idea is this: if you have 50,000 products, not all of them are equal, and there is some sort of priority among them. That's the underlying concept of segmentation. How do you assign that priority? You apply one of the segmentation tools. In fact, we're going to start with one of the most commonly used segmentation techniques, called RFM, which is the next slide I want to talk about, and later we'll move on to a more advanced form of segmentation that involves clustering. Clustering is another form of segmentation, but just know that segmentation is a very broad term with multiple techniques behind it. The most common things to segment are consumers and products, but like I said, this can be applied across many scenarios. Segment employees, and there you go: HR analytics — the same concept in a different setting, where your "consumers" are employees, and suddenly you're an expert in HR analytics. Segment mutual funds, and suddenly you're a financial analytics expert. Apply a segmentation strategy in one domain, then take a different domain's data set and apply it there; it's almost a cookie-cutter exercise. That goes back to the example I was alluding to earlier, about how each of these techniques is applicable across multiple domains. The interpretation is what matters more. The techniques themselves are not the hard part — you can learn the techniques and tools from any video, including a video like this. The more important thing is how you apply a specific technique in an actual implementation, and that's what I'm going to keep stressing throughout the rest of this video series.
So, like I said, the next thing we're going to talk about is RFM, which stands for recency, frequency and monetary, and it's one of the simplest ways to segment data. We'll start with a simple segmentation to get your comfort level up with handling data, and then move on to slightly more advanced, more nuanced ways of looking at data from a marketing and retail analytics perspective. So let's jump into RFM analysis. RFM, like I said, is recency, frequency and monetary, and it's probably the most common and simplest way to achieve segmentation. You can segment either products or consumers; in this case we'll take a real data set and see how we can apply RFM analysis to segmenting customers. The analysis has three elements: recency, frequency and monetary. Here's the idea. Say you, as a consumer, come to my store, buy something, pay some money, and leave. If I treat you purely as a transaction, all I'm concerned with is the monetary value — how many rupees you spent at my store. But a lot of companies want to go above and beyond that transactional view and build something more like a relationship. That's where the other two metrics come in: by tracking your recency and your frequency, I get a fair idea of who you are not just at the transaction level but from a relationship perspective. So: how recently did you last visit my product, my store or my processes; how many transactions do you make in a given time period; and ultimately, what is the monetary value of the transactions you make at my store or my business? That's RFM analysis.
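Once the three per-customer numbers exist (we'll build them from the data set below), a common way to turn them into segments is to bin each metric into quintile scores. This is a hypothetical illustration of that final step, with made-up values:

```python
# Hypothetical last step of RFM segmentation: quintile scores 1-5 per
# metric, concatenated into a segment label (all values invented).
import pandas as pd

rfm = pd.DataFrame({
    "Recency":   [5, 40, 200, 12, 90],               # days since last purchase
    "Frequency": [12, 3, 1, 8, 2],                   # number of invoices
    "Monetary":  [2400.0, 300.0, 25.0, 980.0, 150.0],
})

# Low recency is good, so its labels run 5 -> 1; high F and M are good.
rfm["R"] = pd.qcut(rfm["Recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["Segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```

A customer scored "555" is recent, frequent and high-spending; a "111" has likely churned. The scoring scheme (quintiles, and the 1–5 direction) is a convention, not the only choice.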
Let's try to do this with an actual data set, see how we come up with the RFM metric, and once we've done that, spend a few minutes talking about applications — even though this is a marketing application, RFM analysis can be extended across other domains and industries as well. So I'm going to open up this data set and give you an exploratory overview of it. The data set is in this folder; it's customer data at the transaction level. Let me zoom the screen a little. Every record is one item on a transaction, meaning a specific item was purchased as part of a particular invoice. So it has an invoice number; the item has an SKU, a stock code, and each stock code is associated with a description of what the item actually is; then the quantity bought of that particular SKU, the invoice date, the unit price, and the customer ID. One customer can have multiple invoices, each with multiple products, so every transaction — or at least most of them — is associated with a customer ID. And finally, the country the customer belongs to. These transactions come from all over the world, many of them European; for the purpose of this exercise we're going to limit ourselves to consumers from the UK. Let's see how we can do that.

First, a recommendation: once you get access to this data and open it up, pause for a minute and try some exploratory analysis of your own, because that's what I'm going to do for the next few minutes — open it in Tableau and do some exploratory analysis to understand the data set. You can use whatever tool you want: Excel itself, or R if you're comfortable with it, completely up to you. But it's a good exercise to pause and do a basic descriptive analysis, just to understand what the data is telling you. Simple things: how many countries are there; within that, how many customers are there just for the UK; what's the average unit price. Basic stats, basic exploratory analysis. I'm going to do it in Tableau. I'm not going to give you an overview of Tableau itself; this walkthrough assumes you've worked with it before. If you haven't, you might want to pause and learn at least the basics — please come back with a working knowledge of Tableau.

So this is an Excel spreadsheet, and I'm going to open the Online Retail file. By the way, if you open the Excel file, there's a tab with the data description that gives you a brief overview of what each column means — your data types and data definitions. I'll drag and drop the Online Retail sheet, which is where my data is, and I'm going to do an extract, because this is a relatively large data set. Loading it the first time takes a couple of minutes, but once it's loaded the interactions are pretty seamless; that's the advantage of the extract. Without it, on a data set this size, every drag and drop faces some lag.

There we go. First things first, I'll get a total count of the number of records: double-click on Number of Records and we get the value — a little over 500,000, almost half a million records. Good to know. Next, I'll look at the record count by each dimension, one at a time. Remember, we're doing this RFM segmentation only for consumers from the UK, so I'm going to filter the data, but I'm curious what other countries there are. I'll swap the view, arrange the bars from high to low, and add text values. There you go: out of over 500,000 records, the majority are UK-based — 486 divided by 531, so more than 80 or 90 percent of my data is from the UK. That's good: by filtering I'm not excluding much. So I'll filter for the UK. Let's also look at customer ID, because I want to see how many customers I have. Since I'm filtering for just the UK, I'll drag the country in as a filter, tick United Kingdom, click OK, and add the text values. Sure enough, there are 486,000-odd records for the UK, which matches what we found earlier. Within those 486,000 records, how many customers do we have? That's what the next sheet tells us: 3,919 rows, and since each row is one unique customer ID, that means 3,919 customers. Now I have a question for you all. The very first bar, a little over 113,000 records — one lakh thirteen thousand — is null. What do we do with that null? Think about that.
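If you'd rather do that first pass in code than in Tableau, a rough pandas equivalent looks like this. The file name, sheet name and column names are assumed to match the workshop's "Online Retail" workbook, and the counts in the comments come from this walkthrough, so treat them as approximate:

```python
# First-pass exploration in pandas; counts should roughly match the
# Tableau views (~531k rows overall, ~486k of them United Kingdom).
import pandas as pd

df = pd.read_excel("Online Retail.xlsx", sheet_name="Online Retail")

print(len(df))                              # total records, ~531k
print(df["Country"].value_counts().head())  # UK should dominate
uk = df[df["Country"] == "United Kingdom"]
print(uk["CustomerID"].nunique())           # ~3,919 distinct customers
print(uk["CustomerID"].isna().sum())        # the big "null" bar, ~113k rows
```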
So pause for a minute and think about what you do with null or blank records, because you're going to come across this scenario in real life more often than you'd like — whether you like it or not, that's a fact. You have to think about strategies for dealing with nulls: sometimes it's okay to exclude them, sometimes it's okay to force-fit them, and sometimes you have to go back and impute a value. You always have to decide what to do with the nulls. There's a lot of literature, and a lot of heated debate in the statistical world, about handling nulls, but from a purely layman's business perspective, based on what I've seen, there are two simple approaches. First, decide whether the null sits in a dimension or a measure. Quick Tableau 101: a measure is something numeric that can be aggregated — if you can take a sum, an average or a standard deviation of it, if you can apply any numerical aggregate to it, it's a measure. A dimension is something you slice and dice by — name, place, region, account, city, state — the categories you break a measure down by. So decide whether your null falls in a dimension or a measure, then take the appropriate action. If it's a measure, you have two options. You can exclude it — but before you do, think about how it should be treated: if it was genuinely a mistake, an anomaly, treat it as an outlier, and if it warrants a separate group by itself, create one. Or you can impute it: assume a value such as the mean, the median or the most commonly occurring value. For a dimension it's pretty straightforward, at least as far as this scenario is concerned: null becomes a category by itself.
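The two broad options just described — impute a measure, exclude (or bucket) a dimension — look like this in pandas, with hypothetical columns:

```python
# Null handling sketch (columns hypothetical): impute a measure,
# drop rows whose key dimension is missing.
import pandas as pd

d = pd.DataFrame({
    "Quantity": [3.0, None, 7.0],        # a measure: impute mean/median/mode
    "CustomerID": ["C1", None, "C3"],    # a dimension/key: exclude or bucket
})

d["Quantity"] = d["Quantity"].fillna(d["Quantity"].median())
d = d.dropna(subset=["CustomerID"])
print(d)
```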
So what would we do here? Since my objective is to create segments of customers, there's nothing I can do with a null customer ID. Don't force yourselves. Most of us, when working with data for the first time, get very emotionally attached to it, and that's where a lot of those hotly contested debates about nulls come from. Remember this very clearly: you do not have a responsibility to fix the data. That is not your job. Get that out of your system early in the process. Don't take ownership of the data: you were handed a data set, and the cascading effect of nulls has to be handled upstream in the workflow. The process owners have to take ownership of the data, not the person analyzing it. Yes, it is your responsibility to spot the nulls; yes, it is your responsibility to account for them; but that does not mean you take ownership of them. Don't massage the data, and don't fabricate it, because two things can happen: one, you don't have enough knowledge of the process at the transaction level to create values accurately; two, you'll mess up your analysis and end up producing false conclusions. So the safest go-to option is simply to exclude nulls. Now, some of you might ask: what if 40 percent of my data is nulls? If I exclude them, I'm stuck with just 60 percent of the data. My answer is: so what? How does that hurt your analysis? If anything it's cleaner. Would you prefer 100 percent impurity or 60 percent purity? That should be your philosophy. That doesn't mean I'm asking you to be irresponsible; it means taking ownership of what actually is your expertise, which is the 60 percent of good-quality data. Because if you can showcase the great things that come out of analyzing that clean 60 percent, you can take it back to the business owners and process owners and say: here is the magic good-quality data can create; if you like this, give me cleaner data. That starts an iterative dialogue between the data owners, the process owners and the analysts — a fruitful conversation rather than passing the buck. It helps them see the merit in good-quality data, and then a cultural shift begins and they start taking ownership of it. I've seen this personally across a lot of the clients I've worked with. To help kick-start that process, I create what's called a data quality dashboard: even before I start any work on a dashboard or an analytical project, I give them a quality score. I tell them: these are the fields I need for the analysis, here's all the junk I've noticed in your data, and based on that I assign your data a score of, say, 70 — meaning whatever analysis I churn out will be based on 70 percent of the data, because there's 30 percent I cannot and will not use. If I used it, my analysis would get muddied. At the risk of repeating myself: I'd rather be stuck with 60 percent purity than 100 percent impurity. That's the philosophy.
Anyway, in this case it's customer segmentation, so what do we need? Customers. If customer IDs are null — ruthlessly, without any feelings — get rid of the nulls. I'm going to exclude them. With nulls excluded, I'm left with 3,918 customer IDs; that's my list to work off of. Great, moving on. I'll duplicate this sheet so I retain the filters. The next thing I want to understand is the description field, because in all probability one customer has multiple transactions, and each description corresponds to an SKU. So I'll remove customer ID from this view and take the stock code and the description together, to get a sense of the high-volume SKUs. Let me rearrange these from high to low. Okay — it looks like the same stock code can carry different descriptions. That's fine: since we're not doing anything at the product level of detail, it won't affect our analysis, but it's good to know. You can see there are roughly 3,800 distinct SKUs, so about 3,900 customers and about 3,800 SKUs — a pretty even match. That does not mean every customer buys exactly one SKU: one customer can buy multiple SKUs, and one SKU can be shared across multiple consumers. Just FYI. We also have dates, so let's duplicate again to retain the filter, remove stock code and description, and bring in the invoice date to see what sort of trend we're looking at. Expand year into quarter, quarter into month. It looks like the data spans one year, so there isn't a lot to study in terms of seasonality, though there is a mild quarterly pattern: the volume steps up from the first quarter to the fourth, the last quarter is the highest, and the lowest happens to be the month right after it. If we had two years' worth of data we'd start thinking about seasonality, but we don't, so I'll collapse the quarter and just look at year and month. I could also draw a reference line for the average number of orders, but I won't; this is good enough — like I said, just basic descriptive stats. So I'm done with the dimensions. Invoice number is too low a level of detail at this point, so let's move on to the measures: quantity and unit price. I'll go back to the previous sheet and duplicate it, so we keep the same filters rather than rebuilding them, and switch from the count of records to quantity, by customer ID.
Since I have quantity in the view, I'll place Number of Records right next to it, and maybe the unit price as well. This gives me a perspective on the relationship between the quantity ordered, the number of transactions per customer, and the unit price. A simple sort will surface some important clues. One thing I can do is set this to Entire View so everything is visible together; at first glance it looks horribly cluttered, but don't worry, we'll make sense of it in a bit. If I sort by quantity, most of the high-volume quantities line up with a high number of records. But if I sort by number of records instead, the correlation isn't that apparent: there are high-volume items, yes, but there are also items with few transactions that still garner huge quantities. Remember, number of records is transactions; quantity is the actual units purchased. It also looks like not all big quantities translate into high monetary value: sort by unit price and you'll see some high-priced items alongside the high-quantity ones. Buy even a few of the expensive items and the revenue — quantity multiplied by unit price — climbs quickly; and scrolling down the list, there are high-value items that don't move much quantity at all. So there is some scope for revenue optimization. That's what this view tells us. I think that's pretty much it; there's no geospatial angle, because by filtering to one country we're only looking at UK consumers. So we're good from a descriptive perspective. There are a couple of other things we could do, which I won't: if you have the time and energy, I'd recommend a histogram for each of these — quantity, unit price, number of records — to get a sense of the min, the max and the shape of the distribution. I'm going to skip that, because it's time to get into the real meat of what we're doing, the RFM analysis. But do run this simple exercise yourself. I talked a lot, so it sounded like a lot, but for all practical intents and purposes it should take less than five minutes: import the data, drag and drop, and just look at it in a tool like this. Use Excel if you want, use R if you want — the tool is completely up to you; I'm comfortable with Tableau, so that's what I use. Basic descriptive analysis should be almost a reflex: if I wake you up in the middle of your sleep and hand you an unknown data set, this is the first thing you should do before you jump into any analysis.
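For reference, the histogram suggestion above is a one-liner in pandas, and `describe()` covers the min, max and spread in the same pass (matplotlib is assumed to be installed for the plot):

```python
# Distribution check for the measures: describe() for min/max/spread,
# hist() for the shape (requires matplotlib).
import pandas as pd

df = pd.read_excel("Online Retail.xlsx", sheet_name="Online Retail")
uk = df[(df["Country"] == "United Kingdom") & df["CustomerID"].notna()]

print(uk[["Quantity", "UnitPrice"]].describe())
uk["Quantity"].hist(bins=50)
```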
A lot of clues can be discovered with that simple exercise. So let's move on to the meat of the analysis, which is KNIME. KNIME is primarily workflow-type software: it doesn't need any complex coding. It's a visual, node-based tool — you drag and drop nodes and connect them into workflows, so think of it as a workflow-style analytical tool. The reason I personally got hooked on KNIME is that it does not involve a single line of code. I don't know if I mentioned this in the introduction, but I'm allergic to programming — I'm terrified of coding — and for people like me this is a fantastic option. It lets you focus not on the programming but on the insight: how to interpret the output of an analysis. That's the intent here, and it's great for consultants who care more about the output than about fine-tuning a specific line of code; that's not my forte. One of the reasons I use KNIME is exactly this: I don't spend a lot of time building workflows, and once they're set up, the focus shifts to iterative impact. I can quickly change models, build multiple models, compare their efficiency, create prescriptive analytics, do what-if analysis — I'll show you all of this later. As an added benefit it also does a lot of ETL — extract, transform, load — so plenty of data manipulation, and the tool is very versatile: text mining, machine learning, regression analysis, clustering, unsupervised learning. It's a complete, very capable analytics tool. And by the way, it's free: KNIME is open-source software, so any of you can download it without a license and just run with it.

When you open it for the first time, the first thing to do is set up your workspace, because by default it opens all of the windows, and you really don't need all of them. I'm going to minimize the ones I don't need; I only need four panels to work with. The KNIME Explorer I definitely need, so I'll keep it. The Workflow Coach I'll minimize. The Node Repository I do need, because that's where I'll be dragging and dropping nodes from. The workflow editor — the workbench — we definitely need. The Outline I'll minimize, and the Console as well. And that leaves the Node Description, which is a good panel to have, because as I start bringing in nodes,
it gives you a brief overview of what each node does, a description of the input options you need to set, what to expect from the output, and, where further reading is involved, links to additional reference material. That's what the Node Description is for. And in case you made a mistake and minimized something I didn't ask you to, and ended up with a mangled layout, there's a very straightforward one-click option that rearranges everything and resets all the views: click the View menu, choose Reset Perspective, say yes, and you're back at the initial screen I started with. It's okay to make mistakes — only by making mistakes do you learn — so don't worry. I'll minimize the ones I don't need again, the Outline and the Console, and these are the four panels I need. Great.

First things first, I need to create a workflow, which is where most of the action happens. I'll right-click where it says Local Workspace and choose New KNIME Workflow, then give it a name; since we're doing the RFM analysis, I'll call it "RFM Analysis". That gives us a checkerboard-style layout, and this is where the action happens. Let me zoom in a little so you can all see it better — say 100 or 110 percent; zoom in or out depending on your screen resolution. This is where we'll start adding nodes. The Node Repository is available here, and you can explore all the options in each category: the first is IO (input/output), there are nodes for Manipulation, which covers the different ETL — extract, transform, load — operations, and there are Views, which are basically your charts. My recommendation: when you're just getting started, don't try to explore every possible node, because you'll get overwhelmed pretty quickly. There are just so many nodes, it can get intimidating fast, and there's no reason to get frustrated if you don't know what some of them mean. It's good to know they're there, but don't spend too much time on them. I'll give you a shortcut: the easiest way to find nodes is simply to start typing in the search box. First we need to import the data set, which is in an Excel sheet, so I type the word "excel" and the matching options pop up. The search box acts as a dynamic filter: as you type, the appropriate nodes appear. Right now, for "excel", I have four possible nodes, and notice they're pretty intuitive — I can read, with an Excel Reader, or write. Since we're importing a data set, the closest match is the Excel Reader, so I'll click on it. And notice that the minute I click on the Excel Reader, the Node Description appears on the right, which basically tells you what
this node is about: the options in its dialog box, which files and sheets to read — the settings you'll face after importing a data set. Scroll down and it gives you a complete overview: what the node does, its input columns, its outputs and so on, plus some extra reference material if you want to learn more. That's the Node Description. So I'll drag and drop this Excel Reader into the workspace and place it right over here. When you bring in a node for the first time, you'll see a traffic-light system underneath it. Right now it's red; once I do something it will turn yellow, and finally green. Red means the node is available but you haven't told it what to do — in other words, you need to configure it, to tell the Excel Reader basic things like where the data set lives, how many rows to read, whether the data has headers, which sheet to read, and so on. Once you've configured the input parameters, the light changes from red to yellow. Yellow means the configuration parameters are all set; they could still be right or wrong — it tells you nothing about errors — but all the required parameters are in place, which means the node is ready to go. Then you execute the node and it turns green. Right now it's red, so let's configure it: right-click, click Configure, and the first thing is to select where the data set is. I click Browse, navigate to my RFM folder, pick the Online Retail file, and click Open. Now, when you import a data set from an Excel spreadsheet, there's a minimum of two things you need to do. One: by default it selects the first sheet in the workbook to read. You almost always want to select the actual sheet with your data explicitly — don't accept the default, because you don't know which sheet comes first; depending on how the workbook was put together, the data-description tab could have been the first sheet. In this case the data is on the Online Retail sheet, so I click on that. Two: column names. If I don't set this, the first row — which contains the column headers: invoice number, stock code, customer ID and so on — will be treated as data, and you're bound to get errors. So tick the option that says the table contains column names in row number one. Everything else can stay as it is. Now scroll down, and you'll see a red-colored message. Most often red signals something bad — that's how our brains are conditioned — and in this case that's correct: we cannot proceed before fixing it. But it's a simple fix.
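For reference, the same two configuration decisions — pick the sheet explicitly, treat row one as headers — are just arguments to pandas' Excel reader if you ever script this step instead (file and sheet names assumed from the workbook):

```python
# The Excel Reader's two key settings, as pandas arguments.
import pandas as pd

df = pd.read_excel(
    "Online Retail.xlsx",
    sheet_name="Online Retail",  # never rely on "first sheet" defaults
    header=0,                    # row 1 holds the column names
)
print(df.shape)   # expect 500k+ rows
print(df.dtypes)  # strings, integers, datetimes - like KNIME's Spec tab
```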
Here's what's going on: when you input a data set for the first time, the node generates a preview, and the default settings we just changed haven't been carried through to that preview yet. All you need to do is click the Refresh option, and the preview regenerates with the configuration you've just created. Notice the headers now sit where they belong, and a quick cursory glance suggests the data is good. I'll click OK, and the light changes from red to yellow: the node is configured and ready to run. Right-click and click Execute. Something starts happening, and once it's done processing you get a green light. The first run takes a few moments, because remember, we're importing over half a million records — more than 500,000 rows. And yes, it's done; we have a green light. Now, to see the output: right-click and choose the option that says Output Table. It confirms what you just did: the column names, a preview of the rows, and the total record count — look at the top left, where my mouse is, and it shows the exact number of rows, more than 500,000. There's one more thing worth looking at: the Spec tab gives you a nice overview of the data set — the metadata. It lists each column, what type of data it holds (string, number, date and so on), and a lower and upper bound, essentially the min and the max. Where it can't compute one, or there are nulls, it shows question marks, and for categorical variables it spits out the distinct values. Back in the table view you can read the types too: next to Invoice Number there's an S, meaning it's treated as text, a string; Stock Code has an S as well; Description is a string; Quantity is an integer, marked with an I; and Invoice Date has a calendar icon. Great — the import of the data set is done. Now I want you to pause for a minute and think about what we need to do next. Take a piece of paper; that's the best way to learn. Keep the overall objective in mind: the ultimate aim of this workflow is an RFM analysis, and RFM is recency, frequency and monetary. We've imported the data set; before we can start calculating recency, frequency and monetary, we need to do something to it. Think about what. Going back to the descriptive analysis: the initial data set has over 500,000 records. We need to filter it to just UK customers, and we also need to
filter out the records with blank customer IDs. Those are the two things to handle before we can proceed, so let's do the first. There's a node for filtering: type "row filter" in the search box. You might wonder whether to use a column filter or a row filter. We're not removing the entire Country column — we still want to read it — we're filtering specific rows within it, which is why it's a row filter. As soon as I type "row fil", the Row Filter option pops up, and I drag and drop it onto the workbench. Now think in terms of a workflow: a series of connected events. We imported the data set, so the output of the reader should serve as the input of the next node, the Row Filter. Visually it's pretty intuitive: the reader's outward arrow connects to the filter's inward arrow. Click on the out-port, keep holding the left mouse button, drag until the two connection points meet, and let go once the lines join. Now they're connected, but the dot is still red, because connecting the nodes isn't configuration. Right-click, click Configure. The column to test is Country, because I want only customers who belong to the UK. There's an "exclude rows" option, but we'll use "include rows by attribute value", since we're including consumers from the UK — that's what it means. The matching pattern is "United Kingdom"; make sure your spelling is correct, because I believe it's case-sensitive. Or, easier, pick it from the drop-down, so you don't fall into the spelling trap and don't need to remember the spelling at all — I'll just choose United Kingdom from the drop-down. Click OK, and the node turns from red to yellow. Right-click, Execute — and think about what you should expect: from 531,000 rows, the total count should come down to somewhere close to 486,000 records. Let's see if that works. Yes — the green light has come on. Right-click, look at the filtered output, and sure enough everything looks fine: invoice number and the other columns are unchanged, and the total row count is 486,000 — 4 lakh 86 thousand.
So yes, the filter has done what it's supposed to do: it filtered for the UK and returned the count we were expecting. I'll close this. Now what do we do next? Again, you might want to pause the video, take a piece of paper, and think about the next step — I gave you the first one; think about how to move on. The next step is: just as we filtered by country, we also need to filter by customer. Remember the Tableau workbook: within the UK we also dealt with specific customers, and we need to remove the nulls. That's the other filter to apply inside KNIME. We'll still use a Row Filter, so add a second Row Filter node and pass the results of the previous row filter into it. By the way, you can name your nodes. Right now the workflow looks simple and is easy to follow, but believe me, once you get the hang of the tool you'll be creating a lot of nodes, and it gets pretty difficult to keep track of what does what. Just double-click the label under a node — the default numbering — and rename it. I'll call the first one "Filter UK customers" and this one "Filter out null customers", so each node's purpose has a reference point. Now right-click, Configure: this time the column to test is Customer ID instead of Country, and we want to exclude all the nulls. Rather than typing "null" as a pattern or hunting for it in a drop-down, there's an easier way: choose "exclude rows" and tick the option for matching missing values. That tells the node to scan every record and, whenever it finds a missing customer ID, exclude it, because we've set the exclude criterion. Let's see if this works: click OK, the node turns yellow, right-click, Execute. And again, remember what count we should get back if we did it correctly: we started at 531,000, narrowed to about 486,000 with the UK filter, and from those 486,000-odd records we came down to exactly 354,340. So the filtering is indeed working. Let's check the Tableau workbook to confirm the numbers: at the customer ID level, the sum of Number of Records at the bottom reads 354,340 — three lakh fifty-four thousand three hundred forty — which matches our number here. Filtering is done. What's the next step? We've filtered for UK customers and filtered out null customers; the next step is the RFM analysis itself. Think about this: RFM has to be done at the customer ID level, but our data is at the invoice-line level — every record is a line on an invoice. So we need to aggregate the invoices up to the customer ID level.
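The two row-filter nodes collapse into two boolean masks in pandas. The intermediate counts should line up with the KNIME output tables (numbers taken from this walkthrough, so treat them as approximate):

```python
# Both row filters in pandas; counts should match the node outputs.
import pandas as pd

df = pd.read_excel("Online Retail.xlsx", sheet_name="Online Retail")

uk = df[df["Country"] == "United Kingdom"]   # include-by-attribute filter
print(len(uk))                               # ~486,000 rows

uk = uk[uk["CustomerID"].notna()]            # exclude missing customer IDs
print(len(uk))                               # ~354,340 rows
```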
In other words, every customer has to be aggregated across all of their transactions: one customer can have multiple transactions, and each customer's transactions have to be grouped and combined to produce one consolidated row per customer. So we're essentially going to build a pivot. But what do we pivot on? Go back to the example: we need to create three columns — one for recency, one for frequency, one for monetary — so think about which fields you'd use for each. Frequency is the easy one: it's just a count of transactions. One customer can have two, three, four invoices; if I group by customer and count invoices, I get the frequency metric. For the monetary value, I have quantity and I have unit price, so I multiply unit price by quantity, which gives me the monetary impact of each line; summarized per customer, the sum of those values is the monetary metric. The third one — I wouldn't say difficult, but slightly more involved — is the recency metric, and it's based on dates. Think about which column you'd use: we have the invoice date, and I'm going to use it to create a number-of-days metric. For recency, a customer who has transacted with me more recently matters more than one who transacted way back in the past. Say two customers have both been with us for the past year, but one bought from us just last month, while the other's most recent purchase was three months ago — who's more valuable? That's what the recency metric captures. How do we calculate it? With some date arithmetic: you can subtract each invoice date from the last possible transaction date in the data, or you can take a fixed reference point — say, today — and subtract every invoice date from that. If the most recent transaction was one month ago, the gap is small; three months ago, wider; a year ago, larger still. The larger the gap against that fixed point, the weaker the recency. The key — and this is ultimately what matters — is a fixed reference point. It can be anything: one date you choose from the transactions, the last possible date, or today. Here, recency will be based on today's date minus the invoice date.
To calculate these three, we need to create some formulas in our data set. The good news is that frequency needs no formula: since it's just a count, the pivot step will count for us automatically; we only need to make sure the invoice number is available as a field to count on. So let's start with the easy one, a calculated field for the monetary value. Back in KNIME, there's a node called Math Formula; I'll drag and drop it in. By the way, if you're running out of space you can do one of two things: keep extending to the right and use the scroll bar, or — what I prefer, so you see everything on one page — route the workflow downward. I'll grab the connector and attach it to the Math Formula's input. And if you want to tidy up how the connecting lines run, click anywhere on a line and notice a dot appears around its midpoint; that dot lets you stretch the line and reshape it — square it off, bend it, whatever — so the workflow stays easy to read. It's a minor cosmetic tip, but worth sharing. This looks good enough, so let's open the Math Formula configuration. What we need is the monetary value, which is the multiplication of unit price by quantity. The node works very much like Excel: all the available functions are listed, and clicking one shows a description on the right along with its syntax — what it needs, what it means, the dollar-sign convention and so on. The syntax is slightly different from Excel, or even Tableau, but the concept is very similar, and the expression builder is where you assemble the formula: double-click a column to drop it into the expression. So I double-click Unit Price — notice it wraps the name in dollar signs before and after; that's its syntax — type the multiplication sign, and double-click Quantity. There's one more option here that's pretty important: you can either replace an existing column with the calculation, or append a new one. Replacing helps when you have a large volume of data and want to minimize processing time, but if you don't yet know exactly what you're doing — especially the first time — I'd say append the column, so you can compare it against the originals. I'll append the column and call it "Monetary"; that's just me naming the column. Then click OK.
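The Math Formula node amounts to one vectorized multiplication in pandas; appending a new column rather than overwriting one is the same append-versus-replace choice made in the dialog (file and column names assumed from the workbook):

```python
# Math Formula equivalent: append Monetary = UnitPrice * Quantity.
import pandas as pd

df = pd.read_excel("Online Retail.xlsx", sheet_name="Online Retail")
uk = df[(df["Country"] == "United Kingdom") & df["CustomerID"].notna()]

uk = uk.assign(Monetary=uk["UnitPrice"] * uk["Quantity"])  # append, don't replace
print(uk[["Quantity", "UnitPrice", "Monetary"]].head())
```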
Right-click, Execute, then right-click again and look at what we've done in the output data. First things first — make it a habit — check that you haven't dropped any rows. Yes, we got back the original count: we're not filtering anything here, still 354,000-odd records, and the Monetary column has been created. A couple of quick spot checks of unit price times quantity look directionally accurate, and a cursory glance through the data set confirms the formula has indeed worked. Great. Back to the plan: monetary is done, and frequency is a count of invoices that we'll get from the summary step, so no separate field is needed there. Now let's create the recency metric. Recency is today's date minus the invoice date, so you'd think Math Formula again — except that the invoice date is not a numeric type. Most software programs distinguish dates from numbers, because they're different data types: just as strings differ from numerics, dates differ from numbers too. Math Formula assumes numeric inputs, so instead we'll use a node called Date&Time Difference. Type "date" in the search box and all the date-related options appear; choose the one that says Date&Time Difference, drag it in, and connect the two nodes. Notice, by the way, that the description on the right tells you what this node really does. Right-click, Configure. First it asks which date-and-time column to use: in this case, Invoice Date. If you wanted the difference between the invoice date and some other date column, you could pick one, but here we'll difference against a fixed point, and among the options we'll choose "current date and time". You could equally specify a fixed date right there; any option works — it's the same idea — current date is just easy to understand. You can also specify the granularity: the difference in days, weeks, whatever date part you're interested in. Let's stick with days; it's easy to reason about. Finally, you have the option to change the default output column name, so instead of "date and time difference" we'll call it "Recency". Click OK, right-click, Execute, and in the output table you'll notice a new column called Recency.
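The Date&Time Difference node, in pandas terms, is a day difference against a fixed reference point. The specific reference date below is an assumption standing in for "the day this was recorded"; as the lecture says, any fixed date works:

```python
# Date&Time Difference equivalent: days from each invoice date to a
# fixed reference point (date below assumed; pd.Timestamp.today() also works).
import pandas as pd

df = pd.read_excel("Online Retail.xlsx", sheet_name="Online Retail")
uk = df[(df["Country"] == "United Kingdom") & df["CustomerID"].notna()].copy()

uk["InvoiceDate"] = pd.to_datetime(uk["InvoiceDate"])
reference = pd.Timestamp("2019-05-27")                 # the fixed point
uk["Recency"] = (reference - uk["InvoiceDate"]).dt.days
print(uk[["InvoiceDate", "Recency"]].head())
```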
You can sanity-check it with quick math: 22 February 2011 against the current date comes to about 3,000 days — we're recording this around the 27th of May — so the difference looks directionally accurate. And again, the absolute number really doesn't matter, as long as every row is compared against the same fixed reference point. It can be any number, but the reference has to be fixed. So we've got the recency metric and the monetary metric; now it's time to start combining the data. Let me explain what I mean. Let me find a good set of rows... yes, this one works. Notice the rows I've highlighted — that's a different country, but that's fine, it's just an example. Let me color-code them: all four records belong to one customer ID, and in fact come from the same invoice. Same customer ID, same invoice, same date — but different SKUs, different quantities, and different unit prices. This is what I meant when I said one customer ID can have multiple transaction lines. That same customer ID could also appear with multiple invoices; here it's just one, but the data set can easily hold, for the same customer in another time frame, another set of invoices for the same or different products. Ultimately our unique key — the key driver — is the customer ID, which is why I said we need to consolidate, or aggregate, the data from the customer ID's perspective. That's what we'll do next: create a pivot-table-like structure that compresses all the pertinent information to the customer ID level. There's a node called GroupBy, which is the equivalent of an Excel pivot. Drag and drop GroupBy, and connect the Date&Time Difference node into it. Right-click, Configure; there are a couple of things to specify in this dialog. First, the grouping level: what do we group by? In this case it's the customer ID level, so I click on Customer ID. Second, there's an option called manual aggregation, where we specify what type of aggregation to apply to each column. You can select columns one at a time, or select them all at once and set the aggregations together; I prefer selecting all of them, because it gives me more control — any column I don't select gets a default aggregation that may or may not work for me. So, invoice number: going back to our example, one customer can potentially have multiple invoices, so "first" or "average" makes no sense here; I'll take a count of invoices, which conveniently also gives me the frequency metric — the transaction count. The stock code, by itself, doesn't carry meaningful information for us here — but be careful: if I leave it with any non-aggregating option, the output would group at the customer ID and SKU level, which I don't want. So I'll take a count of the SKU as well.
The SKU, again, doesn't give me any meaningful information on its own, but I still do a count, because if I leave it at any other aggregation it would group at the customer-ID-and-SKU level, which I don't want. So I do a count of the SKU as well, and the same with the description — just a count. Quantity: I think average will work, because the average quantity sold per customer is good to know — nothing great, but good to know — so we take the mean. The invoice date alone doesn't give you anything either. You could take the first invoice date, and in this case that would give you one invoice date per customer, which is correct. But if for some reason you get multiple invoices, or you add one more grouping level, you could get confused and end up with two sets of invoice dates for the same customer ID. So if you're in doubt, just do a count — even if there are duplicates, they get bundled up into that one customer ID. If this is confusing, pause for a minute and look at the data set: take one customer ID with multiple invoices and you'll see what I mean. Unit price is going to be mean — that's good. Country doesn't matter — it's one country per customer — so first or count, both give the same result here. Monetary value: let's think about what aggregation we want. I want the sum of the monetary values. Remember, the monetary value we calculated is unit price times quantity, so for every customer ID, every transaction contributes one unit-price-times-quantity value, and I need to sum them all up. So it's a sum. Now scroll down and think about this — the most interesting one. For the recency metric, what level of aggregation should I use? Pause for a minute, write down on a piece of paper what you think the answer is, and see if it matches what I'm about to say. I'm going to pick minimum. Why? If you figured it out, great; if not, think for a minute. The recency metric is basically today's date minus the invoice date, so one particular customer can have multiple recency values, based on different invoices and different SKUs. But what's the single most important value? Going back to the earlier example: two consumers, both with me for one year. Customer A last transacted one month ago — today minus one month is 30 days. Customer B, also with me for one year and transacting regularly, did their last transaction three months ago — today minus three months is 90 days. Actually, scratch that; let's use a single customer. Customer A has multiple invoices: one invoice 30 days ago, another 90 days ago, maybe another 120 days ago or beyond. Do I care about the 90- and 120-day visits? Ultimately yes, but for this specific metric there's one single important number I need: the last time the customer came, which happens to be one month ago — 30 days. I can ignore the 90, the 120, and every other time period; that 30 captures it. Hence min: out of 30, 90, 120, 150 — whatever the values are — the minimum of the recency metric gives me the last time the customer visited my store or bought my product or service. So that's done; I think we're ready to run this. Just one last thing: column naming. It defaults to naming columns after the aggregation, and I'd say don't do that, because if we leave it that way the column headers come out as 'minimum of recency', 'sum of monetary', 'first of country', 'mean of unit price'. We don't want that — our columns are already clean — so I change this and choose 'keep original names'. Click OK; we're ready to run. The node turned yellow; right-click, execute... it's running... done. Right-click and look at the grouped table. Actually, I should have asked you to pause before looking, because I wanted you to guess the number of records we'd get back — if you've done your descriptive analysis correctly, you should know, because each record is now one customer. If you go back to our Tableau analysis, 3,918 is what we were expecting, and sure enough, 3,918 rows is what we got back, because we aggregated at the customer ID level. Let's see if the metrics really work. By the way, I should have renamed the invoice column to 'invoice count': what you're seeing is not the invoice number but the count of invoice numbers — since we kept the original column names, the header still says invoice number, but know that it's actually the count of invoices. So the way we read these records: customer ID 12747 has conducted 103 transactions, with a total monetary value of 4,196 — dollars or euros or whatever the currency is — and the last time they came was 2,727 days ago. This data is from 2011 and we're now in 2019, so that's reasonable; we subtracted their most recent invoice date from the fixed reference date. Those three numbers are what we really need; everything else is just FYI.
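For reference, here's how the same GroupBy configuration might look in pandas — a sketch continuing the hypothetical df from the earlier snippet, with the aggregation choices just discussed (count for frequency, sum for monetary, min for recency):

```python
# A pandas analogue of the GroupBy node configuration above (a sketch,
# reusing the hypothetical df from the earlier snippet)
rfm = df.groupby("CustomerID").agg(
    InvoiceCount=("InvoiceNo", "count"),   # frequency: a count, never first/mean
    Monetary=("MonetaryValue", "sum"),     # sum of unit price x quantity
    Recency=("RecencyDays", "min"),        # min = days since the most recent purchase
).reset_index()

print(len(rfm))  # one row per customer -- 3,918 in the data set used in the video
```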
So let's make sure we read this correctly: every record here is one unique customer ID. To compare customers on those three metrics — invoice count, monetary value, recency — let's take another customer with similar numbers, say 12830. Compared to the previous customer, this one has transacted only 38 times, but has given us a monetary value of 6,814, and their last transaction was around 2,760 days ago — roughly 35 to 40 days earlier than 12747's 2,727 days. So customer 12747 is more recent, and on that metric is better; the count of invoices is better too. But the second customer has a higher monetary value than the first. That's how you read the R, F and M metrics: you compare the three numbers and make sense of recency, frequency and monetary together. So the next thing is: how do we translate this into a segmentation? We're almost done — the math is done; now it's a matter of combining these into the right labels. There are a couple of concepts I want to quickly touch on, so let me open a blank PPT to help explain. The traditional way of looking at RFM is to assign scores: R is assigned a score of 1, 2, 3, 4 or 5, depending on how good it is; the frequency metric is likewise assigned 1 through 5; and the monetary metric is also assigned 1 through 5. That's typically how an RFM analysis works — 5 being the highest score, 1 the lowest. You assign these one-to-five numbers depending on where the data falls, and you can bucket them however you like: say, recency scores below the 20th percentile fall under 1, between the 20th and 40th percentiles under 2, above the 80th percentile under 5, and so on; similarly, frequency scores from 0 to 20 percent get a 1, from 20 to 40 percent a 2, and so on. That's how the metrics get translated into labels — and ultimately these are labels. The label is presented as a three-digit score, one digit for each of RFM's three letters. So when your boss talks about RFM scores, or when you're sitting in an interview and someone mentions 'triple five' or 'one one one', this is what they mean — or you can show off your knowledge by asking, 'hey, do you have a customer with a perfect 555?' and start the conversation there. Anyway, the highest possible RFM value is 555: that customer is extremely high in monetary value, extremely frequent — they keep buying from us very often — and has a high recency score, meaning they bought from us most recently. And the lowest? Again, see if you can guess it. It's 111 — the lowest RFM value you can get. Everything in between is up for grabs: combinations like 551, 515, and so on.
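If you wanted to reproduce this traditional scoring in code, a hedged sketch on the hypothetical rfm frame could look like this — note the reversed labels for recency, since fewer days since the last purchase should earn a higher score:

```python
# Traditional five-point RFM scoring (a sketch on the hypothetical rfm frame).
# qcut buckets by quintile; recency labels are reversed so "recent" scores 5.
rfm["R"] = pd.qcut(rfm["Recency"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)
# rank() breaks ties so qcut does not fail on heavily duplicated counts
rfm["F"] = pd.qcut(rfm["InvoiceCount"].rank(method="first"), q=5,
                   labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenating the three digits gives the classic label, e.g. "555" or "111"
rfm["RFM_Score"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
```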
So let's take a small example — two examples, really. Say I have one customer with a score of 515 and another with a score of 551; call them customer A and customer B. Who is more valuable? Take a moment to write down which customer you think is more valuable and, more importantly, why. The answer is your classic MBA answer: it depends. More importantly — it depends on what? Yes, 555 is the highest and 111 is the lowest, but that doesn't mean these translate into numerical comparisons. What does 515 mean? Customer A bought from me most recently, and whenever they do buy, they buy in bulk — they don't come often, they come intermittently, but when they do, they spend a lot in monetary terms. And what does 551 mean? Customer B is also most recent, and they come very frequently — every month, every week I see this customer again, so the transaction volume is high — however, when they buy, they don't buy much; the monetary value of each transaction is extremely low. So it depends on how you want to evaluate your customers. Don't get confused: these are ultimately labels, not numerical values you can compare one against another — they're labels you attribute to specific characteristics. Okay, great. Now, typically, when I work with clients, I don't use this traditional way of looking at RFM, the 555-to-111 scores. The reason is simple: it's ultimately about choices. With five possible values each for R, F and M, I potentially have 5 x 5 x 5 = 125 possible combinations — 515, 551, 155, then 441, 445, 443, 442, and so on. As you can see, it gets complex pretty fast, and I don't want that. Most clients I've seen or worked with don't want a lot of work; they want a straightforward, actionable implementation they can run with and start executing — and, to be fair, that's not their job. Their job is not to narrow down a list of 125 possible combinations and figure out which ones to use. It's up to us to give them an easily digestible output they can act on. So I rely on a non-traditional approach: instead of 555, 444, 333 and 111, I look at high, medium and low. When I do my RFM analysis I tend to stick with high/medium/low, and it does two things for me. One, it drastically narrows the list of choices — yes, I'm still looking at 3 x 3 x 3 = 27 combinations, but that's manageable. Two, it helps me prioritize my actions for the consumer, or for my client. Let me show you how. First, the bucketing: I can have H, M or L for recency, the same high/medium/low for frequency, and the same for monetary. So among the combinations you'll face, HHH will be the highest, LLL will be the lowest, and anything in between is up for grabs — all options are possible. And this is a lot easier. So, thinking about what metric to apply here — let me switch back to the data set real quick and then come back — you have the monetary metric, the recency metric and the frequency metric. To take them from this point and translate them into high/medium/low, what do we need to do? Think about the next steps. Again, I'm purposely going back and forth between concept and implementation: first, you should understand the concept; second, you should immediately start thinking about how to implement it in a practical, real-life example. One of the things I use to calibrate, or dissect, a customer base — or any list — is percentiles. A lot of business people don't understand probability, but a lot of business managers do understand the normal distribution: draw anything with a bell-shaped curve and it's pretty easy to convince them. When you say that most of the random distributions we see in the universe are normally distributed, you get their attention. So generally, when I segment customers, I want three levels: I take one cutoff, which is generally the 25th percentile, and another at the 75th. Now, why percentile and not percentage? Think about it: percentiles help me account for outliers. If I used percentages or averages, I could run into the outlier problem; percentiles are straightforward — a cutoff point: beyond this point, one category; before this point, another. So: anything below the 25th percentile I call low, anything between the 25th and 75th percentiles I call medium, anything beyond the 75th percentile I call high. I generally use these cutoffs as a starting point, but nothing stops you from changing the thresholds iteratively — they just give you a good starting point for conversations. More often than not it's an iterative process: you start with thresholds, look at how many customers land in the high bucket, discuss with the business owners whether that's manageable, and go back and forth. The idea is this: nobody likes to hear that out of 4,000 customers, all of them are valuable — then what's the point of doing the analysis? You want to provide actionable insight: out of the 4,000-odd customers, a reasonable target, a reasonable list that helps prioritize. That's why I choose the 75th percentile — it gives me just the cream of my customers, still big enough to act on, but not so big that I'm distracted from priorities.
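As a concrete illustration of that percentile logic (a sketch on the hypothetical rfm frame; the 25/75 thresholds are just the starting point described above):

```python
import numpy as np

# Percentile cutoffs at 25 and 75 (a sketch; these thresholds are a starting
# point for discussion and are meant to be tuned iteratively with the business)
p25, p75 = np.percentile(rfm["Monetary"], [25, 75])

# Percentiles give robust cutoffs where raw averages would be dragged by outliers
rfm["M_label"] = np.where(rfm["Monetary"] <= p25, "L",
                 np.where(rfm["Monetary"] <= p75, "M", "H"))
```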
Similarly, the lows. Low is also very important, because you have to know whom to let go of. Just as we don't like hearing that all customers are good, not all low performers are bad either — but there are specific categories that isolate low-performing customers, and we don't want to keep spending energy, resources, time and money on them. It's okay to let go of some customers. So that's the strategy there. The bulk of the data is reserved for the middle, because that serves the prioritization: like I said, depending on budget constraints, I might not have enough money to target all my high performers — but if I do, and I've successfully covered all of them, then I look for a second list, a second tier, and slowly dig into my medium list; and I'm definitely not going anywhere near my lows. That's the idea behind grouping into high, medium and low, and those are the thresholds. So how do we translate the concept into practice? It's nice to hear it in concept, but how do we actually do it? There's a node inside KNIME called the Auto Binner — if I type the word 'auto', the Auto Binner comes up. I drag and drop it, connect the two nodes, right-click, configure. This option tells the node what to bin. By the way, let me undo what I just did — I probably went too fast. By default, these columns come up pre-selected, meaning they'd all be binned; I'm going to move all of them to the left by clicking the double arrow. Now I tell the node what to bin: I definitely need the recency metric, so I click on that once; click on monetary; and remember, in our data set the invoice number column is really a count of invoices, so I click on invoice number and move it to the right side. That way I have the invoice count, the monetary value and the recency metric. Great. Once the columns are fixed, the next thing is to tell it how many bins I need. I can specify the number of bins based on equal width or equal frequency, but since I already know what kind of bins I want, I'm going to base it on quantiles — remember, I said 25th and 75th percentiles — so I click on sample quantiles. By default, quartiles are offered: 0 to 0.25 is one bin, 0.25 to 0.5 the next, 0.5 to 0.75 the next, 0.75 to 1.0 the last. But I don't want four bins, I want three — high, medium and low. So what do we do? Referring back to the diagram I drew: I want 0 to 0.25, then to jump directly from 0.25 to 0.75, then from 0.75 all the way to 1.0. So I just get rid of the middle boundary, the 0.5 — delete it. Now it creates three bins: 0 to 0.25 is one, 0.25 to 0.75 the second, 0.75 to 1.0 the third. There are also options for naming the bins; I'll keep the defaults — bin 1, bin 2, bin 3 — for now, because we'll come back and rename them later anyway. There's also an option to replace the target columns; I wouldn't do that yet — since you're just getting the hang of the tool, keep the default and don't replace them. Click OK, right-click, execute, and look at the output: right-click and view the binned data. The output, again, is at the customer ID level: every customer has a bin number for each of the three metrics — for invoice number, which is basically your frequency, you have a bin, and they're named 1, 2 or 3. You can quickly verify this: bin 1, 2 or 3 on each of the three columns. So we're almost done — just one more step. We need to rename these bins. We could do it later, but you know what, let's do it right now, so that when we look at the final recommendations we know where each bin belongs and how many customers fall into each bin and group of bins. Let's quickly run through it. For low/medium/high I have bins 1, 2 and 3. For a metric, bin 1 corresponds to the portion below the 25th percentile, bin 2 is the middle portion — the medium — and bin 3 is the top portion. These are the three bins corresponding to L, M and H, so we finally need to translate bin 1 as L, bin 2 as M and bin 3 as H. But wait — that's not quite correct. That mapping works for just two of the metrics, not the third. Think about which one it won't work for: for one metric, the mapping has to be flipped. Again, put on your thinking cap, pause for a minute, write it down on a piece of paper. Okay: bin 1 is definitely low for monetary — anything below the 25th percentile of monetary value belongs to low, anything between the 25th and 75th belongs to medium, and everything from the 75th to the 100th is high. Frequency works the same way: say the 25th percentile of frequency is 50 and the 75th percentile is around 250, with the highest at maybe 700 — then anything under 50 is low, anything between 50 and 250 is medium, and anything between 250 and 700 is high. That works too. For recency, bin 3 would also seem to correspond to high — however, a lower value of recency means a better customer. Why? Going back to the example: a recency of 30 days is better than a recency of 90, which is better than a recency of 120, because for the recency metric we want the last purchase to be as close as possible to the current date. So for recency alone, the mapping is flipped: bin 1 will be high, bin 2 stays medium, and bin 3 will be low. Let me change the pen color — there you go: the blue annotations are for recency. For recency, the bottom bin is high, the middle is still medium, and the top bin is low; it's simply flipped. I hope you've got this, because it's something you must grasp before the next step: a lower recency value means a better customer — simple arithmetic. So let's go back to KNIME and do it. At this point you have a couple of options: you could export this to Excel, to Tableau, or to a text file and do the manipulation there. But since I'm assuming you're using KNIME for the first time, let's complete the exercise in KNIME so you at least get a feel for how the tool works and what its capabilities are. More often than not you'll find more than one way to do something, and you probably have a plethora of tool choices available; ultimately it's practice and preference that drive the choice — one tool does something slightly better, or you're simply more comfortable with another. Personally, I wouldn't even have done this renaming here: once the Auto Binner was done, I'd have exported everything to Excel or Tableau and carried on from there. But let's stick with KNIME for now. Just a couple more nodes and we should be done. So: I need to rename these bins. There isn't a simple way to rename — a find-and-replace would go record by record and wouldn't work across all the columns — but there's a node called Cell Replacer. For the Cell Replacer to work, it needs a second input: notice it has two input ports. I essentially need to do a join — to give it a map that tells the tool what bin 1 stands for, what bin 2 stands for, and what bin 3 stands for. To create that map, I use what's called a Table Creator — it creates a table, which is nothing but a set of records. So I add a Table Creator, right-click, configure, and do something simple: in one column I enter bin 1, bin 2, bin 3, and in the next column low, medium, high. I'm just giving it a map.
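In code, the whole Table Creator plus Cell Replacer idea boils down to two small dictionaries — one standard, one flipped for recency. A sketch, again on the hypothetical rfm frame:

```python
# A pandas analogue of the Auto Binner plus Table Creator / Cell Replacer pair
# (a sketch, continuing the hypothetical rfm frame from the earlier snippets)
bins = ["Bin 1", "Bin 2", "Bin 3"]
q = [0, 0.25, 0.75, 1.0]

rfm["Monetary_bin"] = pd.qcut(rfm["Monetary"], q=q, labels=bins)
rfm["Invoice_bin"] = pd.qcut(rfm["InvoiceCount"].rank(method="first"), q=q, labels=bins)
rfm["Recency_bin"] = pd.qcut(rfm["Recency"], q=q, labels=bins)

std_map = {"Bin 1": "L", "Bin 2": "M", "Bin 3": "H"}    # monetary and frequency
flip_map = {"Bin 1": "H", "Bin 2": "M", "Bin 3": "L"}   # recency must be flipped

rfm["M_label"] = rfm["Monetary_bin"].map(std_map)
rfm["F_label"] = rfm["Invoice_bin"].map(std_map)
rfm["R_label"] = rfm["Recency_bin"].map(flip_map)       # lower days = better customer
```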
As for the column names, I'll just keep column1 and column2 — you can rename them, but it doesn't matter; I know what column1 and column2 mean. So that's done. Let's right-click and execute the Table Creator first; if you right-click and look at the manually created table, you'll see pretty much what we saw in the configuration. Now I need to join this map in, so bin 1 gets renamed to L. I connect the Table Creator to the Cell Replacer's second port — if you hover your mouse over it, it tells you this is the dictionary table. If I click configure, I can't do it yet, because the node needs both incoming connections, and the order is specific: the bottom port must be the dictionary table — the map we just created — and the top port the input, which is our binned data. So I drag the Auto Binner output onto the Cell Replacer's top port, the input table. Right-click, configure: I need to choose which column's cells to replace. Let's start with the easier one — monetary. For monetary, my lookup column is column1 — that's where bin 1, bin 2 and bin 3 live — and the value I replace it with comes from column2, because the node looks up column1 (bin 1, 2, 3) and returns column2 (low, medium, high). Let's also append a new column so we can see the difference; I'll call it... 'recency high medium low'. Click OK, right-click, execute, right-click and look at the table with the replaced column... hmm, I think I made a mistake there. Right-click, configure: it's not recency — we said monetary, right? Let me match that and call it 'monetary high medium low'. Click OK, execute, look at the table with the replaced column again — and there you go. We can check it real quick: bin 3 should be high — correct; bin 1 should be low — also correct; bin 2 should be medium — also correct. So it's working; our mappings match, and we have something to work off of. Now let's do the same for the frequency metric, and finally the recency metric. One thing you can do in KNIME: if you've found a node you want to reuse, click on it once, hit Ctrl+C, then Ctrl+V, and it places an exact copy of the node — no need to search and drag and drop again. So I follow the same process. The Auto Binner stays the same, since the data set is the same — I'm just replacing another column — so I paste another Cell Replacer, and my Table Creator dictionary stays the same too, because for the frequency metric, bin 1 is still low, bin 2 still medium, bin 3 still high; only recency changes. I join these two nodes, right-click, configure — and instead of the monetary bin, this time I pick the invoice number bin. Notice my column settings are good: column1 is the lookup column, column2 is the output. Append a new column — instead of monetary, call it 'frequency high medium low'. Click OK, right-click, execute, look at the table with the replaced column... I don't see it... oh, there you go — frequency with high, medium and low. But for some reason it's replaced my monetary column as well? Let me make sure I chose the correct settings — configure — invoice number into a new column, that's correct. Let me check the previous Cell Replacer... oh, you know what, instead of feeding from the Auto Binner node there's a more efficient way: the monetary replacement was already done in the previous step, so I'll delete this connection, and instead of connecting from the Auto Binner I'll connect from the previous Cell Replacer's output — it's ultimately a sequential process, and I'm just appending one more column. Drag and drop that output to the next step; this becomes the second replacer. Configure — everything else looks okay — execute. Now, right-click and look at the table with the replaced column: there you go — monetary is already there from the previous step, and frequency is now appended. Always a good point to check your numbers: the invoice number was our frequency, so bin 3 should be high — correct; bin 2 medium — correct; bin 1 low — correct. So monetary and frequency are done; the last one is recency. For recency we cannot reuse this Table Creator, because it maps bin 1, bin 2, bin 3 to low, medium, high — and recency needs a different definition. So I add a different Table Creator: right-click, configure — this one maps bin 1, bin 2, bin 3 to low, medium... sorry — high, medium, low: bin 1 should be high, bin 2 medium, bin 3 low. Going back to the diagram to remind you: for recency, bin 3 corresponds to low and bin 1 corresponds to high. Let's also rename the columns so you can see what they are — I just double-click on the header — column1 becomes 'bin number' and column2 becomes 'rfm label'. Click OK. Then I copy and paste the Cell Replacer again, like before. First, execute the Table Creator and look at the manually created table: yes — bin 1, bin 2, bin 3 against high, medium, low. Connect this Table Creator to the dictionary port, and the output of the previous Cell Replacer to the input port here. Right-click, configure: instead of invoice number, choose the recency bin — that's what we're relabeling; the lookup should be 'bin number', because that's where bin 1, 2 or 3 lives, and the output is 'rfm label' — correct. And instead of frequency, name the new column 'recency high medium low'. Click OK, right-click, execute, look at the table with the replaced column — there we go.
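A quick way to do the same spot check programmatically (a sketch; for recency we expect bin 1 to map only to H, bin 2 to M, bin 3 to L):

```python
# Sanity check that each bin landed on the intended label; any off-diagonal
# count means a mapping was wired to the wrong column
print(pd.crosstab(rfm["Recency_bin"], rfm["R_label"]))
```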
So now recency has its labels too. Let's verify: for recency, bin 1 should be high — correct; bin 3 should be low; bin 2 should be medium. Great — that works as expected, and we now have H, M and L labels on all three metrics. We're at a point where we can start summarizing further. Look at what we've done so far: five-hundred-thirty-odd thousand rows of data, narrowed down to around four hundred and thirty thousand, then further to roughly 350,000 rows, then down to 3,918 customers. That's still a lot — I can't expect anyone to work with 3,900 data points. How do I make this more manageable? This is where we do something very simple next: basically a pivot off of this output. I'm going to do a count of H — how many customers are H on each of the three metrics — and ultimately build a three-by-three matrix: customers who are H, M or L on the recency metric against the frequency metric, with the monetary label layered on top. The output is basically a three-by-three matrix.
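A sketch of that three-by-three summary on the hypothetical rfm frame — one crosstab for the recency-by-frequency counts, and a grouped count if you want the monetary label layered in:

```python
# Straight 3x3 matrix: customer counts for every recency x frequency cell
print(pd.crosstab(rfm["R_label"], rfm["F_label"]))

# Full breakdown with the monetary label included
segment_sizes = (rfm.groupby(["R_label", "F_label", "M_label"])
                    .size()
                    .rename("Customers")
                    .reset_index())
print(segment_sizes)
```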
When you're presenting to a client, or implementing this at work, from this point on it's about how effectively you communicate a story, so let me switch back to my PowerPoint. The performance analysis is what we just did, and this is how the output looks. Let me annotate it: all the numbers you see in the boxes of the matrix are counts — numbers of customers — and if you add them all up they should come to pretty close to the 3,900-odd we had (this might be a slightly different data set, but just get the concept here). Recency is the most important metric, and it's first by design, because it's ultimately the driver of customer loyalty; frequency is the second most important, and monetary the third. That's why RFM is laid out as R, F and M — the order is the order of importance. The first band here, the green band, is your active customers: extremely recent, most active right now, belonging to the top 25th percentile of the customer base. The second band is the at-risk customers: not as active as the highs, but fairly active — I haven't lost them, but if I do nothing, I'm at risk of losing them. The third band is basically the inactives — frankly, the losers. There's not much I can do about them, and it's okay to let them go; unless I've spent all the money allocated to the active and at-risk groups, I'd really have to debate whether there's any specific reason to pour money into the inactives — is there genuinely a cost-benefit case for improving them? In most cases there isn't enough justification, so it's okay to leave them; ruthlessly cut those customers off. That's the high-level view: as you can see, the strategy differs depending on which band a customer falls into — and even within the bands, there's some color coding happening. The 402 are my cream-of-the-crop customers — top of the line, most valuable. Why? Recency is high, frequency is high, and they also bring in extremely high monetary revenue. I'll do everything in my power to make sure I don't lose them — these customers are the reason my company exists, the reason it flourishes. I'll do whatever it takes to retain these most loyal customers: my gold customers. Then, even within the most-recent band, some customers sit in the next tier: the silver customers. For these, even though recency is high, the monetary value and frequency might not be that great. Still valuable — just not as valuable as the golds — second-tier customers, but still worth investing time and money to understand them and keep them happy. The third band is the bronze customers — so gold, silver, bronze. The bronzes have pretty high recency, but their other metrics aren't great. It may still be worthwhile investing in them, but again, do a cost-benefit analysis to see whether it's worth bringing them in for the lower monetary value they'd potentially contribute. So there are different strategies at the high level for all high-recency customers, and within that band you tweak the strategy depending on whether a customer is gold, silver or bronze. Going down the spectrum: these 344 — no emotions attached, no guilt, just let them go. That would be my advice to my client. They're a drain on resources; nothing about them is great — extremely low monetary value, extremely low frequency, and they come once in a blue moon; it's been a while since we've seen them. Even if we offered discounts, coupons and promotions to bring them back, the monetary value is poor and the frequency at which they'd potentially return isn't worth it — improving one metric probably isn't worth the spend. So they're definitely not worth spending money on, unless one happens to be an outlier, a one-off; I'd recommend the client let them go entirely and look at attracting new customers instead — it might be more cost-effective to acquire new customers than to spend the time and effort trying to retain these. That's the idea behind segmenting: it becomes manageable at a glance. For business people, if you throw out analysis that runs into multiple pages of spreadsheets and Word documents, you lose them. Something crisp — a one- or two-slide story — conveys the concept; yes, you have the underlying data supporting the analysis, but the meat of it, the essence of what you're recommending, should be staring them right in the face so they can take immediate action. And the intent of the gold/silver/bronze analysis is this: no matter whether the recency metric is high, medium or low, the intent of every company should be to get more gold. From silver, see how many you can convert to gold — that should be the objective; similarly, from bronze, move customers up to silver, and ultimately up to gold. We always want a northward shift: implement your strategies so customers move from the bronze level to silver to gold. That's the intent of the analysis, and that's what this slide tells you: from the base or bronze level, move up to silver, and get more into gold. All of this is fine — great in theory. Now you've recommended it to your boss, your boss has talked to the client, the delivery team has delivered the analysis — and you wash your hands of it, right? No, you don't. This is where you take accountability: how do you measure the effectiveness of your recommendations? Yes, you've done great analysis and put in a lot of effort to come up with recommendations — but how do you measure their impact? I'm a strong proponent of Lean Six Sigma principles, and one concept I really like is DMAIC. For those of you not familiar with it: the acronym looks intimidating, but it's a simple process checklist, put together by Lean Six Sigma practitioners primarily for establishing process control. D stands for define — define a problem statement. M stands for measure — how do you measure the specific problem? A is analyze — once measured, how do you analyze it to see whether things are going well or badly? I is improve — you recommend and implement tweaks and changes. And C stands for control — you check the effectiveness of your implementation, and that last part is what I want to talk about now. We've done the D, M, A and I — we've implemented. How do you check the effectiveness of the implementation? Simple, straightforward method: deliver your recommendation, note the date, wait a couple of months for the strategies to kick in, then run the same exercise again at that later point in time. If you've followed the exact same process — same cutoffs, same data source, at a later point — you can then compare against your previous recommendations. How? Going back to the bronze-to-silver-to-gold example: if you've achieved a lift — lift meaning a percentage increase in your gold customers — then you, or your model, have done the job of recommending a good strategy. That's your measurement, and you can take it at the overall level or within each sub-level: you want to increase the number of golds and silvers, and/or reduce the bronze or base level.
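A trivial sketch of that lift calculation — the counts here are made-up illustrations, not numbers from the analysis:

```python
# Measuring lift between two snapshots of the same RFM exercise (a sketch;
# gold_before uses the 402 from the slide, gold_after is a hypothetical follow-up)
gold_before = 402   # gold customers when the recommendation was delivered
gold_after = 460    # gold customers a couple of months later

lift = (gold_after - gold_before) / gold_before * 100
print(f"lift in gold customers: {lift:.1f}%")   # ~14.4%
```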
Hopefully this exercise gives you a good idea of how to implement end to end: start with a problem statement, do some descriptive analysis, and finally implement an RFM analysis. In analytics terms this is pretty basic — we didn't involve a lot of statistics or complicated mathematics — but the concepts are very interesting and very valid in how you can use them. What I demonstrated in KNIME was just to give you a flavor of how the tool works; obviously it can do much more than simple arithmetic, ETL and data transformations, but I wanted to give you a brief overview so you get a feel for the interface, the options available, how to search for things, and so on. It was meant to be a brief, roughly hour-long introduction to the tool as well as to a simple concept — try to marry the two: a simple concept, applied through a simple workflow, using a simple tool. In upcoming topics we'll discuss more advanced analytical concepts and see how the tool handles model-building activities as well. So, in this module we're going to talk about cluster analysis. Cluster analysis is a method, not an algorithm in itself; the actual algorithm behind clustering — one of the techniques for it — is called k-means clustering, and that's what we're going to use. It's one branch of segmentation: we saw how to segment using the RFM methodology, and another way to segment is with the k-means clustering algorithm. So what is k-means clustering — or, what is segmentation? First think about why we even need clustering; what's the benefit of segmenting this way? The idea behind clustering is to create distinct groups: you don't know the groups beforehand — you're given a mix of all possible data, and your task is to separate it into distinct groups so that there's plenty of differentiation across the groups but not a lot of differentiation within each group. That's the main objective of clustering techniques. Where would you use clustering? We'll talk more about applications later, but to get you started: similar to the RFM analysis, anywhere that involves products, anywhere that involves consumers — anywhere you want to create groups. And it doesn't have to be products or consumers; those are the retailing examples, but the applications go above and beyond: groups of employees, groups of mutual funds, groups of people in a social network — anything. Take the concept of grouping, apply it to a different scenario with a different data set, and you can transfer the technique to any domain. As for the output of a clustering analysis — we'll cover the concepts and how to run it, but ultimately you've also got to think about how you present it to your audience.
Your audience could be your boss, your client, your boss's peers — anyone. Ultimately, the objective of clustering is to take the analysis to someone running the business and give them recommendations that improve profitability, or sales, or whatever the business objective is. No business person is interested in how you ran the algorithm or how many iterations it went through; they care about: 'okay, boss, what's the benefit? what are you trying to tell me? what do I do with this data to make it actionable in a marketing context?' Having said that, here's a typical example of what clustering output looks like. Those of you who are coffee fans or Starbucks customers can probably relate, but even if not, these are typical labels that came out of a clustering study. You have the regular: he or she comes to Starbucks every single day, walks in, knows exactly what they want — it's almost rote memory — they have to grab their Starbucks coffee before walking into the office or a meeting. The second type of customer is the overloader: the people who want two scoops of chocolate ice cream, one scoop of vanilla, extra whipped cream with two cherries on top, plus extra sugar and honey, and so on. The third type says, 'you know what, I really don't care about your fancy coffee — just give me plain black coffee with milk and sugar'; those are your plain-coffee types. Then there's the guy who comes into Starbucks, orders a latte or a cappuccino, doesn't even drink it — lets it sit and go cold — works on his MacBook all day, and uses the place as his second office, or even his primary office. The fifth kind of Starbucks customer has the secret menus — in fact, there are dedicated websites on how to whip up secret Frappuccino recipes: Nutella Frappuccinos made from the available ingredients — half a quart of this, one pump of that syrup, an extra shot of this — people end up inventing their own creations, and there are whole websites cataloguing the secret menus. Or you have the perfectionist: this guy stares at the cup and has exact specifications — the foam has to be one-third of the cup, two shots of espresso, the milk at exactly 99 degrees Fahrenheit — and he keeps staring at the barista until he gets his order exactly right. And then there's the talker. So these are, per some study, typical Starbucks customer profiles — but the point I'm trying to make is that this is what the output of a clustering study looks like: you create labels people can resonate with — groups, categories, segments — and obviously the marketing strategy for each cluster is different. Then comes a very important decision a business has to make: do they go after all possible targets, or after specific segments?
Think about this — you might want to pause the video for a minute, take a blank sheet of paper again, and note down: if you, as a company, have identified some segments of customers, do you want to target all customers, all segments? Or is it okay not to target certain types of customers? This is a very important decision. Let me give you an example: Maruti — Maruti cars. Even when they commanded close to 80 or 90 percent market share in the 80s — nowadays they're probably closer to 40 or 50 percent — when they started introducing premium models in the 90s and early 2000s, you had the Maruti 1000 and the Maruti Esteem. And they were failures. Why? Again, think about this: why were those supposedly high-end Maruti models failures? When you think of Maruti Suzuki cars, the perception is of a common man's car. Yes, they might have invested in the technology; the cars might have had a lot of features and been comparable to a higher-class car, or even better. But in people's perception, Maruti is the common man's car — probably the first car you'll ever own — and they don't equate the Maruti Suzuki brand with high-end. So whatever they did, it didn't resonate with the general population. And there's nothing wrong with that — it was simply their original marketing strategy; that's how they began, and the name stuck. It's a household name now, but a household name for first cars; we don't associate luxury cars with Maruti. So what did they do to counter it? You probably guessed it: Nexa. They created a separate line of brands under Nexa — where you have your Balenos, your Vitaras, and so on — marketed as a separate brand. Nexa distinguishes itself from Maruti even though it sits under the parent Maruti company. The idea is this: it doesn't always pay to chase all possible segments, and a lot of things drive those decisions. Partly there are specific images you project as a company through your initial marketing efforts, and partly there are perceptions consumers form that simply stick — 'the common man's car' — and there's little you can do to break that perception. So you have to be very clear about where your product or service fits in terms of consumer preference and consumer perception. Now, when you're thinking about how to segment customers — especially consumers — there are typically four categories of data consumers fall under. The first is demographics: think about what demographics are — typically age, gender, household income, education, marital status, and so on; your census-type data. The second type of category you can get about a consumer is psychographic information. Again, pause for a minute and try to think about what psychographic information is.
what are psychographic information so these are typically based on perceived attributes or perceived needs and benefits right say for example i'm um you want to portray yourself as an environmentally conscious person you'll probably go by an electric car or hybrid car or if you want to show off that you have a fantastic job and you want to have high flying carrier you don't care about the environment you probably go get a gas guzzler right so things like that so in terms of how is it that you want to be perceived in your social group associated with the new social peers within your society so that's psychographic based segmentation a third type of behavior that's exhibited by people or you can group people based on what they do right so i make a specific transaction at a particular store my point of sale date is captured so you are who you are based on what you buy that's product behavior and purchase use product use and purchase behavior right so it's based on your spending habits it's based on your um [Music] financial habits it's based on what you buy it's based on what you use in a very tangible right it's based on tran it's at the transaction level it's a tangible measurement that you can extract at the transaction level so that's purchase behavior product use last type is a desired benefit and needs right so these are uh ways to characterize people based on what they buy it's not because what's available it's because people buy some things because they have a very specific need let's say for example i'm in the market to buy a laptop all right so if my intent is to buy a laptop that is lightweight and i care very i care about you know windows compatibility softwares um i probably end up buying a pc right and i'm comfortable with the pc and so on or on the other hand i have a very specific need that i have high graphics i i need something that is pretty lightweight and i don't care about the cast and i have a i'm very comfortable working with a macbook so you end up buying a macbook right so based on very specific needs very specific requirements you buy certain products so that's the fourth way of categorizing or grouping customers based on what they buy so most of the data that you will see on consumers can be broadly classified into one of these four buckets so when you're trying to run a segmentation analysis typically you're not going to find a lot of differences in demographic variables right reason is this i mean most of um demographic variables are pretty constant there's not going to be visible differences but a lot of chant a lot of times you're going to find key differences in psychographic variables right so generally as a thumb rule when we are doing segmentation analysis we typically take the psychographic analysis run the segmentation but use demographic variables to target people and to create personalized targeting messages okay so that's kind of how you want to use think about these four types of variables so let's then jump into what the clustering analysis actually does right so um like you see here the main goal of the clustering algorithm is to maximize the difference across clusters and minimize the difference within the clusters if you think about visually how this works right so let me see if i can draw this so let's say you have two groups of clusters that you want to create right you have the light blue diamonds and the dark blue diamonds so the first thing is it assigns a random initial cluster right so the red dots that you see here are two random points 
This is one random point, and this is the second. Each point is then assigned to whichever of those two random centers it is most similar to; as you can see, the light blue diamonds form one group and the dark blue diamonds form another. Then the algorithm runs again: it takes the mean of the dark blue diamonds and the mean of the light blue diamonds. Once the means are updated, it does another calculation, finding the distance between each point and the new means. Now suddenly we notice that a dark blue diamond is actually closer to the light blue mean, and similarly a light blue diamond over here is closer to the dark blue mean than to its own. So what does the algorithm do? It reassigns those individual points: that light blue point now belongs to the dark blue group, and that dark blue point now belongs to the light blue group. Because the groups changed, the means change too, so it recalculates the means, checks that every point is closest to its own cluster's mean, and repeats this again and again until nothing changes anymore, at which point it stops. That's the iterative process behind the k-means clustering algorithm. The good thing is you don't need to do this by hand; the software does it for you. But it's good to understand conceptually how the iterative assign-and-update process works.
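To make that loop concrete, here is a minimal from-scratch sketch in Python with numpy. This is a hedged illustration of the textbook algorithm, not how any particular tool implements it, and the two-blob demo data at the end is invented purely for this example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its assigned points
        # (a production version would guard against empty clusters)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs, like the light and dark diamonds in the drawing
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")
```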
So the objective of the algorithm is to find the least difference within groups and the most difference across groups. Before I jump into the data, a fun demonstration, at the cost of making you hungry. I have a horrible sweet tooth, and I'm going to use one of my favorite childhood candies: Cadbury Gems. Sometimes I still go on a binge-eating spree and cannot stop, but anyway: for this recording I had three or four packets bought; I'll use two, and you can guess where the rest are going right after I finish recording. In this box I've mixed Gems of four different colors. These are your data points, so there are potentially four groups you could create. This is how the data looks to the algorithm: very colorful and all mixed together. The goal of the algorithm is to separate them out so that there is distinction between groups yet similarity within each group. So, one color at a time, I'll separate the Gems out into their own corners, and now here's the magic: you can see four distinct clusters. Each cluster is uniform within itself, orange with orange, pink with pink, yet there are clear dissimilarities across clusters. Even though they're all Cadbury Gems, there's enough difference between the yellow and the pink, the pink and the orange, the orange and the blue. That's the goal of any clustering technique: enough similarity within clusters, enough dissimilarity across clusters. So let's now jump into an actual data set and see how we can translate data into something as fun as Cadbury Gems. The data set is called World Tourism; let me give you a brief overview (and let me zoom in a bit so you can see it better). It contains macroeconomic statistics for almost 180 countries, one row per country: birth rates; economic indexes like tax rates and ease of doing business; environmental information like carbon dioxide emissions and energy usage; economic indicators like GDP; health indices such as health expenditure per capita; demographic characteristics like mortality rates and life expectancies; technology data like internet usage and mobile phone usage; population broken out into different buckets, including urban versus rural; and inbound and outbound tourism dollars. So, primarily macroeconomic and demographic data for countries across the world. Let's say our objective is a segmentation analysis on this data, and I'm building it for a tourism company: I want to promote international tourism, and I'm trying to determine which countries to set up shop in, or which countries to target. To answer that, I'm going to use k-means segmentation. My customers in this case are countries, so I'll segment countries on key factors and then use those segments to decide where to target my tourism services. Once I bring this into KNIME, I'm going to run the k-means clustering. I'd generally recommend that if you're running multiple analyses, you create separate workflows so you don't confuse what you did earlier with what you're doing now. So I'll right-click on LOCAL, click New KNIME Workflow, call it "clustering", and hit Finish. Let me zoom in to about 120 percent so you can see it a lot better.
First things first, I need to import the data set. To import it, I'll type in "Excel Reader", since this is an Excel workbook. Reminding you again: I need to take this node from a red traffic light to yellow and finally to green. I'll right-click, choose Configure, and browse to where I saved the data set. I also need to tell it which sheet to use, in this case Sheet 1, and specify that the first row is a column header, so I'll tick the option that says the table contains column names in row number 1. You might get an error here: since you've changed the default sheet, you may need to click Refresh before you can proceed. The preview looks good, so click OK; the light turns yellow. Right-click, Execute, and we're done. Let's look at the output: right-click and open the output table. Everything looks good; all 180 rows are there, and you can give the data a quick cursory glance. You also want to check the data types, the metadata: Country is a text field, indicated by S (string); Birth Rate is a decimal, indicated by D (double); I indicates an integer. Again, you just want a quick check of the data and its types to make sure nothing is going wrong, and everything looks okay here.
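If you'd rather follow along in Python, the same import-and-inspect step might look like this with pandas. The file and sheet names here are assumptions, so point them at your own copy of the workbook:

```python
import pandas as pd

# hypothetical file name; adjust to wherever you saved the workbook
df = pd.read_excel("World_Tourism.xlsx", sheet_name="Sheet1", header=0)

print(df.shape)   # expect roughly 180 rows and 26 columns
print(df.dtypes)  # object ~ KNIME's S, float64 ~ D (double), int64 ~ I (integer)
print(df.head())  # the quick cursory glance at the first few rows
```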
Great, so now we're ready to run this through our analysis. Type in "k-means" and the clustering option appears. I'll drag and drop the k-means node right next to the Excel Reader and connect the Excel output to the k-means input. Now I need to configure it, and here's a question you should be considering: this data set has 26 columns, and my goal is to run a clustering analysis and create segments. Which columns would you choose for the clustering? Pause for a minute and think about it. Would you run all 26 columns through the clustering? The answer is probably no. Then how many columns is the right number? That's not even the correct question. The correct question is: which columns should drive the clustering? And that deserves a lot of thought, because depending on the business objective you want to choose the most appropriate columns to segment the data. In our example, the end user is a travel services company that wants to target specific countries for tourism; that's the end objective. With that in mind, here are the variables I'll cluster on. I'll choose life expectancy of females and life expectancy of males: my assumption is that I want countries where people have a high life expectancy, because the longer people live, the more opportunity they have to travel. It goes back to our earlier conversation about customer lifetime value (and we'll talk more about that later): I want customers who live longer, which increases my chance of a loyal customer base. I'll choose two more variables. One is population over 65: my assumption is that people beyond 65 have retired, have their retirement savings, and are ready to travel the world, so they'd be a rich target clientele. The other is urban population, because my assumption is that people living in an urban environment are more likely to travel, especially internationally, than people living in rural areas. That's why I'm choosing these four variables. Nothing stops you from choosing other sets of variables, but the point is you have to put real thought into deciding what variables to use, driven by the question or business objective you set up front. So back in KNIME, I'll configure the k-means node: right-click, Configure. There are just two things to specify. I'll create four clusters in this example, and since by default it includes all the variables, which we don't want, I'll first exclude everything by moving it all to the left, then pick just the variables that make sense: life expectancy of females, life expectancy of males, population 65+, and urban population. Those are the four columns I've chosen, for the reasons I just gave; you've got to decide that up front. Click OK. Most analytics techniques are a combination of data and managerial relevance; this is where domain knowledge and business aptitude kick in. You need a good working knowledge of the business, thinking from the perspective of someone managing it, as well as the analytical perspective. Often the best analysts are the ones who can combine the business part with the analytical part. So let's execute; the node turns green. Looking at the output, there are two views available. The second-to-last one, "Clusters", gives you the means of the clusters: four clusters, as we asked for, numbered 0 through 3, and for every column we chose it gives the cluster mean. For life expectancy, it's roughly 51 for the first cluster, 63 for the second, 81 for the third, and 74 for the fourth.
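As a rough Python equivalent of that node configuration, a four-cluster k-means on the same four variables could look like the sketch below. The column names are my guesses at the headers, so rename them to match your sheet:

```python
from sklearn.cluster import KMeans

# assumed column names: match these to the actual headers in your file
cols = ["Life Expectancy Female", "Life Expectancy Male",
        "Population 65+", "Urban Population"]
sub = df.dropna(subset=cols).copy()   # k-means cannot handle missing values

# (features are often standardized first; omitted here to mirror the video)
km = KMeans(n_clusters=4, n_init=10, random_state=42)
sub["Cluster"] = km.fit_predict(sub[cols])

# the per-cluster means, i.e. what KNIME shows in its "Clusters" view
print(km.cluster_centers_)
```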
Similarly, it calculates the averages of the population variables for each cluster as well. So that's one of the views: the cluster centers for the variables you asked it to build the model with. The other output is the labeled input: right-click and open it, and you'll see it contains all of the original data, plus an extra column with the cluster each row belongs to. How do you read it? Row 0, the first country, belongs to cluster 0; the next row, Yemen, belongs to cluster 1; Vietnam belongs to cluster 3; and so on. These labels are appended to the original data set and handed back to you. Now I can do a couple of things from here. One is to export this to an Excel spreadsheet; since we haven't done that yet, let me show you how. Just like we imported data with an Excel Reader, there's a node called Excel Writer. I'll drag and drop it next to the k-means node. You can write out either the first or the second output port of the k-means node, and if you don't know what they are, hover your mouse over each port for a couple of seconds: the first one is the labeled input, the original data set with the cluster assignments appended, and the second is the cluster table itself. Toward the right it also tells you how many rows and columns there are, which is a clue to what the data set is; the cluster table is four rows by four columns. We want the original data back with the assigned labels, so I'll connect the first port to the Excel Writer. Now configure the writer: right-click, Configure. We need to specify a couple of things. First, where to export to: click Browse, navigate to where I've saved my previous files, delete the default path it fills in, and give it a name, "world tourism clusters". Save it, and just like the import, specify that the first row should contain the column headers. You can choose to output all the columns or exclude some; since it's a relatively small data set, let's include all of them. Click OK, right-click, Execute, and it's done. If we go to where it was exported and open the file, sure enough it has the original data along with the cluster assignment. Now we're ready for the next step of the analysis.
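The Excel Writer step is a one-liner in pandas, continuing from the earlier sketch; the output file name is made up:

```python
# write the original columns plus the appended cluster label back to Excel
# ("sub" is the labeled table from the k-means sketch above)
sub.to_excel("world_tourism_clusters.xlsx", index=False)
```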
So the clustering itself is done, and it was pretty straightforward. Like I said, you don't need to worry too much about the algorithm; a conceptual understanding is good enough, because most tools will run it for you. The next, bigger question is what to do with these clusters: how do you take them and make them actionable? That's really the key; that's where analytics merges with your business insight. So let's see how. I'm going to create a simple pivot table: Insert, PivotTable, OK. I'll put the cluster on the rows, then start picking variables that help explain what these clusters are actually telling me. First I'll drag GDP into Values. By default it gives a count; I want the average, so I'll right-click on the pivot field, choose Summarize By, and pick Average. This is a monetary metric, so I'll change the number format to a currency symbol. There we go. Next I'll add the average of internet usage; this is a percentage, so I'll format it as one. Why internet usage? As I start working with this, an idea forms. My intent is to build this for a travel and tourism company, and when people travel overseas they want to stay connected: to make phone calls, use WhatsApp, post updates to their Twitter accounts or Instagram handles. They need internet for that. So maybe I can bundle my travel services with a telecom provider, establish a partnership with a renowned operator, and offer special roaming packages. That's how you can think about adding other factors. Notice that internet usage, and even GDP, were not among my initial clustering columns, and that's perfectly fine: you don't have to profile clusters using only the columns you clustered on. That's the beauty of this type of analysis. You create the initial clusters based on what you think the customer profile is, but to target them you're free to bring in other related variables. I'll also add the urban population, which was part of our clusters, again as an average formatted as a percentage, and one more column just for illustration: tourism outbound, the monetary value of what people from those countries spend on tourism, again as an average rather than a sum. There we go: four clusters, four groups of data, profiled by a combination of variables that were part of my clustering and variables that were not.
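The same pivot-style profile can be sketched with a pandas groupby. The profiling column names (GDP, internet usage, and so on) are assumptions about the headers:

```python
# average each profiling variable within each cluster, like the pivot table;
# note GDP and Internet Usage were NOT clustering inputs, and that's fine
profile_cols = ["GDP", "Internet Usage", "Urban Population", "Tourism Outbound"]
profile = sub.groupby("Cluster")[profile_cols].mean()
print(profile.round(2))
```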
(One of the percentage formats slipped, so let me change that back; there you go.) Now that I have this, I can start thinking about which is my most attractive segment. Again, pause for a minute, take pen and paper, and write down in your own words which cluster or group you would target if you were promoting a travel and tourism company. I would say cluster 2 is my preferred cluster. Why? It has the highest outbound tourism spend, the highest GDP, high internet usage, and the highest urban population. So if you're building this for a client, or presenting it to your boss or your in-house consultants, you'd start attaching labels. How would I describe cluster 2? It has a high GDP; these are rich countries. Average internet usage is 51 percent, the highest among all four clusters, so I'd categorize them as highly connected, meaning technologically. Then 79 percent of the population is urban, so: urban. And finally, their average outbound tourism spending far exceeds the other clusters, so they are travel-friendly, for lack of a better word. That's a simple way to create labels: urban, highly connected, rich people who don't mind spending on travel. That's how I'd characterize this cluster, and I could create similar labels for each of the other three clusters, but cluster 2 would be my target profile for my travel tools and services. So that's how you take a data set and translate the clusters into something actionable. Now let me show you something about the pitfalls of clustering: there are specific actionable criteria you need to think about before a segmentation becomes effective. Just because you ran a clustering on some chosen variables and produced an output does not mean it will be effective; you need to think about what works well for a clustering technique and how it translates back into actionable insight. So here's a real quick tip: there are actually five things to think about. First, identifiable: can I identify, and reach, a cluster through demographics or other observable identity characteristics? Second, sustainability: are the clusters big enough to be worth pursuing? Is it really worth my effort to go and target those segments? You don't want to spend more money targeting a group of customers than they will ever return to you; ultimately it's a cost-benefit analysis. So make sure the size of the segment is sustainable for the business operation.
Third, accessible: can we actually reach those segments through promotions and effective distribution? Yes, we identified the appropriate customers in the first step, but are they reachable? You have to think about logistics, promotional channels, and the physical delivery of, and interaction with, the product and its processes. Fourth, actionable: going back to our earlier discussion of psychographics and desired needs and benefits, whatever marketing message you create for these customers, will the segments act on it? Will they respond to your targeted promotions? Will your message actually prompt them to buy, procure, or engage with you, whatever the objective is? Finally, responsiveness: will they react uniquely to my promotion? It's very similar to actionable, but takes it a step further: if the message appeals to them on an emotional or tangible basis, how strongly will they respond, and have I provided the means for them to respond to my promotional activities? Those are the five principles of effective segmentation to keep in mind even before you try to implement anything. Now take this advertisement for Coke Zero, and then this second one, which I'll show you, for Diet Coke. I'm going to scroll back and forth between these two advertisement copies; think about it for a minute and write down, in your own words, the difference between the first and the second. This is the result of a segmentation strategy. Two products are being marketed, and there is exactly one variable being used to differentiate segment one from segment two. By now you should have it: the only difference between these two ads, apart from the product, is the use of one variable, gender. Coke Zero is targeted at men; Diet Coke is targeted at women. An extremely simple segmentation strategy: men and women, two groups, two clusters, one variable. Extremely simple, but very efficient in its messaging and marketing. The point I'm trying to make is that the algorithm or clustering technique you use does not need to be complex. It can be complex or simple; it really doesn't matter. What really matters is that your implementation is straightforward. In a lot of cases segmentation has failed not because of the algorithm but because of the way it was implemented. So make sure whatever recommendation you provide is easy to implement, with a real road map for how to implement it effectively.
Ultimately, implementation is the key: no matter how much hard work you do up front, it does not matter if you don't implement it correctly. So those are the thoughts I want you to take away from clustering techniques and k-means. K-means is just one type of clustering, and RFM was another type of segmentation we talked about, but ultimately all of these techniques address one thing: how do I divide up my customers, and how do I create distinct groups that I can use for later analysis? Next: churn-rate prediction. This is very common in marketing analytics, and it has a variety of uses across other domains as well, because all companies want to know the rate at which customers are churning. Even in HR analytics we talk about attrition rates. It's used across industries, not just in marketing: anywhere you have customers, deal with people or contracts, any engagement you're trying to measure has a churn metric. It's a pretty important metric across multiple industries and functions. So we're going to take an actual data set and build a predictive model to see which variables drive churn and, more importantly, how we can control for them. The technique we'll use is logistic regression, so think about when you would use logistic regression. Say I give you a data set with different customers and sales figures, and I'm trying to predict sales: sales is my y variable, and I predict it as a function of, say, price, advertising, product launches, whatever factors seem intuitive, looking at the predictive power of each of these x variables. The first question: with the objective of predicting sales, would I use a linear regression or a logistic regression? You can run a linear regression as long as your y variable is numerical and continuous, able to take any value from negative infinity all the way to positive infinity. That's the criterion, because you're predicting sales as a number. On the other hand, suppose that instead of predicting sales I'm trying to predict the probability of above-median sales: given all of these x variables, what's the probability that sales land above the median? In that case my answer is bounded; that becomes my y variable, and it falls between 0 and 1, in other words 0 to 100 percent. That's how you determine whether to use a linear regression or a logistic regression. Again, nothing stops me from running a linear regression on the probability column, or a logistic regression on the sales column. The models don't know; I can feed anything in and the algorithm will run, but I might just get nonsensical results.
If I run a linear regression on a probability problem, I might get predicted values below 0 percent or above 100 percent. The point I'm making is that the tool will not choose the model for you; it's up to you to decide up front which technique to use, and that is driven by the business question you're trying to answer, in this case by what your y variable is. Your boss is not going to say "run a linear regression" or "run a logistic regression"; no one will tell you which analytical technique to apply. All you might be handed is sales, price, advertising cost, and product data, and it's really up to you to decide the business objective: are we predicting sales as a number, or would it be better to predict a probability? And you can do both with the same data set, like I said when I calculated the probability of beating the median. That's how you want to approach model building for market response models and churn prediction. Great; the next question: when you run a logistic regression, the output comes as coefficients. How do you translate those coefficients into a probability? Let me draw this out. The idea behind the translation is pretty straightforward. When you run a logistic regression, your output looks linear: an intercept plus x1 times beta 1, plus x2 times beta 2, plus x3 times beta 3, and so on, where the x's are your variables and the betas are your coefficients. It looks like a linear function, but it really isn't: choice-based models like this fit an S-shaped curve, and if you wanted to relate that S to a straight line you'd need some sort of transformation, which is exactly what we're going to use. So I'll rename this linear-looking function: it is the utility, U, the utility function, the sum of the intercept and the coefficients multiplied by the x variables. To calculate a probability from the utility, you take the exponential of the utility divided by one plus the exponential of the utility; that's how you get a probability out of this equation.
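Written out, the two pieces just described are:

```latex
U = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k
\qquad\text{and}\qquad
P(y = 1) = \frac{e^{U}}{1 + e^{U}}
```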
Let's see conceptually why this works: let's test the transformation by plugging values from the linear function into the probabilistic one. The linear function has a range from minus infinity all the way to plus infinity; those are all the possible values my utility can assume, and the midpoint of that range is 0. Now convert to the logistic function, which is a probability, using e to the utility divided by one plus e to the utility, and plug in the numbers. Start with the lower end of the spectrum, negative infinity: going back to your high-school math, the exponential of a very, very large negative number is 0, so the equation becomes 0 divided by (1 + 0), which is 0. That forms the lower bound of the probability: zero. If we plug in positive infinity, the other end of the spectrum, what value should we expect? One, 100 percent, because that's the upper bound of the logistic function. Check it: the exponential of a very, very large positive number is effectively infinity, so we get infinity divided by (1 + infinity); since 1 plus a huge number is essentially that same huge number, the ratio is 1, which is what we wanted. So our choice-based predictions fall between zero and one. And what's the midpoint? It's the probability that a person chooses one or zero with equal likelihood, a 50-50 chance. So if the transformation is correct, plugging in 0, the midpoint of the linear function, should give us 0.5. The exponential of 0 is 1, so we get 1 divided by (1 + 1), which is one half, and hence this translates back into 0.5, exactly the midpoint we expected.
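Here's a quick numerical sanity check of those three landmark points (a very negative utility, zero, and a very positive utility) in Python:

```python
import math

def prob(u):
    # the logistic transformation: exp(U) / (1 + exp(U))
    return math.exp(u) / (1 + math.exp(u))

print(prob(-50))  # ~0.0 -> the lower bound
print(prob(0))    # 0.5  -> the midpoint
print(prob(50))   # ~1.0 -> the upper bound
```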
So I hope that's clear; that's how we transform the linear-looking function into a probability function. That's the math behind it, but what you really need to remember is the exponential transformation. Now let's see how to do this in practice: take a data set, run a logistic regression, and see if we can calculate things manually. I'm going back to KNIME. Since I'm going to create a new workflow, save your current work first: under the File menu there's a save icon; click it, and the next time you open KNIME, the nodes you've created will be there, readily available. Now right-click on LOCAL, click New KNIME Workflow, and name it. I was going to call it "logit", but since I want to do a little more than a logit, let's call it "market response models", and click Finish. First things first, import the Excel file: right-click, Configure, browse to where the practice file is saved, select Sheet 1, and tick the option that the table contains column names in row number 1. Refresh the preview; it looks good, so click OK, right-click, Execute. Let me give you a brief overview of this data set: it contains information about the Titanic's passengers. There's a column for whether a passenger survived the Titanic or not (0 meaning they did not survive, 1 meaning they did), the name of the passenger, the gender, and a numerical equivalent for gender (male is 1, female is 0), the age of the passenger, and a sibling/spouse count; if you go back to the data characteristics, it tells you this is the number of passengers aboard who were that person's brother, sister, or spouse. Similarly there's a parent/child count, the number of parents or children traveling with that particular passenger. Column H is the passenger class (first, second, or third), column I is the fare they paid, and column J is the embarkation point, the point of origin. We're going to create a logistic regression model and predict survival using KNIME. The import looks good; we have 1,046 rows of data. Before passing it to a logistic regression, think about what you need. If you've done the advanced statistics course, you probably know this by now, or you should: you need two data sets. Let me take a step back. Say you build a logistic regression model; how would you evaluate its performance? You train your model on a training data set, keep a small holdout sample, test the model on that holdout, compare the results, create what we call a confusion matrix, and then proceed.
That's how you test the accuracy of your model, and for that you need two inputs: a training data set and a testing data set. Here we have just one data set containing everything, so we have to split it into two partitions. The good news is that there's a Partitioning node. Click on Partitioning, drag and drop it next to the Excel Reader, and connect the Excel output to the partitioner. Right-click, Configure. You have two options: absolute partitioning, where you specify an exact row count for the first partition (whatever isn't specified goes into the second), which we don't want, or relative partitioning, which we do. Think about what split you'd typically use: I generally do a 70/30, 75/25, or 80/20 split depending on the size of the data, meaning 70 percent becomes training data and the other 30 percent becomes testing data, your small holdout sample. Let's go with 70/30: put 70 percent of the data into the first partition to act as my training set, and the remaining 30 percent becomes my testing set, where I test the effectiveness of the model I build. We can draw it randomly; a random sample is fine, so click OK. The light changes from red to yellow; right-click and Execute. There are two outputs available. The first partition, as defined in the dialog, is the 70 percent: remember we started with a little over a thousand records, and sure enough we have just over 700 rows here. The second partition holds the remaining rows, which should be the other 300-odd, and it does: 314 rows. And if you hover your mouse over each port, it tells you which is which, so we treat the first output as our training data set and the second as our testing data set.
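In Python, this partitioning step is a single call. A sketch, assuming a hypothetical titanic.xlsx with the columns described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_excel("titanic.xlsx", sheet_name="Sheet1")  # hypothetical path

# 70/30 random split; fixing random_state makes the partitions reproducible
train, test = train_test_split(titanic, train_size=0.7, random_state=42)
print(len(train), len(test))  # roughly 732 and 314 of the 1,046 rows
```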
Great, now we're ready to run it through our regression. Type "logistic regression" and you'll see two options: a Learner and a Predictor. We use the Learner to train the model and the Predictor to generate predictions. Get the Learner first and connect the first partition to it. Now think about what we're trying to predict: going back to the spreadsheet, I want the probability that someone survived, so Survived is the column I need. Back in KNIME, right-click, Configure. First we need to tell it what to predict: the target column should be Survived. Unfortunately, Survived is not offered as an option; all you get is Name, Gender, and Embarked. You might think something's wrong here; why is Survived not available? Pause for a minute and see if you can answer it yourself. Is it available somewhere else? Survived is in the data, but not in the target column, and the clue is the metadata: Survived is configured as an I, an integer, but every candidate in the target-column dropdown is an S, a string. The learner expects the class label as a string, so we need to convert the integer (remember, Survived is 0 or 1) to a string format. There's a node for exactly that, called Number To String. And here's where you work smart, not hard: I could insert the Number To String after the partition, which seems like the logical step, but then I'd have to do the same thing for both the training and the testing branches, creating extra steps. An easier option is to do the conversion before the data reaches the partition. So I'll delete the arrow from the Excel Reader to the partitioner (click the arrow until it's highlighted, hit Delete, and confirm the prompt) and route the Excel output through the Number To String node, then into the partitioner. Configure the node: left as-is it would convert every numeric column to a string, which I don't want, so I'll exclude everything and include only Survived. Execute it, then right-click and look at the transformed output: Survived now shows 0s and 1s as strings, and checking the metadata, everything else is unchanged, so it looks okay. Now I can pass it through the partition (the configuration settings there haven't changed, so just re-execute) and on to the regression learner.
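The Number To String step has a direct pandas analogue, and, as in the workflow, you'd apply it to the full table before splitting so both partitions inherit it. The column name Survived is my assumption about the header:

```python
# cast the 0/1 target to strings so it is treated as a class label, not a number
titanic["Survived"] = titanic["Survived"].astype(str)
```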
Now when I connect this to the Regression Learner and configure it, Survived is available as a target column, so select it. I'm also going to change the reference category to 0 for now; I'll explain why later. It doesn't really matter, it just makes the output easier to interpret, and you'll see the difference between choosing 0 and 1 when we look at the results. One genuinely important choice is the solver: I use iteratively reweighted least squares because it's robust and, more importantly, it gives me the predicted probabilities, which will help me fine-tune the model later when I want to adjust percentages and thresholds. So select that, and now comes the model building. I'll exclude all the columns, moving them to the left, and pick only the ones I think will be useful: Gender (the numeric version), Age, the sibling/spouse count, the parent/child count, and passenger class. Those are the five x variables I want to use to predict the probability of survival. Click OK and run it. Two outputs are available. Right-click and look at Coefficients and Statistics: it gives you the equation, the intercept (a constant), passenger class, parent/child, sibling/spouse, and so on. This is where the reference category I mentioned matters: since my reference was 0, the output describes outcome 1; had the reference been 1, the output would describe 0. Survival is 1 here, so all of these coefficients correspond to the probability of survival. The other output is not that useful: the model's learning properties, behind-the-scenes values like the log-likelihood and the number of iterations it went through. What we really want are the coefficients and statistics, because that's what feeds the next stage. So we've created our model; how good is it? This is where testing comes in. Type "logistic regression" again and the Logistic Regression Predictor comes up; drag and drop it next to the learner. It requires two inputs. The first is straightforward: drag an arrow from the second partition, the test data, into the predictor. The second is the blue port: blue ports are where the learned model parameters get passed, using KNIME's internal model format, PMML, to communicate to the predictor what the learner has learned. So connect the learner's blue output port to the predictor's blue input port, which ties the two together. The node turns yellow; check the configuration settings. This is where that solver choice pays off: this option lets me append the predicted probability columns, and I need to know the predicted probabilities, so enable appending and click OK. Right-click, Execute; it turns green. Look at the predicted data, and there you go: the logistic regression is done and you're looking at test results. The last column is our actual prediction. Row one: the model predicts that the passenger survived, and in reality that person did survive. The second row: again, the person survived and our model predicted it correctly. For the third, the model predicted the passenger would not survive, and indeed they did not. So far the model is doing well.
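For comparison, here's a hedged scikit-learn sketch of the same five-variable model, continuing from the split above. The feature column names are assumptions based on the description of the sheet, and the sketch assumes no missing values (drop or impute first if needed):

```python
from sklearn.linear_model import LogisticRegression

# assumed column names: rename to match your file
features = ["Gender", "Age", "SiblingSpouse", "ParentChild", "PassengerClass"]

logit = LogisticRegression(max_iter=1000)
logit.fit(train[features], train["Survived"])   # the Learner step
print(logit.intercept_, logit.coef_)            # the coefficients table

probs = logit.predict_proba(test[features])     # the appended probabilities
preds = logit.predict(test[features])           # the Predictor step
```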
Let me pick one that's off: row 20. We predicted this passenger would survive, but in reality she did not. So the model performs well in some areas and predicts incorrectly in others. How do we measure its overall effectiveness? How do we aggregate all of these comparisons into a single accuracy metric? This is where we build the confusion matrix. Our model building, training, and testing are done; now we measure how effective the model was. I'll make some space here; there's a node called Scorer. Take the Scorer node, drag and drop it into the mix, and connect the predictor's output to it. Right-click, Configure. We need to compare the actual to the predicted: the first column is the actual, Survived, coming from the original data set, so click on that; the second column is the prediction, the predicted-survival column, which happens to be the last column anyway. Click OK, right-click, Execute, and it's done. There are two views. One is the confusion matrix: one axis is the actual, the other the prediction. We correctly predicted 153 people who did not survive, we correctly predicted 91 who did, and the incorrect predictions are 35 and 35. There's also a more comprehensive view: right-click and choose View: Confusion Matrix, which shows the same matrix, actuals against predictions, along with the accuracy, 77 percent. Now, if you're following along and doing exactly the same steps as me, you might get a slightly different accuracy. Trivia question: why, if you're using the same data set, the same variables, and the same partitioning percentage, would your accuracy differ from mine? Pause for a minute and write down your answer. The answer is in the way we partitioned: look at the partitioning settings (right-click, Configure) and you'll see we used random sampling. If you want everyone to get exactly the same value, set the random seed, or don't use random sampling at all: take linear sampling or take from the top instead. But because rows are being picked at random, you're going to see slightly different results than mine.
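Scoring the predictions in Python mirrors the Scorer node, again continuing from the sketches above:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(test["Survived"], preds))  # actual vs predicted counts
print(accuracy_score(test["Survived"], preds))    # ~0.77 in the video's run
# fixing random_state in the split is what makes this number reproducible
```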
Okay, we'll come back to interpreting those coefficients in a bit, but first, since you now know how to run a logistic regression in just two or three steps, let me give you a tip on how to expand this and run other types of models as well, because ultimately you can compare the accuracy of different models and pick the best one. That's one of the great advantages of a tool like this: with a similar layout and workflow you can quickly try logistic regression, decision trees, even a random forest, and compare them. Let me quickly show you, and then we'll come back to the concepts; once you understand one model, the rest are the same or very similar. Take a Decision Tree Learner, a Decision Tree Predictor, and a Scorer: those are the only three extra nodes I need, because I'm reusing the setup I already have. Connect the first partition to the Decision Tree Learner and configure it: the only thing I'll change is the class column, from Embarked to Survived. You have options to change the quality measure, the pruning method, and so on; I'll leave them at their defaults for now. Then connect the second partition to the Decision Tree Predictor, and connect blue port to blue port, because that's where the model parameters get passed. Execute the learner; its output is a PMML model, which is what gets transferred to the predictor, our test step. Configure the predictor (the defaults are good enough; I won't change anything) and execute. Right-click and look at the classified data: just as with the logistic regression, it shows the predictions. Rows one and two are accurate, and so are three, four, five, and six; the first incorrect classification is here, where the tree predicts that this passenger would survive when in reality she did not. And if you want to see how it went about predicting, you can look at the decision tree itself; it's a classic decision tree representation, with yes/no splits on the variables. Let me put this in full-screen mode. Unsurprisingly, the most important variable is gender; within gender, passenger class plays an important role; within passenger class, fare; within fare, age; and so on. Going down the other branch you'll see similar structure: passenger class, and within passenger class, age, and so on.
class you can see age and so on okay so this is a you can expand this multiple levels and primarily uh the decision tree works like this right so let's see how accurate our model is for the by using these uh nodes and our decision trees i'm just going to connect this to the scorer and if right click click on configure first column is going to be again very similar to the logit which is going to be survived second column will be the prediction click ok let's right click and execute and let's look at the confusion matrix so it was slightly less we have a 75 percent accuracy for the um decision tree learner anyway these are your statistics here so since we are in the mood for model building let's also i'll also show you one quick tip here on how to create a third model maybe by using random forest methods right so let's use a random forest learner we just need three nodes right because all the data is already there neither random forest learner random forest predictor so this is how it's this is how easy or simple it is to uh do model building you just start off with one you can quickly compare and look at the efficiencies across multiple models so for the random forest learner i'm going to use the first partition again let's right click and click on configure so my target column is going to be survived and let's tell it which columns to use i'm just going to say gender number age sibling spouse parent child passenger class that's it this is very similar to what i use in my logit so this is probably more comparable model to my logic and again other options are there available for you in terms of what to how to split it i'm just going to use siran static sorry the default options here click ok let's right click execute so it's done it gives you the random forest model kind of very similar to your decision trees it gives you multiple trees okay so there are 100 models that have been created so you can just scroll through this list to see what the different models all of the 100 models okay so you can kind of see variations with all of these right so based on all of these it comes up with the calculated value that gives you the number of splits for each of these variables and then i'm going to connect the model to the predictor or trained model to the testing model i also need to connect the second partition to the predictor as well let's see how well i modeled it in predicting it let's configure i'm just going to say yeah looks like this is good enough append individual class probabilities that is good enough let's right click and execute let's right click and look at the prediction output so there you go so you have the actual percentages um and based on the percentages it tells you which the probability survival again so these are the actuals these are the predictions based on our combination of multiple trees combined together you see the prediction here and let's run through this score to see how well our model dead let's configure so we're trying to compare the actual which is a survived to the prediction which is the predicted survived okay let's click ok after that let's right click execute let's look at the confusion matrix gives an accuracy of 80 percent this time right so if you recall the accuracy for the logistic model was 77 so this increased our model to 80 right so point i'm trying to make here is it's so easy to compare multiple models all you need to do is just drag and drop once you've created a partition it just requires three extra nodes for every model that you want to do 
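as an aside, this whole three-model comparison maps to just a few lines of code if you ever want it outside knime; here is a sketch assuming a titanic-style csv with the columns we used, already numerically encoded (the file name and encoding are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("titanic.csv")  # assumed file; gender already coded 0/1
X = df[["gender", "age", "sibsp", "parch", "pclass"]]
y = df["survived"]

# the partitioning node: a random split, seed fixed so runs are repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100),  # 100 trees
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```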
okay, let's switch back to logistic regression, because like i said, once you understand one model, the conceptual understanding carries across the others. give me just a moment: let me open not the predicted data but the learner, and look at the model coefficients and statistics. if this were a linear regression, you could just look at the magnitudes (let me see if i can increase the font size) and pretty much estimate the most important variable affecting your prediction. unfortunately this is logistic regression, so you cannot directly interpret the magnitudes. however, one thing you can and should do is look at the sign of each coefficient and check that it makes sense given how things work in the real world, because if not, something is going on: it could be multicollinearity, or omitted variable bias, or some other bias in the model you're not properly accounting for. if the signs of the coefficients do not make sense, you've got to go back to the drawing board, and all the principles you learnt in linear regression about creating and choosing the appropriate x variables still apply in logistic regression as well. so let's read the coefficients. we're predicting the probability of survival, where one means survived and zero means did not survive. gender has a negative coefficient, and gender is coded zero or one: female is zero, male is one. so think about what the negative means and who has a higher chance of survival: just plug in zero and one and see whose utility comes out higher. for a female, zero times the coefficient is zero; for a male, one times minus 2.63 is minus 2.63, which pulls the utility down. so males have a lower chance of survival because their contribution is negative, while the female contribution is zero, hence females have a higher chance of survival. clear? that's how you make sense of it intuitively, and it does make sense: think about why females had a higher chance of surviving the titanic. if you've seen the movie, you've heard it: women and children first. next, age: the negative coefficient of age tells you the older you are, the lower the chance of survival, which also makes sense; elderly people can't get off the ship as quickly, they're less agile, and preference went to women and children, who tend to be younger, so that would have influenced this coefficient too. sibling or spouse: if you were traveling with a sibling or a spouse, you had a lower chance of survival. why? think about the dynamics: you'd probably wait for your husband or wife or your child to get off the ship with you, and that reduces your chance of survival, so that also makes intuitive sense. parent child has a positive coefficient: if you were a parent traveling with a child, or a child traveling with a parent, you had a higher chance, which again goes back to women and children first. passenger class has a negative coefficient, and the classes are one, two and three, so who had a lower chance of survival? those traveling in third class, because three times a negative coefficient is more negative than one times that coefficient. does that make sense intuitively? yes: on a ship, first class typically stays up near the deck for the views, while third class is down below, so it takes them longer to climb up and escape. so all of these coefficient signs make intuitive sense, and checking that is exactly what you should be doing when converting a model into business sense. okay, next: we said we could convert these coefficients back into a utility function, and i'm going to do that for you real quick. the scorer is done, but it's good for you to understand how the scorer works, so you can plug the formulas into an actual spreadsheet. i'll take the second partition, copy the data into a brand new excel spreadsheet, and place it here; let's also copy the column specifications, expand things so they're easy to read, and freeze the top row. there we go. this is the testing data, and the model coefficients for the logistic regression come out as one value per variable. i'm going to insert a row at the top, copy the coefficients, and do a paste special with transpose so each coefficient sits above its column; i'll also keep a copy of the coefficients off to the side, color coded differently, so it's clear these are the model coefficients. i'm going to recreate the confusion matrix from here, so you see the nuances of rebuilding it and how to use the coefficients in a utility function. first, for each row, i multiply the gender value by the coefficient of gender: that's my x1 times beta1. i also put a dollar sign before and after the k in the cell reference, because that fixes the reference cell: otherwise, when i drag the formula down, the second row would point at k2 instead of k1, so i want to keep the reference constant. let me actually back up and demonstrate what happens without it, so you know what a fixed reference means: if i leave the formula as d3 times k1 and drag it down once, the d reference correctly moves from d3 to d4, but the k reference, which should always point at the minus 2.63 in k1, also slides down with the formula. that's why i want it always referring to k1, and rather than typing k1 into every cell, i fix the reference. on windows, pressing f4 puts the dollar signs in for you; if you're a mac user i'm not sure of the key combination, and if f4 doesn't work you can always type the dollar signs in manually, which fixes the reference cell all the same. so i drag that formula all the way down and it fills in, and i do the same for the other variables: age times the coefficient of age, locking cell l1; sibling spouse times its coefficient, locking that cell; parent child, column g times cell n1, fixed; and passenger class, which is 3, 2 or 1, column h times cell o1, with o1 fixed, all dragged down. that looks good. finally, the constant is just the constant itself: 5.01 times 1, the same value in every row, constant throughout. great. next i calculate y, the utility function: the intercept plus x1 beta1 plus x2 beta2 and so on, which is nothing but the sum of the yellow columns i just created; that sum is my linear utility, and i drag it all the way down. finally, the probability formula uses the exponential, and i'm going to break it into smaller steps so it's easy to follow the first time (if you already know it, you can do it in one cell): first the exponential of the utility, then one plus the exponential of the utility as a second column, and finally the probability, which is the exponential of the utility divided by one plus the exponential of the utility, column r divided by column s, giving the predicted percentage. so that is how the logistic probability is calculated; i just wanted to show you how to translate from the coefficients to the actual percentages.
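for reference, the whole spreadsheet exercise collapses into a few vectorized lines; here is a sketch where only the gender coefficient (-2.63) and the constant (5.01) come from our run, and the other numbers are placeholders for illustration:

```python
import numpy as np
import pandas as pd

# two hypothetical test rows with the columns used in the model
test = pd.DataFrame({"gender": [1, 0], "age": [22.0, 38.0],
                     "sibsp": [1, 1], "parch": [0, 0], "pclass": [3, 1]})

# gender (-2.63) and the intercept (5.01) were read off the learner;
# the remaining coefficients here are placeholder values
coefs = {"gender": -2.63, "age": -0.04, "sibsp": -0.35,
         "parch": 0.05, "pclass": -1.0}
intercept = 5.01

# utility = intercept + x1*b1 + x2*b2 + ...  (the row-wise sum from excel)
utility = intercept + sum(test[c] * b for c, b in coefs.items())

# logistic transform: probability = exp(u) / (1 + exp(u))
prob = np.exp(utility) / (1 + np.exp(utility))
print(prob)
```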
now, if you've done something wrong, or used a wrong coefficient somewhere, the way to check is to look at the min and the max of the probability column, so let me compute those. the min should be no less than zero percent; if it's below zero, something is wrong. ours is two percent, which is fine. similarly, the max should not be greater than 100 percent; if it is, something is wrong. this is the quick spot check i use whenever i do this manually, to confirm the formulas are right, the correct coefficients are in place, and the math holds. one more step before i can build the confusion matrix here in excel: i need to turn each probability into a prediction. i'll use a 50 percent threshold: if the probability of survival is greater than 0.5, i predict the passenger survived, coded as one; otherwise zero, meaning the passenger did not survive. this is needed because logistic regression gives me a percentage, but to check how well the model performed i have to convert those percentages back into a binary outcome, and a simple if formula does that conversion. i'll center the column so you can see it better; now every row is either 0 or 1. next, i build a simple pivot table: actuals as rows, predictions as columns, and any field, maybe name, as the value so i get a count. under the options, set the report layout to tabular form so we can see the header names. and that is exactly how you recreate the confusion matrix: the 153 means we predicted 153 passengers would not survive and in reality they did not, and likewise the 91 is correct, so 153 is good and 91 is good. the correct predictions are 153 plus 91, which is 244, the total is 314, and the accuracy is 244 divided by 314, which is 77.7 percent.
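the spot check, the 0.5 threshold and the pivot table all have one-line equivalents in code; a sketch continuing from the prob series in the block above:

```python
import pandas as pd

# spot check: every probability must sit between 0 and 1
assert prob.min() >= 0 and prob.max() <= 1

# 0.5 threshold: above it predict survived (1), otherwise not survived (0)
predicted = (prob > 0.5).astype(int)

# crosstab plays the role of the excel pivot table: actuals down, predictions across
actual = pd.Series([0, 1])  # hypothetical true labels for the two test rows
print(pd.crosstab(actual, predicted, rownames=["actual"], colnames=["predicted"]))
```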
and that's pretty much what we got in the knime confusion matrix as well; let me pull up the first run, there you go, 77.7, it matches. so this is how you translate the coefficients back into excel: you really don't need any sophisticated software, because once you have the coefficients you can calculate the confusion matrix all by yourself. i'll also show you one more thing shortly: by default most programs use a threshold of 50 percent, but there are scenarios where you might want to tweak that threshold, and it's a simple, one-step change. we'll do it on a real-life data set about churn prediction; everything so far is the foundation for understanding how churn prediction works, how to calculate churn rates and so on. so let's take an actual data set and talk about predicting churn with logistic regression, looking at customer churn rates and customer churn prediction. let's open the sample data set. it contains information about an online retail grocery store, very similar to amazon fresh or bigbasket. each record is at the customer id level, with the data aggregated per customer, and column b tells you whether the customer was retained, based on historical data: one means retained, zero means not retained. created is the date at which the customer, not an order, was created in the database; column d is when the customer's first order was placed, and last order is when the last order was placed. esent is the number of emails sent: for the first customer, for example, 174 total emails were sent. email was the primary promotion channel, so customers were sent offers, product updates and so on. eopenrate is the rate at which they opened those emails, and if you look at the data dictionary provided with the data set, you can see what each field means: eopenrate is the rate at which emails were opened, and eclickrate is the rate at which emails were clicked, a subset of the open rate. then there's the average order size, a monetary value: how much an order is worth on average. order frequency, per the data dictionary, measures the total number of orders over the customer's tenure. paperless is a binary indicator, one or zero, for whether they opted for paperless delivery of communications like emails and bills. refill is also one or zero, meaning they opted for automatic refill: think about it, if you order groceries online and you always buy milk or juice or bread, your weekly consumables, vegetables and fruits, and you have a pre-prepared list of day-to-day essentials, you can set up an automatic refill and, no matter what, it gets reordered for you every week. that's automatic refill. they also offered doorstep delivery: pay an extra charge to have the order delivered at the door, coded as one, or go and pick it up from a central location. that's what doorstep means. the train column you can ignore; it just classifies whether a row belongs to the train data or the test data. the last two columns are the categorical variables in this model: favorite day of delivery, one of the seven days monday through sunday, and the city the customer lives in. there are four cities, coded like airport codes: blr for bangalore, dl for delhi, hyderabad, and mas, which stands for madras, that is chennai. so bangalore, delhi, hyderabad and chennai are the four cities we have data for. okay, let's import this into knime and see what factors go into predicting churn. i'll open a new workflow, call it churn prediction, and import the data with an excel reader: right click, configure, browse to the data set. here the data is already split for us, so we won't do a partition: we'll use the train data set to train the model and the test data set to test its accuracy. choose the train sheet, and make sure 'table contains column names in row number 1' is checked. by now these steps should come to you automatically, since you've done them several times: choose the file, choose the sheet, set the first row as the header, scroll down and click refresh; it shouldn't take more than a few seconds. click ok, execute, and the output table looks good. i'll rename this node to train data set, then bring in a second excel reader for the test data set: same steps, choose the file, column names in row one, refresh, ok, execute. the next step is the logistic regression learner; you probably know the drill: a learner, a predictor, and finally a scorer to evaluate the matrix. connect the train data set to the learner and configure it. the target column should be... survived, right? quick question: why is survived not an option for us to choose here? because survived belonged to the titanic data set, of course; here, analogous to survived, our binary variable is retained, and we're trying to understand the chances of a customer being retained. but retained is stored as an integer, so just like before, we first need to convert the number to a string.
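if you'd rather follow along in code, loading the two sheets and doing the number-to-string step looks roughly like this; the file and sheet names are assumptions, so adjust them to wherever your copies live:

```python
import pandas as pd

# assumed file and sheet names for the pre-split data
train = pd.read_excel("churn_data.xlsx", sheet_name="train")
test = pd.read_excel("churn_data.xlsx", sheet_name="test")

# the number-to-string step: treat retained as a class label, not an integer
train["retained"] = train["retained"].astype(str)
test["retained"] = test["retained"].astype(str)

print(train.head())
```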
back in knime, i delete the arrow into the learner and bring in a number to string node, plus a second number to string node for the test data, and connect them. configure the first one: i don't need all the variables converted, so i move everything to the left side, take just retained, and include it. click ok, execute, and look at the transformed output: retained is now a category, marked with an s, which is what we want. do the same with the test data: include just retained, click ok, execute. now i'm ready to run the learner. right click and configure: the target column is retained, and i set the reference category to one, because we're looking at churn predictions. we'll leave the solver at iteratively reweighted least squares, and i want to selectively choose the variables i think are good here; again, this comes with practice and it's iterative, so i'll see which variables give me the highest hit rates, going with intuition first. created, first order and last order are all dates, so i can't use them directly: plugging a raw date in as an x variable gives meaningless values. if you do want to use those dates, consider taking a difference of dates, which gives a numerical value that can make sense, but the dates alone as strings in your equation won't give any meaningful results. so i take esent, a good measure, the number of emails sent, and eopenrate; basically i'll throw in every variable i think helps explain the model: eclickrate, average order, order frequency, paperless, refill, doorstep. train is a dummy column, and everything here is train, so i don't need it. i could also click on favorite day and city, but i'm going to take those two categorical variables out and come back to them later, because there's a specific lesson i want to give you on categorical variables; for now, numerical variables only. looks good, click ok, execute, and open the model's coefficient statistics. again, we can't do much with the magnitudes, but we can definitely read the signs. esent has a positive sign, meaning the more emails sent, the greater the chance of retaining a customer, and eopenrate and eclickrate behave similarly. average order and order frequency have negative signs, which is counterintuitive: it says the bigger your orders, and the more frequently you order, the higher the chance of attrition, of churning; that's something to think about, because we'd expect the opposite. paperless, refill and doorstep all have a positive influence, and the constant is just your intercept.
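the same learner step in code, as a sketch with statsmodels; the column names below are assumed from the data dictionary, and the summary prints the coefficients, standard errors and, importantly, the signs we just discussed:

```python
import statsmodels.api as sm

# assumed column names from the data dictionary
features = ["esent", "eopenrate", "eclickrate", "avgorder",
            "ordfreq", "paperless", "refill", "doorstep"]

X = sm.add_constant(train[features].astype(float))
y = train["retained"].astype(int)  # back to 0/1 for the solver

# logistic regression fit; check the sign of each coefficient in the summary
result = sm.Logit(y, X).fit()
print(result.summary())
```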
so why would average order and order frequency come out negative? a couple of possible reasons. one is the binge effect: when a customer first logs on to the website, they might place one big binge order, buy everything in bulk, never come back, and simply forget about the store, so a large average order size ends up inversely related to retention. order frequency is similar: i'm exploring a new product, i come in very frequently, i keep buying every single week, and then suddenly i lose interest. these are your fads: when something is new you're very enthusiastic, then the novelty fades, you lose interest in the concept, and you drift away. so that's something to figure out in your frequency numbers: just because a customer is extremely frequent does not mean they will be retained, so pay close attention. yes, some people are genuinely interested and will continue, but if you see an unusually high average order size together with an unusually high order frequency, the data is telling you to be cautious: they may be a short-term consumer, so don't get too excited and don't pour marketing money into them, because they'll probably fade off. that's what these signs mean. okay, now that we've made sense of the signs, let's see how well the model does at predicting. i feed the test data into the predictor, pass the model from the blue port to the blue port, get the scorer up and running, then execute the predictor and look at the predicted data: it looks good, we have actual predictions. now the scorer's configuration: the first column is retained, because that's what we're trying to predict, and the second is our prediction of retained; click ok. one more thing: i go back to the predictor and append the predicted probabilities, because remember, i promised to show you how to change the threshold when calculating hit rates, so i tick 'append columns with predicted probabilities' and execute. right click on the predicted data and scroll all the way to the right: sure enough, the probabilities are there; by default a 50 percent threshold is applied, but we'll see how to change that. now execute the scorer and look at the confusion matrix. wow, look at this accuracy: 94.39 percent, a very decent model; it looks like we picked good variables. again, you can refine the model, exclude some variables, maybe add transformed variables, so i'll leave it to you to try to increase the accuracy. a couple of my students have got it as high as 96 percent; i won't tell you what they did, but they applied some transformations to the data set and squeezed out a slightly higher accuracy. so take a few iterations and see if you can get above 94 percent; that's a challenge, or homework, you can do while watching this. but first, let's understand what these numbers really mean. in addition to pushing the model higher, i'd like you to check whether one single variable is contributing to this extremely high accuracy, and see if you can isolate it. here's a clue: the model has eight different variables, and one of them has an extremely high influence on the hit rate; figure out which one. when i run this in a real-life class i turn it into a competition: three or four groups, and whichever group gets the highest hit rate earns bragging rights. since this is a recorded online session i can't do that, but take a pause and see if you can beat the 94 percent hit rate. chances are, if you go about it the traditional way, one variable at a time, you'll stumble on something interesting. try this: take the variable esent alone and see what esent by itself contributes in terms of hit rate. to do that, go back to the logistic regression learner's configuration and remove everything from the model except esent. and one tip: you don't need to run every step one at a time, because the nodes are all interconnected; if you execute the last node it automatically runs the previous nodes, so executing the scorer runs everything upstream by itself. right click, view the confusion matrix, and look at that: 92 percent with esent alone. now the question is: how would you figure that out just by looking at the coefficients? let me put all the variables back, eopenrate, eclickrate, average order and the rest, run it again, and open the coefficient statistics. just from these coefficients, how would you know which variable matters most? the magnitude alone gives you no clue, because with logistic regression you won't get that unless you follow everything through to the calculated probabilities. so here's a small trick. i copy the variables into a blank excel spreadsheet: the variable names, the coefficients, and the standard errors. there is a statistic called the wald statistic. some software packages calculate it for you and some don't, so in case yours doesn't, don't worry, it's simple to compute: take the coefficient, divide it by its standard error, and square the result. let me strip the decimal points so we compare apples to apples. there we go: this column is the wald chi-square, and it gives you the relative importance, the economic significance, of each variable. what do i mean by economic? you've all heard of statistical significance, the p-value; this is the economic impact instead. and we can see the largest value by far, one that pretty much overrules every other variable, belongs to esent: it's close to 4,000. that's your clue about which variable drives the accuracy: take esent out and the rest of the model suffers badly; include esent alone and it carries the accuracy almost by itself. the constant is a tricky one, it doesn't mean anything here, so ignore it; the second biggest is refill, at around 50. so when you've thrown in all the variables and want to know which single one has the biggest impact on your logistic regression, the wald statistic is a nice technique to use.
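continuing from the statsmodels fit sketched earlier (or from coefficients and standard errors copied out of knime), the wald chi-square really is a one-liner:

```python
# wald chi-square = (coefficient / standard error) squared; the biggest value
# flags the variable carrying the model, esent in our case
wald = (result.params / result.bse) ** 2
print(wald.sort_values(ascending=False))
```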
great. now, one more concept: when we build a confusion matrix there are mistakes in it, and every mistake has a cost, so let's understand the cost of these errors. let me open a blank powerpoint and draw a confusion matrix: predicted values on one axis, actuals on the other, each either zero or one. when we predicted zero and the actual turned out to be zero, perfect, everything is good; and when we predicted a customer would be retained and the customer was indeed retained, that's also correct. now, let me change my pen color: what about this box, what does it tell you? take a minute to write down what it means, and do not write 'type one error' or 'type two error'. think of a role play here: i'm a business person, i do not understand type one and type two errors; translate this box into a real-world sentence, plain english any lay person can understand. what this green question mark really means is: i predicted this customer would not be retained, but in reality the customer ends up being retained. similarly, the second box: i predicted the customer would be retained, but in actuality the customer is gone. it's okay to make mistakes, but we need to understand their economic impact, so let's quantify them and attach an economic quantity to each. for the first one, think about the cost, really the opportunity cost: we predicted the customer would not be retained, but they end up retained. in the short term that's an unexpected revenue gain, which might look good, but in the long term we will ignore these customers, because we've already written them off in our system; we'll run no promotional activities to engage them, and they'll ultimately drop off. that's the long-term cost. how big is it? go back to the data set and take the mean of the average order column: it's about 61 rupees, call it roughly 60 rupees of order revenue at stake per customer; a short-term gain we could lose entirely if we don't pay attention. what is the cost in the second box? i predicted the customer is active, but in reality we've lost them, so i keep assuming they're active and i spend time, money and energy on marketing; here the marketing activity is sending emails. how much does an email cost? practically nothing, but for the sake of putting a monetary value on it, say one rupee per email. so which mistake would you rather make? extra emails cost one rupee each, so it's fine to make more of those mistakes: it costs far less to over-email than to write off a long-term revenue prospect. so what is under my control that i can change immediately to affect that behavior? that's the key question, because this is where we take a predictive model and turn it into a prescriptive one. always think prescriptive; don't stop at prediction. yes, prediction is great, but so what? when you hand something to business users you want to translate that predictive power into something actionable they can take control of. here, the actionable lever is the threshold. the threshold was 50 percent: above 0.5 predict one, below predict zero. what if i lower the threshold to 0.3? i'm shifting cases out of the costly mistake bucket into one of the other two buckets. if a case lands in the correct bucket, great, that's added accuracy and i don't worry about it anymore; and if it lands in the other mistake bucket, because we can't control exactly which quadrant each case moves to when we shift the threshold, the cost of that new mistake is still only one rupee per email, and i can afford plenty of those. that's how you reason about economic impact. now flip the scenario: this was b2c, where the cost of marketing is tiny. what about b2b, where marketing is expensive? it's no longer one rupee per contact; maybe it's 40 rupees, because i need a sales force, travel expenses and so on, and i'm constrained to a fixed budget i must stay under. in that case, instead of 0.5 we would raise the threshold to 0.7 or 0.8. what happens then? let me switch to blue: raising the threshold from 0.5 to 0.8 shifts cases the other way, into either the first quadrant, where accuracy is achieved, well and good, or the second quadrant, which is still okay, because i stay within budget and there's even a potential short-term revenue upside; long term i may lose some customers, but that's the lesser concern when the primary constraint is the budget. this is how you vary thresholds to manage the trade-offs in your accuracy matrix. ultimately, you need to decide whether you should be predicting more zeros or more ones, and that determines your threshold level; try to formulate it mathematically in your mind, it will help you translate it into a business setting. more zeros or more ones, and you set your threshold accordingly. okay, cool.
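here is that reasoning as a sketch in code, scanning a few thresholds and pricing each kind of mistake with the rough numbers from above; the probability and label arrays are made up for illustration:

```python
import numpy as np

# hypothetical predicted retention probabilities and true labels
p_retained = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
actual = np.array([1, 0, 0, 1, 1])

# rough costs from the discussion: ~1 rupee per wasted email (predicted
# retained, actually gone) vs ~60 rupees of order revenue written off
# (predicted gone, actually retained, so we stop marketing and lose them)
cost_wasted_marketing, cost_lost_customer = 1, 60

for threshold in (0.3, 0.5, 0.7):
    pred = (p_retained > threshold).astype(int)
    fp = ((pred == 1) & (actual == 0)).sum()  # predicted retained, churned
    fn = ((pred == 0) & (actual == 1)).sum()  # predicted churned, retained
    cost = fp * cost_wasted_marketing + fn * cost_lost_customer
    print(f"threshold {threshold}: total cost of mistakes = {cost}")
```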
so, back to esent: as we said, in this data set esent is the single biggest driver, and you should ask why. the model basically says that if i keep sending more and more emails, most of my customers will be retained, which the data supports, but isn't that too good to be true? you've got to ask yourself: does the number of emails sent really cause high retention, or is it the other way around: just because customers stay retained, they happen to accumulate more emails? think about that, it's a powerful question. what we're seeing here, let me repeat it, is exactly that question, and the phenomenon is called reverse causality. it exists because of bias in the data structure. you've heard of many biases before, sampling bias and so on, but bias can also come from how the data is structured. look at this data: every row is one customer, and esent is calculated as a plain total; it does not account for one key variable, tenure. that's what separates esent from the other fields, because everything else is effectively normalized by tenure: eopenrate, eclickrate and average order size are rates or averages, while esent alone is a raw count. the first customer's 174 emails could have accumulated over a very long time, versus zero or nine emails for a customer who is brand new. that's what was causing the anomaly, the reverse causality. so how do you account for it? pretty simple: build a tenure factor into the field. i would create a calculated column, something like esent divided by tenure, emails sent per year over the customer's lifetime, and that would solve the esent reverse-causality issue.
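a sketch of that tenure fix, assuming the created and last-order dates exist as columns named as below (the names are assumptions from the data dictionary):

```python
import pandas as pd

# assumed date columns; tenure measured in years since the account was created
created = pd.to_datetime(train["created"])
lastorder = pd.to_datetime(train["lastorder"])
tenure_years = (lastorder - created).dt.days / 365.25

# normalize the raw email total by tenure to remove the reverse-causality bias;
# the clip guards against near-zero tenure for brand-new customers
train["esent_per_year"] = train["esent"] / tenure_years.clip(lower=0.1)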
okay, let's talk about market basket analysis. if this is the first time you're hearing about it: it's a very popular technique that started in marketing but applies almost anywhere. the idea behind market basket is: if a consumer has made a specific set of purchases, what is the probability that they will buy something else, given what they bought before? it's based on conditional probability, and it has a lot of uses. it started in the marketing world with literal basket items, so think of the analogy of going to a supermarket with a shopping basket. say i'm walking the aisles and i pick up cereal, so the cereal box goes in; then i realize that to eat cereal i also need milk, so milk drops into the basket; i walk around some more and decide i need bottled water, so water goes in; and then i'd like a little honey to go with my cereal, so honey goes in too. that's my basket: cereal, milk, water, honey, one complete basket. i swipe my card at the point of sale, the transaction is done, and i'm gone. hold that basket in your mind. the next consumer comes in, or you yourself come back another day, and the routine repeats: you start with cereal, then you add bread, that's the second basket, and then... you know you forgot something, but you can't remember what. fortunately, there's enough data collected in the past to say: consumers who buy bread and cereal have historically also bought water. so i entice you: hey, would you also like a bottle of water? we even have bottled water on sale; shoppers who bought these two categories also bought water. now your basket is complete. what have i done? i've influenced you a little: based on historical patterns i've jogged your memory, maybe even created a need you didn't have; i've enticed you into buying something you might not have bought, or would have forgotten. because once you get home and realize you're out of water, you're not coming all the way back to my store, you'll buy it nearby, and i lose that revenue. that's the concept behind market basket: given what people have bought historically, and given a set of item combinations, what is the chance the same pattern repeats? and that's also the concept behind recommendation engines. like i said, it started with consumer choice and literal baskets, but it has evolved into other things as well. movie recommendations: if you watch netflix or amazon prime, you've seen the message saying if you watched this, chances are you'll enjoy that, because consumers with similar tastes watched that sort of movie, based on genre and so on. online shopping does market basket in real time: as you fill your flipkart or amazon basket, a series of recommendations appears below, same concept; historically, people who bought these items also bought milk, or bottled water, would you like to add one? retailers can also offer bundles: these two go together, so when you search for milk or cereal, both come up as a bundle; i can promote the bundle as a whole and reduce the risk of you forgetting one or the other. so those are some common applications of market basket analysis. now, before we run the actual algorithm on a real data set, i want to give you three key metrics, three terms you need to understand about how the market basket algorithm works. let me illustrate with a small sample data set; i'll expand it so you can see it better. these are simulated baskets: basket one, someone buys cereal and honey; basket two, cereal, water and honey; basket three, water and diapers; basket four, cereal; and so on, nine baskets of transactions in total. say i'm interested in the second basket; let me bold it. i'm going to create a rule from it, where cereal and water together form a team and honey is my recommendation, color coded differently. the rule is written like this: call cereal plus water set a, so cereal and water together are set a, and honey is set b, or since it's a single item let's just call it b. honey is lurking close by whenever the cereal and water combination shows up. so my rule says: if people buy set a, they also end up buying b, and it's represented as set a implies b: people buy set a, and chances are they also buy honey. the key word is chance, because that is a probability, so let's understand what that chance is, what numbers sit behind it, whether there are thresholds, and how these rules come into play. there are three important terms. the first is called support. support is the number of baskets containing set a divided by the total number of baskets. set a, cereal and water together, appears exactly once; the combination doesn't occur anywhere except that one basket; and there are nine baskets in total, one through nine. so support is one divided by nine. what does support mean? it's the fraction of transactions in the data set that contain a product or set of products; it measures the popularity of the set. the higher the support, the more popular the set, meaning the more often it appears across transactions. the second metric is confidence. confidence is the number of baskets containing both set a and b, divided by the number of baskets containing set a. here, set a plus b together, cereal, water and honey, appears once, and set a, cereal and water, also appears just once, so confidence is 1 over 1, which is 1. what does that 1 mean? confidence is the conditional probability that someone buys b given that they've already bought set a; it measures how often the items in b appear in transactions that also contain set a. clear? now, support and confidence are what we'll use to set thresholds on our rules, and you'll see exactly what those thresholds mean once we import a data set into knime. the third, and most important, metric is lift. lift is (set a plus b divided by set a), all divided by (b divided by total). notice that the first part of the formula is nothing but the confidence, and b divided by total is just the support of b, so lift can be rewritten as confidence divided by the support of b. our confidence is 1, as we just worked out. b is honey: count how many times honey appears, one, two, three, up to seven, so seven times out of nine baskets, giving b a support of 7/9. lift is therefore 1 divided by 7/9, which is 9/7, about 1.28. lift tells you: given that someone buys set a, by what factor does the chance of them buying b go up compared to the baseline chance of buying b at all? it is often the single most important metric when evaluating the many rules a rules engine produces. it will all become clearer on a real data set, but note the division of labor: the support and confidence thresholds let you relax or tighten the rules, producing more or fewer of them, while lift gives you a way to evaluate the effectiveness of each rule.
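to make the three metrics concrete, here is a sketch over nine toy baskets; only the first few baskets and the final counts (cereal and water together once, honey in seven of nine baskets) come from the walkthrough above, and the remaining basket contents are filled in purely so the counts match:

```python
from fractions import Fraction

# toy baskets; contents beyond the first few are invented to match the counts
baskets = [
    {"cereal", "honey"}, {"cereal", "water", "honey"}, {"water", "diaper"},
    {"cereal", "honey"}, {"honey"}, {"honey", "diaper"},
    {"cereal", "honey"}, {"honey", "bread"}, {"cereal"},
]
a, b = {"cereal", "water"}, {"honey"}
n = len(baskets)

count_a = sum(a <= bk for bk in baskets)         # baskets containing all of set a
count_ab = sum((a | b) <= bk for bk in baskets)  # baskets containing set a and b
count_b = sum(b <= bk for bk in baskets)         # baskets containing b

support = Fraction(count_a, n)            # 1/9
confidence = Fraction(count_ab, count_a)  # 1/1 = 1
lift = confidence / Fraction(count_b, n)  # 1 / (7/9) = 9/7, about 1.29
print(support, confidence, lift, float(lift))
```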
Now let's put this into practice. Create a new workflow (right-click, New Workflow) and name it market basket, or MBA for short, for market basket analysis. Then let's import the data set: the file is called transaction data, so double-click it and select the first sheet, with "table contains column names" ticked. Actually, let me open the Excel sheet itself; it's easier to explain there. Each row is one basket, that is, the specific set of items a consumer purchased in one transaction. The numbers in each row are item codes, like SKUs or bar codes from a grocery store: 56 could mean cereal, 895 could mean water, and so on. Some baskets have more items, some have fewer, but each record is one basket, and the whole data set is a collection of such baskets.

So let's import the data set, right-click and configure (it should already be configured), execute, and look at the output table. That looks good. Before we pass it through the algorithm that creates the market basket rules, though, we need to convert it to a specific format. Right now each basket sits in a single cell as plain text, but the rule learner needs each basket as a set, which you'll recognize by its square-bracket format. There is a node that does exactly this, called Cell Splitter. Connect it, right-click, and configure; it needs two inputs. First, the delimiter: we must tell it how the values within a cell are separated, and here they are separated by a space, so click inside the delimiter box and press the space bar exactly once; if you press it more than once, the values will not be recognized. Second, of the three output options, select the middle one, "as set", because the market basket node needs a set. Click OK, right-click, execute, and look at the output table: each basket should now appear in square brackets with the items separated by commas. If you're following along and don't see commas, you probably got the delimiter wrong, so go back and make sure you pressed the space bar exactly once.
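If you'd rather follow along in Python, the import-and-split step is a couple of pandas calls; a rough sketch (the file name and column name here are placeholders, not the actual names in the course files):

    import pandas as pd

    # Each row holds one basket as a space-separated string of item codes
    df = pd.read_excel("transaction_data.xlsx")         # hypothetical file name
    transactions = df["items"].str.split(" ").tolist()  # hypothetical column name
    # e.g. transactions[0] -> ['56', '895', ...]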
Next we are ready to run this through the analysis module. For market basket analysis there is a very specific node called the Association Rule Learner; start typing "association rule" and drag it in. Connect the Cell Splitter to it, right-click, and configure. This is where the metrics we discussed earlier come in, because this dialog is where we set our thresholds. I'm going to set the minimum support to 0.005 and the minimum confidence to 0.4. You might wonder why 0.005 and 0.4; there is nothing magical about them. This is an iterative process: you start with a restrictive threshold and see whether you get enough rules. I didn't get enough rules at first, and I wanted plenty of rules to play with later, so I relaxed the thresholds. It is a balance between getting more volume and keeping accuracy, and the right setting changes with the business scenario and the individual data set. You will get a feel for it with practice, and more importantly with iteration: generally, start with a high minimum support, say 0.9, and if you are not getting enough rules, slowly reduce the thresholds.

Let's click OK, execute, and look at the output: right-click and open the association rules table. Notice how many records come back, about 16,000 (16,123, to be precise), and each row is one rule. A rule like "187, 80 implies 177" means exactly what it did in our analogy: {187, 80} is set A (cereal and honey, say) and 177 is B (water). As you scroll you can see it has created rules of different sizes, with three, four, or five items, depending on the threshold limits we set, and for every rule it reports the lift, the support, and the confidence, along with the recommended item and the individual basket items.
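Outside KNIME, the same mining step is commonly done with the mlxtend library in Python; a sketch continuing from the transactions list above (mlxtend is my substitution here, not a tool used in the course):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encode the baskets: one column per item, True if present
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)

    # Same thresholds as the KNIME node: min support 0.005, min confidence 0.4
    frequent = apriori(onehot, min_support=0.005, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.4)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])

Each row of rules is one rule, just like the KNIME output: the antecedents are set A, the consequent is B, and support, confidence, and lift come along for free.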
Great. Now let's try to make sense of it. The next thing I want to know is what 177 or 192 really mean, because ultimately that's what you want: you can't just tell a customer "you're buying item one", because what does item one mean? Is it milk? Is it cereal? So let's connect this to another data set and get a proper description of these products, and also a price, because this is how the final product is going to work: if a consumer buys milk and honey, I want to say, "Hey, your basket of milk and honey comes to 10 rupees; would you also like to buy water for 2 more rupees?" To formulate that output we need two more things, the price and the description. Under the data sets you should see one called master data: it has a price and a product name for each product number. To make our lives easier I have formatted it so the product number is available both as a number and as text; depending on what a node needs, we can use either one without extra conversions. Let's import this file as the master data: select sheet 1, tick "table contains column names in row number 1", refresh, click OK, and execute.

Now I'll add a Joiner, because we need to join the two tables: the rule learner's output goes into the first port and the master data into the second. To configure the join, click "add row": on the left side choose consequent, because that is the item being recommended, and on the right side choose product number in text format. Here is where the formatting helps: the little "S" next to it matches the "S" next to consequent, meaning both are text, which saves us a conversion step. Then, in the column selection that asks which fields to keep from each table, drop product number, because it is nothing but the consequent we already have. Click OK, right-click, execute, and look at the joined table. Now the recommendations have names: item 177 stands for peppermint, item 92 for nuts and prunes, item 117 for chocolate marshmallows, and so on, each with its price.

You might ask how we can do the same for the basket items; we are getting there, but first let's clean this up a little so we don't get confused. Instead of consequent we'll call the column recommended item number, the product name becomes recommended product name, and the price becomes recommended price, because remember we are about to bring in the basket items and the basket price as well, and we need to tell the recommendations apart from the basket. There is a node for this called Column Rename: drag it in, configure, double-click consequent, tick "change", and type the new name; do the same for price and product name. Right-click, execute, and open the renamed table: sure enough, it reads much better now, with recommended item number, recommended price, and recommended product name all in place.
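In pandas, the join and the renames map onto merge and rename; a sketch continuing from the mlxtend output (master-table column names are hypothetical):

    master = pd.read_excel("master_data.xlsx")  # hypothetical file name

    # mlxtend stores consequents as frozensets; with single-item consequents,
    # unpack the one item first so we can join on it
    rules["consequent"] = rules["consequents"].apply(lambda s: next(iter(s)))

    # Look up the recommended item's name and price, then rename for clarity
    rules = (rules
        .merge(master[["product_number", "product_name", "price"]],
               left_on="consequent", right_on="product_number")
        .drop(columns="product_number")
        .rename(columns={"product_name": "recommended_product_name",
                         "price": "recommended_price"}))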
Now let's take care of the basket items themselves. They are bundled together as a set, so to join them to the master data we need to break them out into individual values and then regroup them afterwards. The node for this is called Ungroup. What it does is split a group into separate rows, which means each rule gets duplicated, because ultimately I need all of the basket items, 187 and 80 and 83 and so on, in one single column; that is how I can map them to the master data and pick up the price and the product description. So drag in the Ungroup node (and let's keep the layout tidy) and configure it: ungroup based on items, which it has detected automatically. One important thing: make sure the option that removes the original collection column is NOT checked, because we still need the items set to group everything back together afterwards. Click OK, right-click, execute, and look at the data table.

Let me explain what has happened. In the previous node we had close to 16,000 rows; now it has expanded to about 47,000. Look at the top two rows: everything is the same except the items column, which has been broken up, 187 on one row and 80 on the next. They still belong to the same rule: the recommendation is still peppermint, the recommended price is still 0.858, the lift is still 1.963, and the confidence is still 0.47. It is not duplicating the rule so much as breaking the group into separate rows, which is why we end up with far more than 16,000 rows; remember that figure, 47,000.

Now it is easy: all we need to do is map this items column back to the master data and pick up the price and description. We already have the Excel reader for the master data, so add another Joiner: the top port gets the ungrouped rules and the bottom port gets the master data. Configure it: this time the left side is items, because those are the individual basket items, and the right side is again product number in text format. In the column selection, just as before, drop product number and its text-format twin. Click OK, execute, and look at the joined table. There you go: item 187 stands for quail eggs and item 80 for instant food products, each with its own price, 8.4 for the instant food products and 4 for the quail eggs. So the basket items are broken out and described; we have done what we set out to do. All that is left is to recombine these rows so we get back to the original 16,000-odd rules.
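KNIME's Ungroup corresponds to pandas' explode; a sketch of the same break-out-and-look-up step (same hypothetical column names as before):

    # One row per basket item; all other rule columns are repeated as-is
    exploded = (rules
        .assign(item=rules["antecedents"].apply(list))
        .explode("item")
        .reset_index()                        # keep the rule id for regrouping
        .rename(columns={"index": "rule_id"}))

    # Look up each basket item's name and price in the master data
    exploded = (exploded
        .merge(master[["product_number", "product_name", "price"]],
               left_on="item", right_on="product_number")
        .drop(columns="product_number"))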
To recombine, we do the opposite of Ungroup, which is Group By. Drag it in, right-click, configure. This part is important: we have to tell it what to group by, in other words, what unique key identifies each group. Let me switch back to the joined table, and in fact go one step further back, to the data set before it was ungrouped. Each row there is one rule, and a rule is made up of two things: the items in the basket and the recommended item number. That is why, when we regroup, we cannot use the basket items alone: chances are the same basket implies multiple recommended items. Going back to our gems example (by now you can tell this candy is a favourite of mine): one rule could be "if I buy cereal and honey, I end up buying water"; that is one row. But another consumer with the same cereal-and-honey basket might have bought, say, diapers instead. Both have happened in the past, so both are legitimate rules, and buying cereal and honey does not always mean buying water. Which rule should you prefer? Hold on to that thought; that is where the lift metric comes in. The point is that these are separate rules, so to get back to the original 16,000 we must group by the basket items AND the recommended item number together; grouping by the basket alone would collapse distinct rules into one.

Now let's set how each column is aggregated. Support, confidence, and lift are all set to mean. Why mean? Because within one regrouped rule every row carries exactly the same values (both rows had support 0.005, the same confidence, and so on), so the mean simply gives that value back. The "implies" column is just a dummy, the arrow in the rule; we could have dropped it. The items column holds the actual item numbers, and to bring it back to the "80, 83" format the aggregation is concatenate. The recommended price can be mean and the recommended product name can be first, since they are identical within a rule. The price column, though, now holds the price of each individual basket item, so the right aggregation is sum: adding the prices of the two basket items, 8.4 and 5.2 in this example, gives me the total basket price. The product name gets concatenated as well, so that alongside "80, 83" we also get the actual descriptions.
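In pandas the regrouping is a groupby with mixed aggregations; a sketch using the rule_id we carried along (in KNIME the grouping key is the basket-items set plus the recommended item, which the preserved rule id stands in for here):

    regrouped = (exploded
        .groupby("rule_id", as_index=False)
        .agg(support=("support", "mean"),            # identical within a rule
             confidence=("confidence", "mean"),
             lift=("lift", "mean"),
             basket_items=("item", ", ".join),        # back to "80, 83" form
             basket_price=("price", "sum"),           # total basket price
             basket_products=("product_name", ", ".join),
             recommended_product=("recommended_product_name", "first"),
             recommended_price=("recommended_price", "mean")))

    # Spot check, exactly as in KNIME: one row per original rule
    assert len(regrouped) == len(rules)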
Take a minute to pause and check your data set if you are not clear why we chose each aggregation, because this matters: get the aggregations wrong and you will get incorrect results. For the column naming, choose "keep original names", since we don't want the concatenated names. Execute, and look at the grouped table. One quick spot check: make sure you get back the original 16,123 rows, the same count we had when the rules were first created; that tells you the grouping is correct. Everything else looks good: we have the product names back, the item numbers, and so on.

Let's do one more Column Rename so the end product is easy to read, because this price is really the basket price and this product name really lists the basket items. Configure it: items becomes basket item number, price becomes basket price, and product name becomes basket product name. Click OK, execute, and look at the renamed table; sure enough, this reads better, with the total basket price, the recommended product, and so on. So how would you read one of these rules? "If a consumer buys chocolate marshmallow and best white tea, for a total basket price of about 19 rupees, we recommend they also buy frozen fruits at a price of 0.425." And how confident we are of that recommendation landing is what the lift measures.

Now let's see how this works in action by bringing in a new simulation. We have a file called new items; it is just one basket, a scenario where a consumer is performing a brand-new transaction and adding these items one at a time. Import it with an Excel reader, configure, and execute; the output table is there. Think about what we need to do next: we have to reformat this basket the same way as our original input, because ultimately we need to look it up against the rules. First, group the data. You might think we should jump straight to creating sets, but think again: before we created the sets, our initial data was already one line per basket, whereas here all of these rows form a single basket. So we need one row per basket; since ours is just one basket, we will end up with one row. Use Group By with the basket number as the group, manual aggregation set to concatenate (all the items should be concatenated into one row), and "keep original names" for the column naming. Click OK, execute, and look at the grouped table: there we go, basket A contains this set of items. Very good.
Now the Cell Splitter again, to convert that row into a set. Note that the delimiter this time should be a comma, because our concatenation produced commas, and again choose "as set". Click OK, execute, and look at the output table: sure enough, it is a set.

The next step is to match this against the rule learner's output. There is a node called Subset Matcher, which essentially does the job of a lookup. Drag it in; it needs two inputs: the new data set we are matching, and the table with all the rules, the final one from before. In the configuration there is nothing to choose for the subset column (there is only one option); the ID column on the new-items side is the basket number, since each basket is one unique record; and there is nothing to choose for the collection column either. Execute, and look at the matching sets: for each basket it lists the rule antecedents that match, in other words all the set A's contained in this basket, such as {123, 134} for one rule and {123, 80} for another.

We need one more step before this makes sense: we have the matching rules, but we still need what those rules recommend. So add a final Joiner between the subset matcher output and the rules table, joining on items on both sides, because that identifies each rule. Click OK, execute, and look at the joined table: this is our recommendation output. For basket A it has produced six recommendations. How do we read them? Take the second one: a customer buys specialty fat and instant food products, so I say, "Hey, your total basket comes to 69 rupees; would you also like a chocolate marshmallow for 18 rupees?" That is what the rule tells us, a recommendation based on those two items in the basket.

Now notice that some baskets trigger rules with more than one recommendation. Take meat spreads and peanuts: "Hey customer, your basket comes to 12 rupees 24 paise; would you like a chocolate marshmallow for 18 rupees, or strawberry jelly for 0.75 rupees?" Which would I recommend first? This is where our lift metric comes into play: whichever rule has the higher lift gets recommended first.
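What the Subset Matcher is doing amounts to checking, for each rule, whether its antecedent set is contained in the new basket; a plain-Python sketch of that lookup (the rule values below are invented for illustration):

    # Each rule: (antecedent set, recommended item, its price, lift)
    rules = [
        ({"meat spreads", "peanuts"}, "strawberry jelly",       0.75, 2.10),
        ({"meat spreads", "peanuts"}, "chocolate marshmallow", 18.00, 1.60),
        ({"cereal", "water"},         "honey",                  1.20, 1.29),
    ]
    basket = {"meat spreads", "peanuts", "specialty fat"}

    # Keep the rules whose antecedent is a subset of the basket, best lift first
    matches = sorted((r for r in rules if r[0] <= basket),
                     key=lambda r: r[3], reverse=True)
    for antecedent, item, price, lift in matches:
        print(f"would you also like {item} at {price}? (lift {lift})")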
The way this would work in real life: as people put meat spreads and peanuts into their cart, I would say, "Hey, would you also like strawberry jelly at 0.75?", because that option gives me the higher chance of a purchase. The other way to implement it, similar to Flipkart's or Amazon's recommendations, is: "Since you bought meat spreads and peanuts, consumers like you have also purchased strawberry jelly, and chocolate marshmallow, ..." ordered by decreasing lift, highest first. If you scroll across and a rule yields, say, 19 recommendations, you simply sort them from high lift to low lift. That is how you take a sample data set, implement it, and come up with recommendations; hopefully you now have a better idea of how to not just run the algorithms but translate the output and think about it from a business perspective.

Now let's talk about one of the most important metrics in marketing: CLV, which stands for customer lifetime value. I will go through the mechanics of calculating it; the math is straightforward, simple algebra and arithmetic, but the implications are very powerful. What does CLV really mean? As I mentioned when we covered terminology, CLV is a measure of how valuable you are as a consumer to me as a business, based not on one transaction but on the whole lifetime of your association with me. Evaluating you on a single transaction would not be accurate, because you have potential: maybe you come back next month, and the month after, and keep buying. Your net worth to me as a consumer is exactly what CLV captures: I want to predict what you are going to buy, how much, and what that is worth. If you keep buying Cadbury gems like this (and if I were Cadbury I would be delighted; you probably have a terrible sweet tooth, like me), I would rather reward you with promotions than someone who buys once and disappears. That is the idea behind CLV: the worth of a consumer is the sum of all the profit they give me throughout the lifetime of the relationship.

The formula has two components: a long-term multiplier and a short-term margin. Capital M is the net margin. Capital R is the retention cost, meaning everything you spend to retain a customer. And d is a discount factor; think of it as something like the WACC, the weighted average cost of capital, in financial terms.
Why discount at all? Suppose I give you two options: 100 rupees to spend today, or 100 rupees one year from now. You would probably take it now, because 100 rupees today is worth more than 100 rupees a year from now. That is what the discount does: it restates future transactions in terms of the current value of money. The concept is very similar to NPV, net present value: NPV accounts for interest rates, inflation, and so on, and produces a factor that discounts future money into today's money. CLV uses the same idea: since we are adding up potential future profits, the discount factor tones them down so we stay honest that 100 rupees next year is not 100 rupees now. In most cases the discount factor consists mainly of the interest component, plus a little more for the cost of capital and inflation, so it sits slightly above the interest rate alone. Finally, small r is the retention rate, expressed as a percentage: the rate at which your consumers are being retained.

Those are the four things you need to calculate CLV: M, R, d, and r. To see how it is used in a simple real-life scenario, say you are a Netflix customer and I predict you will stay with me for 20 months. I have calculated that the margin is 50, but I also incur marketing costs of 10, so 50 minus 10 gives a net margin of 40. To estimate the profit you give me over the lifetime of the association, I multiply 40 by 20, which gives me 800.
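That back-of-the-envelope version, net margin times expected lifetime, is trivial to write down; a one-line sketch with the numbers above:

    monthly_margin = 50 - 10        # margin minus marketing cost
    expected_months = 20            # predicted lifetime with the service
    print(monthly_margin * expected_months)   # 800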
That is the basic concept: a short-term margin scaled by a long-term multiplier. Now let's apply the full formula in a realistic scenario with actual numbers. Here is the setup: Netflix charges 19.95 dollars per month, there are variable costs and marketing spending, and you are given an attrition rate and a discount rate; calculate the CLV. Let's do it in a blank Excel sheet, entering the facts first. For scenario one, revenue is 19.95 dollars per month and variable cost is 1.50 dollars per account per month, so the gross margin is the difference between the two. Marketing spend is 6 dollars per year; since money spent on marketing here is money spent to keep a customer on the platform, I will treat it as retention spending, and converting to a monthly figure it is 6 divided by 12, or 0.50 dollars per month. The discount rate is 1 percent per month, and attrition is 0.5 percent.

Recall the formula: CLV = (M − R) × (1 + d) / (1 + d − r). There are two traps here. First, we have the margin M, the retention cost R, and the discount rate d, but do we have the retention rate? We were given attrition, and the retention rate, small r, is nothing but 1 minus the attrition; that is a trap you do not want to fall into. Second, make sure all measures are comparable, apples to apples: convert everything to either a yearly or a monthly basis so the time periods match.

Once that is done it is just plugging in. Let's break it into pieces so mistakes are easy to catch: compute M − R, then 1 + d, then 1 + d − r (take the 1 + d we just computed and subtract the retention rate). Finally CLV = (M − R) × (1 + d) / (1 + d − r), and the actual number works out to about 1,208 dollars.

Now scenario two; I will copy everything down and change just a couple of things: retention spending gets cut from 6 dollars a year to 3, and attrition goes up from 0.5 percent to 1 percent.
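The same calculation in a few lines of Python, covering both scenarios (the function name is mine; the inputs are the ones from the exercise):

    def clv(gross_margin, retention_cost, d, attrition):
        r = 1.0 - attrition               # the trap: retention = 1 - attrition
        m = gross_margin - retention_cost # net monthly margin, M - R
        return m * (1 + d) / (1 + d - r)

    # Scenario 1: $19.95 revenue, $1.50 variable cost, $6/yr retention spend,
    # 1%/month discount rate, 0.5%/month attrition
    print(clv(19.95 - 1.50, 6 / 12, 0.01, 0.005))   # ~1208.6

    # Scenario 2: retention spend halved to $3/yr, attrition doubled to 1%
    print(clv(19.95 - 1.50, 3 / 12, 0.01, 0.01))    # ~919.1, noticeably lower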
Let me colour-code the values that changed: the retention spending changed, and attrition changed, so the retention rate changes too, from 99.5 percent to 99 percent, and you get a new CLV. Obviously this change would not make sense to implement: you would be getting less from each consumer. More importantly, what does a number like 1,208 dollars mean in a practical sense? Take a step back: it is the profit you can expect from one customer throughout the lifetime of their association with Netflix.

So how do we use CLV? Say your costs are rising and you want to make sure that whatever you spend to attract one customer is justified: CLV serves as the benchmark. You should not spend more to acquire a customer than you can expect to earn from them; it is a cost-benefit analysis. That is one simple but very powerful use: it tells you how aggressively you can go after a new customer, and, at the other end of the spectrum, how much you can spend to keep an existing one before it becomes acceptable to lose them. It works both ways, acquisition and retention.

Another place CLV gets used is with startups and companies that are not publicly listed. When venture capitalists evaluate a company and how much they are willing to pay for it, they do some version of a CLV calculation, even if they never call it that. Some of the most important due-diligence questions are about your customers: what is the churn rate, how much will customers bring in, what is a customer's expected lifetime? From those, a back-of-the-envelope CLV gives a sense of the company's net worth. That is partly why you see relatively small companies being acquired by bigger ones at five or ten times what the stock market thinks they are worth: the buyers are valuing the potential lifetime value the customers bring in.

One caveat: the formula we used applies when there is a contract and the customer pays up front, like Netflix or Amazon Prime. What if the customer pays for the service after the fact, like credit cards or bank fees, which we pay at the end of the period rather than in advance? Then the formula changes slightly, to CLV = (M − R) × r / (1 + d − r).
The way we get there is straightforward. The only difference is that the money arrives after the fact, so you subtract one period's margin, M − R, from the up-front formula. You might not trust me on that, which is perfectly fine, no offence taken, so let's work it out; it is a nice recap of high-school algebra. Start from

(M − R)(1 + d) / (1 + d − r) − (M − R).

Put both terms over the common denominator (1 + d − r):

[(M − R)(1 + d) − (M − R)(1 + d − r)] / (1 + d − r)
= (M − R)[(1 + d) − (1 + d − r)] / (1 + d − r)
= (M − R) r / (1 + d − r).

The 1 + d terms cancel and only r survives in the numerator, and we are back at the after-the-fact formula. So even if you did not believe me, there it is; use whichever of the two formulas matches your scenario.

That is it for CLV, but like I said, think about applications everywhere. The hardest value in this formula to move, and the single value that impacts CLV the most, is small r, the retention rate. As an exercise, take the same example we just did and run a sensitivity analysis: increase each input by 10 percent and see which one moves the needle most. How easy is it to achieve a 10 percent revenue increase? Probably not easy. A 10 percent change in the discount rate? You have no control over that. A 10 percent improvement in margin? Not much either, because beyond a point your operations mature and there is only so much cost you can squeeze out. A 10 percent increase in the retention rate, though, is very much under your control. Play with 5 and 10 percent changes and you will probably find the answer is retention rate, with revenue the only real rival; and while there is a ceiling on revenue growth and cost cutting, retention is the lever you can keep pulling. That is why companies never want to let go of you as a consumer.
As long as you are in the system, the company knows it can earn from you later; it does not matter what you spend right now, because the goal is to stretch your lifetime value out as far as possible. That is why companies keep running offers and loyalty cards: they want you in their system as long as possible, because the minute you are out, it is much harder to win you back. And that is the concept of CLV.