SLICED Competition Lap 2: Live Screencast

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I'm going to be live streaming my experience in the SLICED competition. SLICED is a great machine learning competition; there should be a link in the show notes. We're going to compete against three other contestants to analyze a dataset. I think we're on episode seven.

I just got this dataset sent to me five minutes ago and I've been exploring it. It looks like this will be a really fun one: I'm predicting whether a bank customer churned, based on a couple of factors like their age, gender, education level, income category, and number of relationships. It doesn't actually tell me what the data is about, so I don't know what relationships and things like that mean. I've done churn analysis before in various jobs, so I'm pretty excited about applying it here.

I can't code until 9 p.m. Eastern, so it's about six more minutes. I'm just going to be taking notes for now; they've been allowing me to take notes.

So: the id and the response. The three categorical variables are gender, education level, and income category. Education level and income category are kind of ordinal. Education level is something like uneducated, high school, college, graduate, post-graduate, and there is also unknown; income category too. So those are some we might want to parse some data out of. Let's say I turn them into an ordinal, and maybe we'll get a better model out of that. Then we have things like total relationships, months
inactive, that are numeric. They're not normal, but I don't need to log them or anything. Credit limit I could log, since on a linear scale there's a lot of distance between the max and the minimum. I'm just going to take a quick look at some of these before we start coding; it'll be five minutes.

Some thoughts on strategy. One: if linear, log credit limit, maybe others. I don't know, do I really want a linear model? Maybe I do, to get some interpretation out. But I'd probably just start with some scatter plots and stuff.

One thing you're going to notice is: create a summarize_attrition function early on. That's what I really like to do. I like to create a verb that says "group by this and summarize". That'll let me analyze things in terms of bucketed customer age, or by gender.

So let's actually note down some EDA. We're going to want to know the relationship between age and attrition; that's a summarize and a line plot. A lot of these will be line plots rather than scatter plots, since we have this outcome. Education and attrition, income and attrition: bar plots. Total relationships and attrition: line plot. I'm not adding a lot by looking into these. And months inactive. Actually, I'm just thinking about these: relationships and months inactive can be done as a line plot, and then the bucketed line plot, where I'm actually going to use cut(), is going to be the relation between... sorry, no, I'm doing this wrong. Credit limit. So credit limit, utilization ratio... oh, there are the rest of the
columns; I wasn't looking at all of them. Credit limit, total revolving balance, age, total amount change. Oh, this is how much it changed from Q4 to Q1, so kind of like a trend. I really want a lot of line plots here, where I'm going to ask what the relation is. I can also use a density plot. Actually, let's do a density plot for my large numeric ones; I'll just do it for all numerics, comparing attrition and not, and then it'll kind of tell us some of the places we can look. I'll use AUC of attrition versus all predictors.

I'm just looking around. Average utilization ratio maybe is log plus some small offset. A lot of things I can do. With all these numerics, I'm excited to try an xgboost on these. Attrition flag is going to need to change to churned, yes/no; that's going to be a lot easier to work with. The word attrition is too long; I'll say churn. So we're going to see some of these relationships, and then I want to dig into some of these variables a little more.

All right, here we go, and let's go. library(tidyverse), library(scales). What else do I use? No text in this one. I might use stacks. Then theme_set(theme_light()), and I always do doParallel, registerDoParallel, cores equals four.

Then let's read in our data. So we'll do read_csv from my downloads folder, here we go: train, which I actually call dataset, and holdout. Look at me, I'm nervous today. I'm so excited to be here with you, to be coding. Then we'll say split is initial_split of dataset, train is training of split, and test
is testing of split, and I'm going to do set.seed(2021), as is my tradition. In fact I also like to do train five-fold: train, then vfold_cv with five folds. And I do it all in the same place.

Okay, now we can take our training data and work with it a little bit. I'm actually going to start up top: mutate churned is if_else attrition_flag (I keep writing "nutrition" flag) is one, then yes, else no. Run through there, and I'll just select. Okay, there we go.

Then we do summarize_churn, just because it's easier for me to write: take the table and summarize n is n() (I always like this n), n_churned is sum of churned is yes, and pct_churned. I did something really similar in my last lap, so I'd like to apply it here. Then train, summarize_churn.

There's an approach I like here, which I'll do again as I did last time: low is qbeta of 0.025, n_churned plus 0.5, n minus n_churned plus 0.5; high is qbeta of 0.975, n_churned plus 0.5, n minus n_churned plus 0.5. That's the stuff, and now I've got my low and high. I want to do this because now I can group by gender, and now I've got my stats and start seeing things like: okay, the churn rates actually do differ. I can plot pct_churned and gender with error bars, and I don't like how tall that is: height equals 0.2. Let's go from here.

What I might do is write a plot_categorical function of the data and a category. And there's a great trick here, tidy eval, where I can say, is it the double curly, {{ }}? I believe it is, yes. Then all the rest stays the same, except... oh, I forgot to replace gender. Really I can do category. Why am I doing it this way? Because now, you'll see, I'll be able to do other summaries. Oh, that doesn't look good until
I have scale_x_continuous, labels equals percent. See, with this one I want to get a quick machine for being able to say: okay, I've got gender, now I've got education. And this one is not going to look as good because it's going to be alphabetical, and I might want to start by sorting it. So I can do mutate; I think it's category is fct_reorder of category by pct_churned. I think this is right; let's find out. Yes. You know what's funny is it's not perfectly ordered: doctorate and post-graduate have the highest churn rate; college, uneducated, and high school are all lower.

Now let's also look at age. There are a few ways we can look at it. Actually, let's do income level. Actually, let's throw in gender and try fill equals gender. But to do this I'm going to need position equals position_dodge here. Oh, I see, my color red really got in the way here. It's not actually a perfect graph. Right: fill and color, and in the errorbarh I'm going to need to set position too. I don't know if this is going to work, and it's okay for me if it doesn't. Position it is, but it's not wide enough: width equals 0.5. I'm not sure what the exact width is; width equals 1. I changed my mind on the color. Let me see, I'm going to need a group. There we go.

So is there an interaction term? I don't think there's an interaction term. All right, but let's keep moving. Income category: let's reorder that appropriately. So I'm going to try, let's actually do that, we're going to
do that up here. Here's what I'll do: I'll do fct_reorder... oh, it's going to reorder it in the innards of plot_categorical. So: pull tbl, category; if it's not a factor, then tbl gets the reordering. I really just want to have this general function. I don't use it a lot; it's just nice.

So now I'll say mutate income_category is fct_reorder. Here's something fun: I'm going to reorder income_category by parse_number of income_category, and then I think it's good. Oh, did I spell one of them wrong? Object income_category not found. Oh, here it is: small bug. So I tried reordering them; did it work? pull income_category. Yeah, it does have them in an order, but I also need to put the less-than level first: fct_relevel income_category, "less than 40". Okay, I need to put that in its own place. Yeah, that's going to be better.

Here you go: maybe there's a trend; it's pretty hard to say. What if I coord_flip it? Yeah, you know, I like this one more. What if instead of a bar plot... no, I'm going to keep it as a bar plot because it's categorical, even though I kind of like the line plot.

Income category, and now education level. We don't actually have a better trick than: count education_level, then do fct_relevel on education level and give it a vector where we say unknown, uneducated, college... nope, high school. I'll create this up here. I'm just sorting them intuitively: uneducated, high school, college, graduate, post-graduate, doctorate. I don't
know if post-graduate is more than doctorate or not. I think that's right, but we'll see. And we say: how does churn risk differ by education level.

I want to do education level early on, so I'm actually going to do it in a process_data step, and I'm going to do income category there too. Why am I doing it up here? So I can do it to both the training and the test sets at the same time. I could do it in recipes, but I need this processing to happen to everybody. And I did something wrong; let's see what I did wrong. Education category is actually education_level. Yeah, looking pretty good. Unknown, that's going to work out, I think.

So here we go: now if I do plot_categorical, I guess unknown at the top is not so bad. You see it drops and then it goes back; the middle has the lowest churn risk. And you actually see something a little bit similar with income category. That makes me think I'm going to coord_flip. Unknown is not actually the bottom level, but let's put it at the bottom. And let's coord_flip this income one too. How does churn risk differ by... so here's income category. Oh, where are my labels? Oh right, the labels get flipped too. That's fun. Here we go: income category, just going to swap y and x. No, this is still right. There we go.

These both show the same thing. I really wanted to get some of these graphs down. Unknown, high school, college, graduate: not a lot there; maybe a little higher among doctorates. Hard to say, but that's going to be for the model to decide. So those are the categoricals. We
have one more that I want to see, which was age. I did want to see: group by age is cut of age into, let's say... what was age? Less than 40, so like 20 to 40, to 50, to 60, to 70. Does that miss any? plot_categorical of customer_age. Yes, we see... not a ton.

Let's look at all the numerics, and this is a nice trick. What I can do is say train, select... actually it's everything except for those pieces. I'll say select churned and customer_age through the rest of the numeric columns, and I'm in good shape. Gather metric, value, everything but churned, and facet_wrap by metric. This feels odd: it's not crazy for it to be large, but I didn't expect it to be quite that large. Oh, of course, I need scales equals free in the facet. And this is just a nice graph where I can say some of them don't need to be logged, some of them do.

Look at total revolving balance: look at that difference. Some of them, like age, don't really matter. Total transaction count, total transaction amount: some of them really do. Log scale, alpha equals 0.5, and a title. This is a place to start when we're building a model. And there's an alternative here. I'm trying to make it cleaner: churned is str_to_title of churned. And I'm going to try one other approach: we're going to say gathered, rank is rank of value, grouped by metric. That'll mean they would be uniform. Drop the log scale. I think it's called percent_rank. One sec. Yeah, we can get a sense of what the difference is between them, and
between zero and one. It's a little wacky to do it on a density plot, but with some things we can just see they're very, very different.

So let's actually start building models. I actually really like this dataset. Here's what we'll do: let's do an xgboost, because we have a lot of numeric stuff. So: xgboost on the ones that are clearly predictive or low cardinality. What I'll do is say rec is recipe of churned explained by gender plus average utilization ratio. (I'm going to go back to the log scale on the plot; I like it a little bit more, it's a little more intuitive.) Plus total transaction amount, and data equals train. Then step_dummy on all_nominal_predictors. Do we have any missing data? Nope. That's pretty cool, actually.

So I'm going to start with just these. Months inactive; I can actually see some relationship with credit limit. Months inactive 12 months; what is it called? months_inactive_12_mon. All right.

Then I'd like to do a couple of things: mset is metric_set of mn_log_loss, and I need control: control_grid, save_workflow true, save_pred true, and extract equals extract_model of .x. I like to grab these things.

And now I can actually create the workflow in one step. boost_tree, classification, trees equals seq, 200 to 1000 by 50. I've learned recently that a learn rate of 0.02 maybe trains a little faster. I may not need a thousand; I'm going to do 600. And mtry: there's one, there's
one, two, three, four, five, six, seven, eight predictors, so I'm going to try, I don't know, three, five, seven. xg_tune is the workflow, tune_grid, train five-fold, metrics equals mset, control equals control. I always do this thing, have you noticed? I need to do tune: mtry is tune(), trees is tune(). Over here is where I do crossing. Oh, I do not need learn rate in the grid because I'm not tuning it. There we go: training our first model.

Well, I haven't paid attention to things like how big these datasets are. I'm going to open up a second project where I'm going to do EDA. And here we go. Seven is a little worse; three is kind of beating them. Oh, oops, I meant to do 200. So we're already at a good starting place. Three, four, five. All right, so 0.111 is the best we've been able to do yet. I'm going to try tree depth pretty soon, I'm going to try adding in more things, and I'm going to be doing an importance plot.

So look at that: three is kind of the best, and that makes me want to also try mtry of two, and also try different learn rates. It's worth experimenting a little bit with this. Five is clearly worse; I'm going to drop five out of the equation. This is only on a subset of the predictors; most of them are numeric except for gender.

Let me open up my second one. I've got the same approach in the early setup, except without any parallelization, so I don't use the cores I'm already using over here. In that one I'll do EDA. And in this one I doubled the number: I'm doing 0.01 and 0.02 for my
learn rate and trying a few things. About 0.11. I'll need more trees at 0.01, so I don't love that; I'll stick around 0.02 for now. Wow, two is doing about as well. I'm going to increase this a little bit.

Let's try all numeric, and turn the categoricals into ordinals. And let's note: what was the log loss? CV log loss about 0.11.

What I want to do is say train, and step_dummy, and step_mutate: income_category is as.integer of income_category, except I want unknown to be NA. I need to move this two steps later. I want to bake this to make sure I'm in good shape: prep, juice. That lets me have some missing data, and that missing data makes me feel good, because I've got numbers there. You see everything is numeric now except for churned and gender, male/female. Okay, so everything is numeric. Then I do need one extra step where I say step_impute_mean on all_numeric_predictors. And I'm going to try a whole bunch of levels.

All right, so we're 34 minutes in. We're training a model on all the data, with income and education turned into ordinal variables and unknown imputed using the mean. I'm not sure; that might not be perfect. Maybe unknown is more similar to one side or the other; that's kind of plausible looking at it. I should revisit that choice. I don't know that they're going to turn out to be that important. But yeah, that's my factor
to ordinal situation. I love that I can tune all these tree parameters and try different learning rates. It's very powerful in tidymodels. So: CV log loss on around seven predictors, 0.111.

All right, let's set up the code for fitting the full model. One thing I can see here is that two is not going to be good when we have that many predictors. Six seems to be the best, and oh wow, we easily beat the other ones we already had. So with all predictors, around 0.087.

Let's try that on testing. What I do is take the workflow, finalize_workflow with select_best of xg_tune, which by the way will be six. We can see it's leveling off around there. I'll start with that. Then fit on the full train data, augment on the test, and mn_log_loss. Oh, it's estimate and truth. Nope, it's not .pred_class that I want. I need to remember: .pred_yes. And churned. Churned is a character; how did that happen? I wonder if that makes a difference anywhere else. I'm doing it backwards. Ha, there we go: on the test set, 0.0847. Cool.

And let's look at the importances. I learned recently a trick (thanks so much; I'm sorry, I'm blanking on their name): extract_fit_parsnip. That's cool. And then xgboost, xgb.importance. The trouble is the fit doesn't get it all the way; I'm still sort of learning my way around this. I want the actual fit. Oh, engine: the engine-specific fit. Okay, I think that's right. There we go. So I think we actually saw this early: transaction amount,
transaction count, revolving balance: those matter a lot. And things like income category and education, which we turned into ordinals. And id. I'm so sorry, I never figured out quite how that works; I'm sure somebody can point it out later. It's not important, of course, or hopefully not, but I included it anyway. So the categoricals ended up not mattering very much. I could leave them in, but I could also narrow them down and possibly improve it. I would probably cut the line here at months inactive 12 months and skip the categorical ones. Age still matters, and a lot of these ones like transaction amount and transaction count still do matter. I'm going to be doing some scatter plots on those in a minute, so I'm going to add some EDA to-dos.

Here we go. It's still training. I didn't have to start with two; I could have narrowed this set down a little. I'm going to narrow it down because I saw the best was around six, so I'm going to do three, four, five, six, seven... no, four, five, six, seven, eight. It'll take a while, but it'll presumably be worth it. Now at least I've gotten rid of id, and I might get rid of the other ones too, the ones that are at the absolute bottom. I'm not going to drop gender because it's a smidgen more important, but I'm definitely removing the id.

Okay, let's look at some relationships. For instance, total transaction amount and total transaction count: those are going to be extremely correlated, but it seems like those are the two most important. So let's spend a
minute looking at total transaction count and total transaction amount. Oh, look at these levels and stuff. That's kind of interesting, or is it? Let's do color equals churned. (Still going, still going.) And I'll do alpha. Look at that: it kind of doesn't look like there's an interaction between them.

I assume that's dollars; I'm going to quickly check total transaction amount. Well, it just says value, but I'm going to assume that it's dollars and say labels equals dollar_format. And let's do geom_smooth; maybe a little bit slow. It's not too slow. Look at this total transaction amount among the people that churn at higher transaction counts. So let's actually do this as an average: transaction amount over transaction count. Do they already have an average? If so, I'm doing this work for nothing, but I don't think they did. So I'm doing average transaction amount, and this is more straightforward: for most of it, it's a little bit higher. I think instead of total transaction amount I might use the average, or I could use both of them and see how that works.

So what did I get? 0.0847. I dropped that id, but it's pretty similar. Oh, and let's look at these. Oh wow: when I dropped those out, four became the best. So I'm going to try three through six (seven is the worst), learn rate 0.015; kind of experimenting with this. And I might want to add average transaction amount. That means changing multiple things at once. Transaction amount, average utilization ratio, total amount change, months inactive 12 months. And let's see this total
relationships; that's cool how much that matters, though I have yet to plot this. All right. Here we go: three is doing better on the higher learning rate so far. Okay, and what was that best? 0.0853. That's pretty good. I'm going to throw in the one we just came up with, average transaction amount. I'll keep only the one learning rate for now. And let's quickly check on total transaction count. Oh yeah, we can see it's always at least 10. Okay, here we go; let's see if it can get any lower than 0.0853.

One thing is, I don't know what creative visualization I'm going to do today; maybe something like those faceted ones. It's funny: people that churned have lower transaction counts but then higher average transaction amounts. At high transaction counts, higher transaction amounts are more likely churners. That's so weird. I can explore this a different way and say train, group by total transaction count is cut of total transaction count: 0 to 30, 50, to 100, to infinity, basically the same way I had before. You notice most of them are 50 to 100. I'll get back to that in a second.

Yeah, three predictors; that's the minimum. Whenever the minimum does best, I always wonder, should I bring it down a notch? All right, let's stick with that: mtry three, 780 trees, learn rate 0.02. All right, that's similar to the model we had before,
maybe a tiny bit worse, but I'm all right with that. All right: fit_full is going to be the same thing but fit on the full dataset. I just want to check what the prediction is that we're supposed to have: it's the probability of attrition_flag. I'll make sure of that this time. Okay: fit_full, augment. Oh, I need to give it the right name, attrition_flag. Let me take a quick look before I submit to make sure I'm accurate about what's in the model: fit_full, extract_fit_parsnip. It's got average transaction amount, and no id, no anything else. So yep. Making sure I submit the right version.

Oh. All right, that feels too good. It feels way too low compared to the other ones I was looking at, but I guess it's very few observations. We don't pay too much attention to it. How much data is in the holdout anyway? About three thousand, so times 0.01, that's about 30 observations. It's not going to be tons. But at the very least I'm not getting it backwards or anything.

All right, let's keep it. Plan for the second hour: more tuning on learn rate and max depth. Cool. I'm going to get some water; a few more minutes while it's running.

All right, let's go back in. So here I'm just seeing: got these 12 features, and three is the best. Let's make sure we get this right. How did this... oh wait, is this just randomness, or did I change the workflow? It's randomness, all
right. Yeah, you know, because it's really hard to tell with randomness. Okay, so what we're going to do is start by bringing it down. Five and six seem to do really poorly, though maybe that'll change; they're doing poorly for now. Yeah, I'm messing around a little. I'm going to start just here and mess around with the learn rate a little, but it really does feel like the key is two, three, maybe four. Okay, so what I'm going to do is look at, oh yeah, low transaction count. And you know, I like this ribbon plot; I should turn it into a function. All right, summarize churn. Oh yeah, see, look at that. I think it looks pretty good. This is good. And look, I already have this, I'd forgotten this trick: group equals one. I need that to happen on this one too. All right, so total transaction count, bucketed. 30... yeah, most are 30, 40, 50, 60, which is not a lot of things. So it's like, oh wow, look at that relationship, that's pretty cool. And let's also group by average transaction amount, cut... what is the distribution of average transaction amount? Remind me what I'm up to while I go back to this plot. I think it does best... this is so close, so close. I might increase by like 2,000. But wow, these are close. They're pretty damn comparable, and they're all in kind of that same zone. So I'm just thinking for a minute on what to do with this. I'm going to increase the number because I have more time now, but I also really want to try some tree depth things. Tree depth, I think it's around eight or something like that, so it's like five, seven, nine. Well, I'm not doing very many depths: this is four, six, eight. And two is sort of worse. They're all pretty good. I'll try this. All right, give this a whirl, see how
we do. And yeah, so this is, like, the role of transaction amount... these are also really likely to churn. Really interesting. I'm going to keep this. Then I'm going to say, oh yeah, I was going to do geom_histogram. See, look: whenever you see something like this, you want to dichotomize it, or you can trichotomize it. So here's what I'm going to do: show 50, 110, 130. So I can do this. Oh man, it keeps running fast. And by tree depth... uh huh, we got a tiny benefit out of this. Best around six. Well, I don't know, with four and maybe more learning... I'm going to increase this a little bit, change this to five, six, and eight. Eight just overfits real fast. Eight overfits fast. Seven? [Music] I could try that out while I'm going. All right, so what I'm going to do is say xintercept is 50, 110, 130. Nope, geom... oh whoops. I don't like that 110; I like 100. So all right. The reason why I want that division is I wanted to cut it, so I'd say cut at 50... it's still running, good... so I'd say cut at 50, 100, 130, and then group equals... too many break points, because the 100 and 130 is too many. Yeah, this is still too many. So I'm going to try... yeah, look at that, that's kind of like an interaction term there. It's kind of an interesting plot. What we're going to do is change this to: amount is greater than 50. All right, oh yeah, there we go: greater than 50, less than 50.
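A minimal sketch of the bucketing step discussed above. The stream does this in R with cut() inside a group-by; this is a plain-Python illustration of the same idea, not the code from the screencast. The function names and the toy rows are hypothetical; only the break points (0, 30, 50, 100, infinity) come from the transcript.

```python
from bisect import bisect_left
from collections import defaultdict
import math

def cut(value, breaks):
    """Label the right-closed interval (lo, hi] containing value, like R's cut() defaults."""
    i = bisect_left(breaks, value)
    if i == 0 or i == len(breaks):
        return None  # outside the break points (R would return NA)
    return f"({breaks[i - 1]}, {breaks[i]}]"

def churn_rate_by_bucket(rows, breaks):
    """rows holds (total transaction count, churned as 0/1); returns bucket -> churn rate."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [number churned, number of customers]
    for count, churned in rows:
        bucket = cut(count, breaks)
        totals[bucket][0] += churned
        totals[bucket][1] += 1
    return {bucket: c / n for bucket, (c, n) in totals.items()}

# Hypothetical toy data, not taken from the competition data set.
rows = [(20, 1), (25, 1), (40, 1), (45, 0), (60, 0), (75, 0), (90, 1), (120, 0)]
breaks = [0, 30, 50, 100, math.inf]
rates = churn_rate_by_bucket(rows, breaks)
for bucket, rate in rates.items():
    print(bucket, round(rate, 2))
```

Right-closed intervals mean a customer with exactly 50 transactions lands in the 30 to 50 bucket, which is worth keeping in mind when a break point sits on a common value.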
Yeah, so what this says is that if you have a moderate number of transactions... idea: customer segments. Low number of... let me jump back to that in a second. Oh, this got a lot better. How did that get so much better? Oh, I see. Okay, it got a lot better. What is the optimum, is it on six? It's learn rate 0.02, tree depth six. Simple. I thought I would have tried that. I'm worried this is me getting lucky on one. Yeah, this could absolutely be luck. So I'm going to do a couple of things. I'm going to try set.seed(2021). I'm going to reduce the role of luck... no, I'm not going to do that. Yeah, I am, actually. Oh, seven of 13. Change to 10-fold on train, okay, because I'm just waiting on it. And while I'm at it, I'm going to expand my set a little bit on tree depth. It seemed like six was... going to try it out. It's going to be slow. All right, but yeah, customer segments: let's see, 30 to 50 transactions, average greater than 50, likely to churn. 30... let me see, 50, 60... 30 to... less than 50.
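A minimal sketch of why fixing a seed and using more folds reduces the role of luck. The stream uses R's tidymodels resampling for this; the code below is a plain-Python illustration and is not how that library assigns folds. The function name is hypothetical, and reading the caption as a seed of 2021 is an assumption.

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Shuffle row indices with a fixed seed, then deal them into k folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_rows=100, k=10, seed=2021)

# Each fold is held out once; the model is fit on the other k - 1 folds.
splits = []
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    splits.append((train, held_out))

print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With the seed fixed, rerunning the tuning compares candidate models on the same splits, so a difference between two runs reflects the parameters rather than a new shuffle. The cost is the one noted in the stream: ten folds means ten fits per parameter combination.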
And keep this simple, which means we're ignoring this section. Less than... transactions segment... Oops, I got these backwards. Look at that. So cool. Oh, that's not slow. Okay, so that was our summary in terms of our learnings on average transaction amount. Wow, this is just going and going. I guess it's got a lot of trees, it's got a lot to do here. Let's see: transaction amount is roughly bimodal. Oh, this is slowing down. It's going slow. Let me make sure I have everything... two times four times 16, and it's tenfold. Oh yeah, it's going to be slow. Oh well. I thought I should skip the tenfold cross validation, but you know, I want to be accurate. All right, customer segments, let's go back to this. The low... oh, hi, I should be back, I think, on YouTube. Let me see. Great, okay, looks like I'm back. I just wanted to make sure of that. All right, and now I can jump back in. So here we go. Okay, I'm all saved. I'm not going to do a 10-fold cross validation again; that took way too long, way too long, and I might need to reduce down the rest of them. Wow, that was a lot going on. All right, five-fold. I only have four cores, right? Yeah, too many things going. Let me try this on five-fold. Okay, and let's get back to my EDA though. Thanks for your patience if you stuck around. Training five-fold... still training, yep. All right, transaction amount. I don't like this one, but I'll keep it. For sure we saw that average transaction amount is bimodal, and that total transaction count was, like, greater than 50, less than 50.
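A minimal sketch of the segmentation idea summarized above: derive average transaction amount from the two totals, then split customers on the two thresholds mentioned in the stream (average amount greater than 50, total transaction count above or below 50). The function names, the segment labels, and the exact handling of values equal to a threshold are assumptions for illustration; the stream builds this in R.

```python
def average_transaction_amount(total_amount, total_count):
    """Derived feature: total transaction amount divided by total transaction count."""
    # The stream checks that total count is always at least 10, so no zero division here.
    return total_amount / total_count

def segment(total_amount, total_count, count_cut=50, amount_cut=50):
    """Assign a coarse customer segment from transaction volume and typical size."""
    avg = average_transaction_amount(total_amount, total_count)
    volume = "few" if total_count < count_cut else "many"
    size = "large" if avg > amount_cut else "small"
    return f"{volume} {size} transactions"

# Hypothetical customers: (total amount, total count)
print(segment(3000, 40))  # average 75 over 40 transactions
print(segment(4000, 80))  # average 50 over 80 transactions
```

Two named cut points are easier to explain than a continuous interaction, which is the appeal of the segment view described in the stream; the trade-off is that customers just either side of a cut are treated very differently.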
And here's my dividing of the segments based on churn risk. I might go back to the same graph on some other things, but what I'd really like is to find out what the other most important features were. I don't actually have that graph handy, but I can go back to that ranked one, that one that was faceted by metric. Let's just look at this. All right: transaction count, total revolving balance; utilization ratio maybe has a little importance. You know, I'm actually going to take this gathered... I was going to wait, but let me do it anyway. What I'm going to do is say: gathered, group by metric, summarize auc is roc_auc_vec of churn by value. I always get it... it's going to be churned, value. Oh, this extra tuning really is taking its time. Let's make this graph while we have it open: metric is fct_reorder of metric by auc, and do the geom_vline, lty is 2, xintercept is 0.5. It's like age is not very important, but total transaction count is important. This is what I'm going to do: I'm going to add a mutate, average transaction amount is total amount over total trans count, and I'm going to throw that into the gather. How many did I even do? That was indulgent. That's all right. So where's average transaction... oh, average transaction amount is not predictive on a pure linear scale. So that's why I want to wait till I have an actual tune there. Oh, all the models failed, for real. Okay, let's see what I did wrong. Look, here's one I'm going to try. I'm going to do [Music] xg... that at least didn't fail. Oh, metrics, metric set... yeah, see, all the models failed. So the notes column: oh, cores did not deliver results. Something's up with my parallel processing. Let
me try this with a smaller number. All right, that works. What was my best model again? I said that somewhere: mtry 3, 780 trees, learn rate 0.02. Just experimenting around here, trying some different learn rates. So I'm experimenting a little more cautiously; I don't want everything to crash again. But once I do that, I'm going to try out some of this model in the meantime. What did I do? I just doubled the amount of time it's going to take to run, and plus it's a little longer in the number of trees. I'll give it a sec, I'll give it a minute. Okay, so: learn rate, let's see. Yeah, three, lower learn rate, right, that's better. And the best is 0.084, but I think it's not better than that... it could have been randomness. So 0.084. And I do need to start at 200 trees; too many is making it harder to tell where things go well. Let me see. Oh, but I do need to create this... the importance graph. Okay, so it's interesting: transaction amount is still more important than average transaction amount. Now, these two just work well together. Total revolving balance, and some of the other ones we want to look at in the time left. All right, so other important ones to try: total revolving balance, total change. [Music] All right, let's look at a few of those. So revolving balance, what is it related to most? Total amount, the total count change, total amount change. And if I look at the... here we go, where's my data... one second... revolving credit... yeah, I'm actually going to open that up. Oh, okay. Wait until later. All right, so let's see, total revolving... oh, there it is. Okay, a little bit better, and the best was a
thousand. Look at this: tree depth five, it was still at 0.02, kind of still learning, but you know, I like it. Tree depth five, and at this point, yeah, see, tree depth five was still learning at that rate. Maybe I should remove the seven and let it go to 1,200. And I decided that 0.018 is too low. Okay, so that one was 0.0841. Trying one more model in the meantime. So if I say total revolving balance, and train, total revolving balance... all right, it starts at zero. So what that means is that if I want to look at the rates, I'm going to do group by balance bucket, which is cut of total revolving balance into zero, one thousand, two thousand... it kind of maxes out at twenty-five hundred... let me see, where is that... oh yeah, 2,499, and infinity. It's 2,499, I think. Yeah, there's a bunch a little over 2,500. All right, here we go. Ah, you know, I tried this and it's a little bit worse than the last version. That could all be random luck. It seemed to really like six at this lower value, five at this higher value. It's hard to say. I still like learn rate 0.02. Yeah, I'm going to stay just with learn rate 0.02 and I'm going to go back to three. And now I did three and I gave it one more shot at tenfold, five or six. Oops. All right, and I'm going to need a couple of labels. There you go. And let's also bucket that by the transaction count. I really like that transaction bucket we were looking at; let's bring it back in. You know, no interaction term. All right, let's take a quick look at this. Yeah, it's like tree depth 6, 800 trees, still not as good as this earlier variation. Where was that variation I liked? I liked 6, 0.019, 840. But there's so much randomness here that I'm just going to go with the best version here, I think. Yeah, I think that's better than that; it was a little better than it was on holdout. And let's remind ourselves what's in
here: nothing I don't want in there. All right, mtry is three, 1,040 trees, learn rate is 0.02, tree depth is five. All right, actually we fit it on the full data set. It wasn't improving. You know, I still like it more than the other score, because it did slightly better than the test, and I have a lot more data locally. So I'm going to choose that one, even though it's worse on there. And I'm also going to try: remove gender, five, six. And what else do I try adding or removing? Because I feel like I'm pretty close here. I could easily be missing an important one, but... Oh yeah, interaction term. Not really an interaction term... transactions. Let's try average transaction and utilization, how about that. The utilization feels like it's related to the... yeah, I feel this way. And I can say... there's a spike at zero and at 0.5, so let's make those the buckets. We'll say average utilization... is it still training? No, it's done. Five, seven, still not doing... So I removed gender, and did it get any better? No, it definitely didn't get better by removing gender. Maybe if I remove some of the others, but I'm not going to do that. I'm thinking through some of the things I can still mess with, since I don't have a lot of time left to play around. Tune learn rate. Oh yeah, average utilization ratio. So here we go, average utilization, and zero to... here we go. So yeah, here's the story we can see on this scatter plot: average utilization... I think there's actually a relation between
these two, where we can say average utilization ratio and total revolving balance are clearly related. Yeah, look at that, they're cleanly like this. All right, I'm thinking of a few ways I can do this. geom_point, color equals churn. That's cool: everyone below a certain revolving balance churned. And yeah, that spike at zero has a lot going on there. There's an interesting relationship. I can facet_wrap by churn. Hmm, well, this is the relationship, and let me add a little bit of jitter. While this finishes training, I'll just make sure that I've selected five. Oh, that's too bad. Let me try it on tenfold and do only one tree depth and only one mtry. I'll try one more model. Okay. Oh yeah, let's look at this one, the utilization ratio. That block down at zero is kind of an important thing, and there is kind of a relationship here in terms of the total revolving balance: this particular region is always churning. That's a little bit interesting, but yeah, kind of obvious. Okay, let's take one more look through all the things we... oh, there's the fit, and best was five. This is really similar to the last one I had, really similar. Oh, it was on ten-fold. What would happen at 0.018, what would it say? Let's find out. But yeah, let's actually go through all the things that we learned in our EDA. So first we tried looking at it by gender and education; didn't see much there. Looked at it by education and saw, okay, there's maybe a difference for college, but not a very strong one. A bit of a stronger dip on income. That could pop up in other areas too. We realized then that there's not a really strong effect of age. So then we can look at all the numeric ones
and we said... you know, I really like it, and we're probably just going to stick with it. Yep, I like it. And what we did is we looked at how these predictors differ, to give us some sense, and then we really looked at some relationships. Here's my favorite section, where we then looked at the importance of... we said there are two types of spenders, is that right? And the average transaction amount is roughly bimodal. I guess people that only do a few transactions that are all large are much more likely to churn. That is my favorite graph we made, and I like that we were able to turn it into this segmentation kind of thing. I think it could be understandable. I'm going to make one quick change: labels equals percent. Oh yeah. All right, so, really important: if you're still watching this on YouTube, please go over to the Twitch account and make sure to vote for me. There's a link in the YouTube; it's on Nick Wan's Twitch account. So I'm going to end the screencast here. Excited to see you over there, and I hope you had fun. I certainly did.
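A minimal sketch of the per-feature ranking used during the EDA: score each numeric predictor on its own by the ROC AUC of churn against that predictor's raw value, with 0.5 as the "no signal" reference line. The stream does this in R with a group-by over the gathered metrics; this is a plain-Python illustration with a pairwise AUC. The function name, the feature names, and the toy values are hypothetical.

```python
def auc(labels, scores):
    """Chance that a random churned customer has a higher score than a random retained one.

    Ties count as half a win. Values far from 0.5 in either direction indicate signal.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 means the customer churned.
churned = [1, 1, 1, 0, 0, 0]
features = {
    "customer_age": [45, 52, 38, 47, 39, 51],
    "total_trans_ct": [20, 35, 42, 60, 75, 40],
}

# Rank features by distance from 0.5, strongest single-variable signal first.
ranking = sorted(
    ((name, auc(churned, values)) for name, values in features.items()),
    key=lambda item: abs(item[1] - 0.5),
    reverse=True,
)
for name, value in ranking:
    print(name, round(value, 3))
```

A single-variable AUC only sees monotonic relationships, which matches the observation in the stream that average transaction amount is not predictive on a pure linear scale even though the bucketed plots show a clear pattern.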
Info
Channel: David Robinson
Views: 1,810
Rating: 4.8222222 out of 5
Id: oCGmh3NIJ7I
Length: 125min 18sec (7518 seconds)
Published: Tue Jul 13 2021