Cluster Analysis and Anomaly Detection

Captions
No, but seriously: the stuff we're going to do today with unsupervised learning is a bit more nuanced. Everybody loves supervised learning because it's super fun, and the thing we probably forgot to tell you (or if we didn't forget, somebody else will forget to tell you) is that you have to have labeled data to do supervised learning, and labeled data is rarer than you might like. In practice you end up using a lot of these unsupervised techniques, and as I said, they're a bit more nuanced; there's a bit more art to them. So I'm going to spend a little time explaining how they actually work, to give you a slightly deeper understanding, and that will help you apply them.

First we're going to talk about clustering. What is clustering? That's a great place to start. As I mentioned, it's an unsupervised learning technique, so no labels are necessary. You can have labels, and you can cluster data that happens to be labeled, but you don't need them anymore. That idea of some column we're trying to predict: forget all that. You can just have data.

Clustering is useful for finding similar instances; in fact, that's pretty much all it's useful for. That's not entirely true, because you can also do some kinds of smart sampling and labeling, and I'll explain all of that, but really what's happening is that we're finding self-similar groups of instances. When we talk about similarity, we're taking a data set and asking: which things in here look alike, so that we can put them together? You can imagine customer data, where you'd be finding customers with similar behavior. Maybe they're clicking around a website, so when you cluster them you get customers that all click around in the same way, or buy the same kinds of things. In a medical context, you might cluster diagnostic information, so you're finding patients with similar diagnostic measurements.

Each group we find, the output of these algorithms, is defined by a centroid: the geometric center of the group. If you're grouping patients, they're not all going to have exactly the same diagnostic measurements, so you say all these patients are pretty similar, and you define the group by its middle. You represent the entire group by its geometric center; it represents the average member of that group, and in some ways it exactly represents the average.

The number of centroids is another thing we'll be talking about, and we'll call it K, because that's just a great letter. It can be specified or determined: you can say "I want three groups," or you can say "I don't know how many I want, find the optimum number." Finding the right number of groups for your data is a slightly different puzzle.

So, just so we're clear, here we have some transactional data: the date, which in this case is just a day of the week; the customer and account number; the kind of authorization used (did they sign a receipt or use a PIN); and a class, because maybe we've classified these purchases.
There's also a zip code and the amount that was spent. When we're thinking about clustering, what we're looking for is something like those three blue rows. They're somehow similar: they use the same kind of authentication and they all have an amount around a hundred dollars. They're different, though. You'll notice they're not exactly the same: different days, different customers, different account numbers, different kinds of purchases, different zip codes. But we could group them; that's one way we could do it. Obviously part of the algorithm's job is to find the right qualities for grouping things, but this is one way you could think about how these transactions have some self-similarity in them.

If this were a cluster we discovered, we would define the centroid as the average member. For a categorical field like the date, the centroid is Wednesday, because two out of three of those rows are Wednesday; it's the most common value. The customer is Bob, because two of the three transactions are Bob's, and the account is 3421 for the same reason. So for categorical values you take the most common class, and for numeric values you take the average: the amount is 104, which, if I've done it right, is the mathematical average of those three amounts.
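To make the centroid idea concrete, here is a minimal sketch (not the platform's implementation) of computing a centroid exactly as just described: the most common value for categorical fields, the mean for numeric ones. The three example rows and their amounts are invented to mirror the slide.

```python
import pandas as pd

# A hypothetical cluster of three transactions, loosely mirroring the example.
cluster = pd.DataFrame({
    "date":     ["Wednesday", "Wednesday", "Monday"],
    "customer": ["Bob", "Bob", "Alice"],
    "auth":     ["pin", "pin", "pin"],
    "amount":   [98.0, 110.0, 104.0],
})

def centroid(df):
    """Mode for categorical columns, mean for numeric columns."""
    center = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            center[col] = float(df[col].mean())
        else:
            center[col] = df[col].mode().iloc[0]
    return center

print(centroid(cluster))
# -> date Wednesday, customer Bob, auth pin, amount 104.0
```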
So what would you use this for? It's probably a good time to ask. One idea is customer segmentation: you start with a data set full of customers and you just want to know what kinds of customers you have. You don't even know what you're going to do with them yet, but you know that whatever it is, you want them segmented, in buckets: these are all my customers that are similar somehow, and they're different from those ones. Which customers are similar? How many natural groups are there in your data? That's a good question to ask. Maybe you have just one kind of customer; maybe you have fifteen. Do you know? Clustering is one way to find out.

There's also item discovery: what other items are similar to this one? Say you have a product database, you pick one item out, and you ask, okay, I know what this is like, so what other products are very similar to it, in terms of cost or application or classification? More generally, in the similarity vein: what other instances share a specific property?

You can almost make a recommender, and I say "almost," but you really can. It's not what you'll find if you google "recommender"; you'll go down a rabbit hole of very specialized algorithms. But you can actually build a recommender with the stuff we're showing you, with some clustering techniques and some simple modeling; it's just not going to be what you read about when you read about recommenders. If you like this item, what other items might you like, by similarity?

And active learning is another fun one. As I mentioned at the start, more often than not in the real world you have unlabeled data, and one of the first things you try to do is label it, and labeling can be very expensive. Clustering is a way you can cheat and make labeling cheaper, because you don't really have to label everything: if you know lots of things are the same, you can take one out, label it, and say that label applies to all of them. It's a bit unfair, sure, but if you're talking about a billion things it's pretty close to optimal, compared with trying to label a billion things one at a time.

To be a little clearer on the customer segmentation idea: imagine you have a data set of mobile game users. You've got some mobile game app, and you can imagine playing the game until it eventually pesters you to buy something, a level upgrade or whatever. Your data for each user is their usage statistics: they installed your game, played it for an hour, and never played it again; or they played it for an hour every day at 8 a.m. for a month; or they played it non-stop for three days without sleeping. That's the kind of data you have about your customers. You also know their long-term value, based on how many levels they purchased or whatever else they bought. Your assumption is that early usage corresponds to long-term value: how you use the app on day one somehow corresponds to how much money you'll spend over the next three months.

So the goal is to cluster the users by their usage statistics, to bucket them together: in the first week, this is what these customers did, this is what their usage pattern looked like. For my older customers I have a year of data and I know how much they spent; for the new ones I only have a week. So I find which customers they're similar to, and I can say they're probably going to spend about that much too. If they have similar usage, by clustering, then I assume they'll have similar spend in long-term value, and maybe I can give them a deal or a special offer and encourage them to buy earlier, to drive my revenue. Maybe we end up with a group down here where three percent of the customers have a high long-term value (these are groups of customers by usage), while this group over here is only one percent, and that one is zero percent: those are the customers who never buy anything. So if I'm thinking about offering specials, maybe I'll target that first group, because there's a higher percentage of potential high long-term-value customers in it. Maybe there's a cost to offering these deals, so you don't want to offer them to everyone.
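Here is a rough sketch of that segmentation idea in scikit-learn, not the platform used in the demo. The usage features, the long-term values, and the cluster count are all invented for illustration: cluster users by early usage only, profile each segment by the historical value of its members, then place a new user into a segment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical first-week usage stats per user: [sessions, minutes, days active].
usage = rng.poisson(lam=[5, 120, 4], size=(1000, 3)).astype(float)
# Known long-term value for users we already have history on (also made up).
ltv = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Cluster by usage only -- no labels involved.
scaler = StandardScaler().fit(usage)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaler.transform(usage))

# Profile each usage segment by the value of its historical members.
for c in range(km.n_clusters):
    members = km.labels_ == c
    print(f"cluster {c}: {members.sum():4d} users, "
          f"mean long-term value {ltv[members].mean():.2f}")

# A new user with only a week of data falls into one of those segments,
# and that segment's history becomes a rough expectation of their value.
new_user = scaler.transform([[8, 300.0, 7]])
print("new user assigned to cluster", km.predict(new_user)[0])
```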
Thinking about similarity: do you know about Lending Club, is that a thing here? How about peer-to-peer lending in general? The idea is that if I decide I have too much money, I can loan it out to risky people I don't know on the Internet. You go on there (it's super fun, actually, and they make their data public, which makes it even more fun) and there are people who say "I want money to buy a boat," and you can look at their credit information and what they want the money for and say, sure, I'll fund that, you need a boat, I totally believe in you, and then they pay you back. It turns into a whole secondary market, too, where you can buy loans from the people who funded them: people identify the good ones, fund them, and then turn around and sell them for a percentage, so you don't have to do the legwork of finding the good ones yourself. It's very interesting.

I have a data set of Lending Club loans that I think we'll play with a little today, and the idea is that it's a very similar pattern. I can look at the history of loans and see customers who paid their loan and customers who defaulted, and I know things about their credit. But for a new customer I don't have that information yet, because you haven't paid the loan; if I had six loans that you never paid, I could say, yeah, you're not going to pay this one. What I do have is data about what other people did. So I can cluster your application profile and put you in a cluster with similar applications: similar debt-to-income ratios, similar customer profile information, similar income, things like that. Then I can rank all those clusters by how many troubled loans they contain. These are all loan applications, and in this cluster seven percent of them are loans that never got paid; this one is three percent, that one's one percent, that one's zero percent. If you didn't know anything else, which group would you want to start shopping in? Probably not the seven percent group, unless you just enjoy risk. So this is a way (you could argue perhaps unfairly, but it's statistics) to make some claim about the potential risk of these loans up front, even though it's a new person you don't actually know anything about yet.

Okay, so this idea of active learning. I stole this picture from somewhere; I should make my own, it's not a very good one. The idea is that you're helping the computer do a better job of this machine learning thing. Imagine you have a data set of diagnostic measurements for 768 patients. I love this example because I have that data set, the diabetes data set, but let's say we didn't have the label. We just have the diagnostic measurements; we don't know which of those patients have diabetes. 768 people came through, some nurses took their blood pressure and plasma glucose and collected all this data, and now you say: fantastic, I need to know which ones have diabetes so I can build my model. But now I need an expensive doctor to survey every single patient and tell me which ones have diabetes and which don't, and that part could be very expensive if I'm not the doctor myself.

You could imagine doing this with sampling, but we want to sample intelligently. One thing you can do is cluster all these patients, and rather than sampling randomly, you sample from those groups. You end up with clusters of patients that all have very similar diagnostic profiles (similar two-hour glucose, blood pressure, triceps measurements), not exactly the same, but similar. You take a few patients from each cluster and test them, and you say: well, these patients from this group all have diabetes, so the rest in that group probably do too, and you can just apply that label to all of them.
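Here is a small sketch of that cluster-then-sample labeling trick, on synthetic data standing in for the 768 patients. Everything in it (the cluster count, the three "expert" queries per cluster) is an illustrative assumption, not the talk's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for unlabeled diagnostic measurements; the true labels are hidden
# and only consulted when we "pay the doctor" to label a sampled patient.
X, hidden_labels = make_blobs(n_samples=768, centers=2, random_state=0)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

propagated = np.empty(len(X), dtype=int)
rng = np.random.default_rng(0)

for c in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    # Ask the expensive expert about only a handful of patients per cluster...
    sample = rng.choice(members, size=min(3, len(members)), replace=False)
    expert_says = hidden_labels[sample]
    # ...and give every member of the cluster the majority answer.
    propagated[members] = np.bincount(expert_says).argmax()

# Only possible to check because this synthetic data has known ground truth.
print("agreement with ground truth:", (propagated == hidden_labels).mean())
```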
And yes, we're generalizing, but the whole machine learning process is generalizing, so it's not cheating in the way you might think. This is a simple example; if you'd like a more realistic, high-cost situation, imagine you have a data set with a billion transactions and you need to label each one as fraudulent or not. Or a million images, and somebody has to look at each one and decide whether it's a picture of a cat, the classic internet meme. That's a very expensive process: looking at a billion transactions and going "that one's okay, that one's not, that one's okay" is going to take forever. You're much better off with an approach like this, where you sample and reduce the size and say: of this billion transactions, sure, it's a billion, but ten million of them are essentially the same and they were all fine, so I'm going to call that whole group fine; if there was one fraudulent transaction in there, I don't really care anyway, because it was five dollars. So you can filter the data set down with clustering and do this smart sampling and labeling, and it gets very useful. And again, hopefully you're seeing how this turns into a workflow: it's not always a beautiful clean data set where you do one great thing and move on; it's a bit more nuanced.

So, item discovery. This is what we're going to do, because it's super fun, even though somebody put this talk in the morning, so we get to talk about whiskies at nine a.m. I have a data set of 86 whiskies, and each one is scored on a scale from zero to four (so, numeric values) on twelve possible flavor characteristics. Does anybody in here drink Scotch? A few hands. Did you know there were twelve possible ways to describe the flavor of a Scotch? No? It's a very interesting world, professional Scotch tasting, I guess.

What we're going to do is cluster these by flavor profile, and that will find whiskies with similar flavor profiles. Then I can start saying: these are the fruity whiskies and those are the smoky ones, and it could be a process where I say, I really like this one, what's another one I might like? Now I know, because there are five others with a very similar flavor profile; maybe not exactly the same, but similar.

Because it's such a great data set, we're going to do this live. I haven't done this one in a while, so we'll see if it goes well. Let me find the data; bear with me for a moment... yes, we're winning now, though that font is a little big. So, as advertised, we have these twelve flavor profiles: floral, fruity, malty, spicy, honey, tobacco (did you know a whisky could have a tobacco flavor? that's an odd one, I think), smoky, which should be pretty familiar to anyone who's ever had a Scotch, and so on. We also have a text field up there, the distillery, which is a text field right now, and we're going to largely ignore it.
You could think of the distillery as a label, but it isn't one in the sense that we mean for machine learning labels, because it's distinct for every whisky, a different value for every single row, so it's not really a class; it's just the name of each one. Actually, let's go ahead and change it: we can make it a categorical field, it won't hurt anything. So we do our one-click data set and explore what these flavors look like. Again, you'll notice that every value in the distillery field is different, and there's a little red exclamation mark. I'm going to try to remember to point out things we haven't told you yet, and this is one: when you create a data set, we run a little heuristic that tries to determine whether each field is going to be useful for machine learning. Any time you have a value that's distinct for every single instance, it's probably not useful; it has too much variance, and you're not going to be able to generalize from it. Think of a patient ID, or in this case a name that's unique for everyone; there's nothing to generalize from. So it's marked automatically as "you probably don't want this field." That's a nice little feature. (Yes, fair enough, that's true.)

Let's make a cluster, though; that will be more fun than talking about ID-like fields. Under the gear menu we choose "configure cluster," and I'll show you why we're configuring it. You can do a one-click cluster; if you do, it does what I'm doing right now, but with eight groups by default, so it would take these whiskies and break them into eight groups. For this demonstration I want ten groups, because it makes for a cooler demonstration. You could ask, well, why does ten make a cooler demonstration? The completely honest answer is that I tried eight, and I tried a few other numbers, and ten did what I wanted. You might say that's a horrible answer, and I would say: that's how clustering works. This is a specific case where I know the behavior I want from this cluster, so I'm going to make it do that, and that's a perfectly valid thing to do. There are times when you have no idea how many clusters you want or what the behavior should be, and I'm going to show you how to handle that too. I'm also going to turn on this thing called "model clusters"; we'll play with that in a bit, but that's enough explanation for now. So this goes ahead and runs, and now we have all of these groups of whiskies.
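The demo uses the platform's UI; here is a rough scikit-learn equivalent of the same step, under assumptions: a `whiskies.csv` file with a `Distillery` column and twelve numeric flavor columns scored 0 to 4 (the file name and column names are invented), clustered into ten groups as in the demo.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumed layout: one row per whisky, a Distillery name, and twelve flavor
# columns each scored 0-4 (Body, Sweetness, Smoky, Medicinal, ...).
df = pd.read_csv("whiskies.csv")
flavors = df.drop(columns=["Distillery"])        # ignore the ID-like field

# Ten groups, as in the demo (chosen simply because it behaved nicely).
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(flavors)
df["cluster"] = km.labels_                       # essentially a batch labeling

# Each group's centroid is its "average whisky" in flavor space.
print(df.groupby("cluster")[list(flavors.columns)].mean().round(2))

# Item discovery: other whiskies in the same cluster as one you like.
liked = df["Distillery"].iloc[0]                 # pick any whisky here
c = df.loc[df["Distillery"] == liked, "cluster"].iloc[0]
print(liked, "->", df.loc[df["cluster"] == c, "Distillery"].tolist())
```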
What you're looking at: each of these little disks represents a group of whiskies. Visually, when you have one highlighted, like this red one in the middle, the groups drawn nearby have flavor profiles that are near it, and the one floating way out here is not nearby; it has a very distinct flavor profile from the rest of those in the middle. If you pick any one of them and click on it, it moves to the center and you can see how the rest are distributed around it. Remember, this is really a big twelve-dimensional space and we're looking at a two-dimensional view, so what we're looking at is distances: in this visualization, these two flavor profiles are somewhat close to the highlighted one and the rest are really far away from it. For any of these I can freeze it and see its qualities: this group of whiskies has a body close to 4 and a sweetness close to 1.5, and for the other features I can read off the centroid of the group. That's pretty much all there is to clustering, though there are lots of things we can do with it, so we're going to do a few more; don't panic.

One in particular might be super nice for understanding a group a bit better than just looking at the center of the cluster, and that's why I turned on "model clusters." Pick any of these groups, highlight it, and click this little button down here. What it does is use the cluster assignments as a label: we've just labeled the data set, and the label is the cluster each whisky belongs to. That was cluster 4, so these are all the cluster 4 whiskies and everything else is some other cluster, and it builds a tree that tells us the properties that differentiate cluster 4 from everything else. That's a neat little trick, and to be honest, it's not always this clean; you don't always get a decision tree this tidy and short. (You can also use association discovery for this, but we won't have time for that in this session.) Still, you often get trees like this one, and you can learn very quickly that what distinguishes this group is the medicinal quality: if it's greater than two, you're in this group, so these are all the whiskies with a very strong medicinal flavor; that's a very strong pattern for membership. Even if the medicinal score is two or less, the tree also splits on smoky: if smoky is greater than three, it's also in this group. So these are the whiskies that are medicinal or smoky, and by "or" I mean the traditional non-exclusive or.
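That "model clusters" trick is easy to approximate: turn "is this row in cluster 4?" into a label and fit a small tree on it. A sketch continuing the hypothetical whiskey variables from the earlier snippet; the cluster number and the tree depth are arbitrary choices, and the thresholds the tree actually finds will depend on the data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Continue from the earlier sketch: `flavors` is the flavor table and
# `km.labels_` holds each whisky's cluster. Explain cluster 4 vs. the rest.
in_cluster_4 = (km.labels_ == 4)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(flavors, in_cluster_4)
print(export_text(tree, feature_names=list(flavors.columns)))
# A short, readable tree ("Medicinal > 2 -> member; else Smoky > 3 -> member")
# is exactly the kind of membership rule described above.
```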
If we wanted to treat this as a labeling project, we could label the whole data set with these memberships, and you do that with a batch centroid. I pass it the data set, and it runs the data through the cluster, assigns each row the label of the cluster it belongs to, and makes a new data set. When I output that data set, I have all of the original data plus the label we just learned: there are nine whiskies in cluster 0 and fourteen in cluster 1, and if you download this as a CSV there will be a new column with each row's cluster membership. So now I can see all the cluster 1 whiskies, and all the cluster 3 whiskies, individually. Makes sense? It's a little early in the morning for heavy things like clustering, but okay.

Let's talk about how this works, because it's actually kind of fun, and I'm noticing I'll have to speed up a little, so I apologize. Does anybody already know how clustering works, this k-means algorithm? No? Fantastic; this will be fun.

So here's a set of objects. These are objects I found on my desk back in a phase when I had younger children, which is why they're so random; that's a true story. I took my children (well, the two I didn't think would eat the battery; there's a third who probably would have, so he didn't get to play) and I said: please cluster this into three groups. Take these objects and put them into three groups. If you want, take ten seconds and think about how you would make three. It has to be three; six or twelve is cheating. They had more fun than you will, because they got to move things around the table. Good times.

This is my middle daughter, Jessa, pictured in the corner there, and these are the groups she came up with. I wasn't done with her, either: I said, that's fantastic, now you have to tell me what these groups are, because I want to understand your thinking. What she did, essentially, is use her prior knowledge about physical objects in the world to select possible features that separate these objects. She didn't say it that way (how cool would that have been? "I used my prior knowledge..."), but that's what she did. What she described were things that were round, things that were somehow skinny, and things that had edges: the ones in the upper right have edges, those ones are round, and the ones down there are all skinny. Then she clustered the objects based on those chosen features, and because some of them match more than one of those characteristics, she went with whichever was more likely. The separation quality was then tested to make sure she met the criterion of three groups (she did a lot of sorting to get exactly three) and that the groups were sufficiently distinct that you could really tell them apart. If you look at the picture, the ones with edges clearly look different from the ones that are round, and there was no crossover.

Then I had to make a slide out of this, so I had to think about the features she described. I invented the length-divided-by-width ratio, which is nice because it captures both skinny and round: if it's greater than one, it's a skinny object; if it's equal to one, it's round; if it's less than one, it's still skinny, I just measured it the wrong way. So that separates skinny from round very nicely. Then you can count something like the number of distinct surfaces. Distinct surfaces come with edges and corners, but surfaces are easier to count: you can look at a cube and count its faces, and a penny has just three surfaces, one, two, three. If you do that, you can create features for these objects, and they look like this: the penny has a length-width ratio of 1 and three surfaces. For the key I had to cheat a little: the length-width ratio is 4, and I called it three surfaces, because otherwise it has, you know, 97 or something, a bunch of little notches, so we just called it 3. And now what I want to do is plot these.
We put them on a graph; we've got two dimensions, so it's very easy to look at, and this is real, this is what it looks like. There's the box, the block, and the eraser; the knob, the penny, and the dime; the bead, the key, the battery, and the screw. They're all on there. If I asked you, looking at this, to put them into three groups, it would be pretty easy, right? My youngest could do it: as long as he doesn't eat the paper, you just give him a crayon. It's not as easy for computers; they have a little more trouble, because they're not as visual as we are. But you can see that with these features, the way she described them, it's very clear how we would cluster these. So with the k-means algorithm, what we want to do is take this clarity that we see in the grouping and make it algorithmic.

Here's how it works. Oh, sorry, first: this is where we want to arrive. These are the groups, and the black dots are the centers; that's where we want to get to, and k-means is going to find it by minimizing distances. The first thing k-means does is say: oh, you want three groups? There you go, here are three. Those aren't the three we wanted, so you say, try again, and it says: okay, here's the catch. I'm going to get better every time I do this; I'm never going to get worse. I know what you want; that was just a guess, and I'm sorry it's terrible, but the algorithm is going to try harder. What it basically does is measure the distance from those centroids to all of the points in the data set, and then it moves them closer to where there is, essentially, more mass; think of it like a gravitational pull (it isn't one, but it's an okay analogy). So the centroid that's connected to those four points is going to move toward the three on the left, because it's attracted there; that one is going to move toward those two; and that one is going to move toward the three at the top. It just moves them closer to the points they're closest to. Then it tries again, and it keeps doing this until the dots stop moving. Eventually you reach some kind of minimum, the centroids don't move anymore even if you run the algorithm again, and that's it: the centers have been found.

There are some caveats. Features matter: this is my other daughter, and she found a different grouping. She said those are all metal, those are wood, and those are other; a very straightforward measurement. So the features do matter for this puzzle. And, I probably don't need to belabor this too much, k-means does converge: if you're grouping these four dots, you're not sure which grouping you'll get, you'd want one of these two and you'll get one or the other, but it will find one of them; it's not going to wander around forever trying to decide. It's going to say "that's the grouping" and stop. So it works pretty well. Starting points are an issue (I probably need to zip through this): there's a risk of suboptimal convergence, but don't worry about it, you'll almost never see it.
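Here is that guess-measure-move loop as a from-scratch sketch, just to make the iteration concrete. It's plain Lloyd's algorithm, not the platform's implementation, and the object feature values are rough guesses.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm): guess k centroids, then repeat
    'assign every point to its nearest centroid, move each centroid to the
    mean of its points' until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid; nearest centroid wins.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = d.argmin(axis=1)
        moved = np.array([
            points[assignment == j].mean(axis=0) if np.any(assignment == j)
            else centroids[j]                      # leave empty clusters alone
            for j in range(k)
        ])
        if np.allclose(moved, centroids):          # nothing moved: converged
            break
        centroids = moved
    return centroids, assignment

# Two made-up features per object: [length/width ratio, number of surfaces].
objects = np.array([[1.0, 3], [1.0, 3], [1.1, 3], [4.0, 3], [3.5, 2],
                    [1.0, 6], [0.9, 6], [1.2, 6]], dtype=float)
centers, groups = kmeans(objects, k=3)
print(centers)
print(groups)
```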
This part is kind of fun, though: we're a little smarter about how we choose the starting points. I just said "here are three dots," and if you do it that way you can get into trouble, so we actually use a technique called k-means++, or maybe a variation on it, for choosing those starting points very carefully. (I still need to reword that slide; the sentence on it is horrible, so I made pictures instead.) These are the data points we want to cluster. What you essentially do is pick one of the data points itself as the first centroid, and then you start looking at distances from that centroid in terms of probabilities: you choose the next centroid with probability based on distance, so you could pick another point from that same group, but only with low probability, and you pick a far-away point with high probability. Maybe we get that point over there. We're still looking for three, so we need one more, and we do it again: those nearby points are low-probability choices, so maybe we end up with one down there, and those are our three starting points. It's just a clever way to make sure you don't get weird starting configurations, for example three starting points right next to each other, so that the algorithm takes forever to converge while they wander to where they eventually need to be.

Scaling matters, a lot. If you're clustering a homes data set, you have the number of bedrooms and the price, and you see the scales: one is in the hundreds of thousands and the other is small integers. The distance between those two red dots is one, and the distance between the red and the green is 160,000. That matters for this algorithm, because it's all about distances; it matters so much that we do the scaling for you automatically. You can turn it off if for some reason you just want to have scaling issues, that's fine, but by default we normalize all of these things for you.

There are other things we handle for you as well: what is the distance to a missing value? What is the distance between categorical values, how far is it from red to green? What about between text features? Does it have to be a Euclidean distance? So far we've only been thinking about plain distances. And what is the ideal number of clusters? I'm going to skip through some of this, I apologize, because we do most of it for you automatically.

One thing that does trip people up: you can't really take a distance to a missing value, so you have to replace missing values. By default, when we're clustering, if a row has a missing numeric value we just don't know what to do with it and we throw it out. You can change that behavior, but if you happen to have a data set where every single row has some numeric value missing, the clustering just says "there was nothing to do." When you configure the cluster there's a default numeric value option, so if you have missing numeric values you can say: for any row with a missing numeric value, use zero instead, or the minimum, or the mean, and so on. You can force that choice, because missing values are hard to deal with; otherwise, by default, we ignore those rows. (There are a couple more slides here, including a really great picture, but don't worry about them.)
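The standard k-means++ seeding is simple enough to sketch directly: pick the first centroid at random from the data, then pick each next centroid with probability proportional to its squared distance from the nearest centroid chosen so far. This is the textbook version; whether the platform uses exactly this or a variation, as the talk hints, is left open. In practice you would scale the features first, for the reasons just given.

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """Textbook k-means++ seeding: far-away points are likely next centroids."""
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]     # first one: any point
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids],
                    axis=0)
        probs = d2 / d2.sum()                            # far points: high prob
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

# Reusing the toy objects from the previous sketch:
print(kmeans_pp_init(objects, k=3))
```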
This next part is actually a bit more fun, though I think I only have an hour and I still have to do anomalies too. What if you don't know the number K? I just love explaining this algorithm. This is another algorithm we have for clustering, called G-means, and the idea is that you don't know how many clusters you want, so we come up with an algorithm that figures it out. Imagine this view is zoomed in, with more of the data set around us, and this is one of the groups we've found: the yellow dot is the centroid and these are the points in that group. What we ask is: what if this were two groups instead of one? So we just try it. We basically run k-means with two centroids on this subset and let the points move around, and then we think of those two child centroids as defining a line and project everything onto that line, which is very easy to do, and you end up with some kind of distribution. You can see that those points would make a clump on the left and those would make a clump on the right, and if we see that sort of bimodal thing we say: that should have been two, not one; two is better, so we keep the two. If it comes out all smooth, then two is no better than one, so we keep the one. You do this for every little group, and you keep doing it over and over.

So if you had something that looked like this, you would start with k equals 2, maybe you get two centroids that look like that, and then you take each of those groups and do the same try-splitting thing. We try this one with two centroids and that one with two centroids; one comes out kind of smooth and the other comes out kind of bumpy, so we say: that one should stay as one, but that one should be two. So we run it again with k equals 3, maybe we get that, and do the same test: smooth, bumpy, bumpy, so keep that one and split both of those, which means K had better be five. We run it again, they're all smooth, and we stop. That's the basic idea. Super cool. If you want to use it, G-means is right there.

This whisky data set isn't as good for G-means, so let me show you another example; I'm going to need it anyway. We'll take our diabetes data set again, and if you configure the cluster you can choose the algorithm, so I pick G-means. The number K goes away, because it can figure that out, but nothing's free, there's no free lunch: the number of groups goes away and you inherit this critical value instead. I was talking about smoothness and bumpiness; the critical value is how picky you are about how smooth or bumpy those distributions have to be. It's a little Gaussian test; that's the G in G-means, G for Gaussian. You can read the little description there, but if you set the critical value closer to zero you get more groups: it's much pickier about how smooth those distributions have to be, so it keeps splitting and splitting until they're really smooth. If you set it really big, it accepts really lumpy things, and you get fewer groups in general. So you do still have a parameter, but the nice thing is that the default of five works pretty well, which means you can effectively have a one-click cluster: if you do a one-click cluster, that's what it does, it runs G-means for you with a critical value of five and just tries to identify the optimum number of groups.
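Here is the split test just described, sketched from scratch: split a cluster in two, project its points onto the line between the two child centroids, and run a Gaussian test on that one-dimensional projection. The Anderson-Darling statistic and the threshold of 5 are my stand-ins for the platform's critical value; the exact test it applies is an assumption here.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(cluster_points, critical_value=5.0, seed=0):
    """G-means-style test for one cluster: try two children, project onto the
    line between the child centroids, and ask whether that 1-D projection
    still looks Gaussian. Bumpy (non-Gaussian) means keep the split."""
    if len(cluster_points) < 8:                   # too few points to judge
        return False
    km2 = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(cluster_points)
    c0, c1 = km2.cluster_centers_
    v = c1 - c0                                   # direction between children
    projection = cluster_points @ v / np.linalg.norm(v)
    stat = anderson(projection, dist="norm").statistic
    return stat > critical_value                  # large statistic = not smooth

rng = np.random.default_rng(0)
secretly_two = np.vstack([rng.normal(0, 1, size=(200, 2)),
                          rng.normal(6, 1, size=(200, 2))])
just_one = rng.normal(0, 1, size=(400, 2))
print(should_split(secretly_two))   # True: the projection is bimodal
print(should_split(just_one))       # False: the projection stays Gaussian
```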
For the diabetes data set it goes two, then four, then six, and we end up with seven clusters, and everything else after that works exactly the same: I can introspect any of these groups, and (I didn't show you this before) I can take a group of items from the cluster, pull it out as a new data set, and analyze it individually.

So, to sum up: clustering is an unsupervised learning technique for finding self-similar groups. The number of centroids can be specified or computed, you choose, and what you get out is a list of centroids. You've got k-means and G-means to pick from, and as the cluster parameter you have either K or the critical value; those are your choices. There are lots of options for missing values, summary fields, scales, and weights; don't forget about model clusters, and association discovery is also super cool for this. And I did at least show you one batch centroid: you can use the cluster assignment as a labeling, assigning cluster labels to everything in your data set, which is fun to do too.

Let's go straight into anomalies. It's very complementary to clustering, and I have too many cool things to say about it. Finding the unusual: what is anomaly detection? It's one of those things you probably have an immediate, visceral response to; it feels obvious. But of course we have to define it more rigorously so we can talk about how to do it and what we mean by it. It's an unsupervised learning technique again, so no labels necessary, and it's useful for finding unusual things. What I mean by that is you can use it for filtering, or for finding mistakes in your data, and you can even build a one-class classifier with it, which, if I have time, I'll explain at the end; it's super cool. Basically you're finding instances that do not match. If you think about your customer data again, you're looking for an unusually big or small spender, perhaps, some behavior that's different from all the rest of your customers. In a medical context, maybe you're looking for a healthy patient who, despite all indications to the contrary, should be dead and is somehow perfectly fine. That's an anomaly.

We define each unusual instance by some kind of score; that's the goal, because we have to rank these points somehow. The anomaly detector we use has a scoring system that's a number from zero to one: zero means it's completely normal, nothing on earth could be more boring than that data point, and one means it's basically impossibly unusual. It's actually an exponential scale: a 0.7 is way bigger than a 0.6, which is still bigger than a 0.5. It's not linear; the difference between 0.7 and 0.6 is huge compared to the difference between 0.6 and 0.5. (We can look at standard deviations and such, too.)

Remember the clustering example with those transactions we said were similar? Now we're looking for transactions that are somehow statistically unusual, and we might pick that one. Why that one? Because the amount, over two thousand dollars, is way higher than all the other transactions. It's enormous; nothing else is even close to a thousand. It's the biggest amount, and that makes the data point pretty unusual.
It's also the only transaction in zip code 21350, the only one in that zip code, and it's for a huge amount: that's two things that make it unusual. It's also the only purchase of something related to technology: three things that make it unusual. That's what we'd want the anomaly detector to find and say: you probably want to look at this one, it's weird.

Use cases: you can use this for plain exploration. With a lot of these unsupervised techniques, when you're playing with a new data set it doesn't hurt to run some clustering, some anomaly detection, association discovery, topic modeling if there are text fields, and just see what's in the data. Anomaly detection is a good one for that: you're just looking for anything really unusual in there, and that's a perfectly valid thing to do. For intrusion detection, you're looking for unusual usage patterns; for fraud, unusual behavior. For identifying incorrect data, you're looking for mistakes: the detector says this point is really unusual, and you go, whoa, that's just wrong, somebody entered it incorrectly. Or you're simply taking out your outliers: remember, supervised learning models need to generalize, and if you have anomalies in there they don't generalize as well, because the learning algorithm sees those anomalies and says, wow, it looks like if your plasma glucose is even higher, you might be healthy, because this one person was healthy. So if you're in this range you're sick, but if you get just a little more sugar in your blood you're going to be fine, because I saw one person like that. That doesn't generalize well, so you want to take those outliers out. And then there's model competence; I mentioned it yesterday and saw a few confused looks, so we'll try to hit that one, because it's fun.

Removing outliers: as mentioned, these models need to generalize, and outliers negatively impact that generalization, so we can use the anomaly detector to find them and filter them out. The diabetes data set is actually a great one for this. Just to give you a sense of the workflow: you take the source, build a data set, build your training and test sets (I showed this yesterday), and you basically use the anomaly detector as a filter: you take out the anomalies, then you model both versions and compare them, so you can see how each model performs. Diabetes is great for this because you take out two anomalies and the evaluation metrics go up by three percent; there are just two patients in there that are very unusual. I'm not going to show you that full workflow, but I can show you those two unusual data points. Let's do that: we'll take our diabetes data set and, in this case, do a one-click anomaly detection; it's right there under the one-click unsupervised options. I'll tell you a little about how this algorithm works in a moment, but what you get back is the most anomalous data points, the things that were really unusual, and the score for each one. You can still use this detector like a model, too: run new data points through it and get an anomaly score for anything. So I can use it like a model; well, it is a model.
So these are our top most anomalous points; let's have a look at this one. This is a score of 0.63 (in the UI we show them as percentages). Generally speaking, anything over 0.6 is pretty unusual for an average data set; not all data sets, but for an average one, anything over 0.6 is pretty suspicious. So we can start looking at this patient. Actually, let's have a look at this one, it's a good one. This patient has a super high insulin level, 744, at the top of the scale; that's the little orange dot there. Their diabetes pedigree is also at the top of the scale, 2.3, which is a measure of how much diabetes there is genetically in your family. And their two-hour plasma glucose is basically the highest in the data set, at 197, which is incredibly high. And they do not have diabetes. When we built this model, plasma glucose was about the strongest feature for predicting diabetes; this patient has the highest value in the data set and they're perfectly fine. There are a couple of possible explanations. One is that it's a mistake: the doctor's sloppy handwriting, that F was supposed to be a T, somebody wrote it down wrong, and now you can go pull that record and fix it. The other possibility is that this is the luckiest person in the world, which, let's be honest, is super great for them, but we can't all be the luckiest person in the world, so this person is really bad for our model. Our model needs to apply in a general sense, and we can't all be that person, so we really have to take them out; we can't consider them in this model if we want it to generalize. And we can do that: we can take, for example, these two most anomalous points, hit the button there that says "create data set," and those two patients are gone. So, totally unsupervised, we found the most anomalous points, selected a few, filtered them out, and now we have a clean data set. Super fun.
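That remove-the-outliers-and-compare workflow is easy to mimic with an isolation forest (the same family of algorithm explained later in the talk). A rough sketch, using scikit-learn's built-in diabetes regression data purely as a stand-in for the Pima data in the demo; the data, model, and metric are all different, so treat it as the shape of the workflow, not a reproduction of the 3% improvement.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Score every row; lower score_samples means more anomalous.
iso = IsolationForest(n_estimators=128, random_state=0).fit(X)
scores = iso.score_samples(X)

# Drop the two most anomalous rows, as in the demo.
worst = np.argsort(scores)[:2]
keep = np.ones(len(X), dtype=bool)
keep[worst] = False

before = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5)
after = cross_val_score(RandomForestRegressor(random_state=0),
                        X[keep], y[keep], cv=5)
print(f"mean R^2 with outliers:    {before.mean():.3f}")
print(f"mean R^2 without outliers: {after.mean():.3f}")
```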
Intrusion detection: this is a real-world example from my own personal history. You can take a data set of each user's command-line history; in case you didn't know, the sysadmins can actually see this stuff as you type it in, so they know what you're doing. The data for each user consists of the commands, the flags, and the directory they were in when they executed each command. There are lots of things you can measure about where people type commands and how they type them, and it works like a fingerprint. The assumption is that users typically issue the same flag patterns and work in certain directories, and that assumption, by the way, is completely true; it's fantastic how well you can fingerprint people with this. The goal is to identify unusual command-line behavior, per user and across all users, that might indicate some kind of intrusion. So I would make a bunch of anomaly detectors. I make one per user, so if one day you suddenly start typing `ls -la` instead of `ls -al`, I know you're either having some kind of aneurysm or somebody other than you is operating your account, because you never type `ls -la`. I can make an anomaly detector for the directory you're in when you execute a command: you type `rm` sixteen times a day, but you never type it in the bin directory, and that's a really weird thing to do today. I can do it across all users: nobody changes into /usr/bin, nobody does that, so if you do, that's weird. Or across all directories. And I can use all of these anomaly detectors together to build a score for everything you type, describing how anomalous each command looks. Super fun.

Fraud is the same kind of idea: you can look at credit card transactions at the card level, the user level, and the similar-users level. I talked about that a little yesterday, so I won't belabor it, but it's the same pattern: build several anomaly detectors and look at how anomalous a transaction is across several different data sets.

Model competence: this one is useful enough that I'll spend a few extra minutes on it. After you put a model into production, the data you're predicting on can become statistically different from the training data; this is why you have to retrain periodically. What I mean is: say you run a company that makes loans, and you build a model that predicts the likelihood that a loan will be paid back, and it turns out that while you were building this data set, nobody over the age of 50 applied for a loan. You put the model into production, and suddenly there's a bunch of retirees applying for loans, and their profile doesn't match your training data at all: totally different incomes, totally different status, different ages, different interests. Your model doesn't know anything about these people; it's completely incompetent for them. And you can measure this in real time. The way you do it is: you take your data set, you build your model (the thing you use for scoring), and you build an anomaly detector at the same time. Now think about what that anomaly detector tells you for any data point: it's telling you how well the point matches the training data set; that's exactly what it measures. If a data point gets a high score, it doesn't match things in that data set; if it gets a low score, it matches lots of things in there. So at prediction time you run each new point through both. You say: here's a prediction, this loan is okay, with a 0.86 confidence; and this other loan is okay, with a 0.84 confidence. But the anomaly score for the first one is 0.53 and for the second it's 0.7. So the model thinks both loans are fine, with relatively high confidence, but the second prediction is being made on a loan that doesn't match the training data, so I don't know if I can actually trust that prediction, even though the model says it's probably fine. It's in a domain of the data where the model really doesn't know what it's talking about; it's extrapolating from what it knows into an area where it knows nothing. So this is a real-time measure of how competent the model is in that particular prediction, and here you'd say the model is not competent. You do this for every single prediction, and if you see these accumulate, lots and lots of them, then it's time to retrain: go back, get new data, and build a new model that matches the data that's actually coming in.
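A sketch of that pairing: a classifier for the prediction plus an anomaly detector for "does this input even look like my training data?", on made-up data. The isolation-forest score here isn't on the 0-to-1 scale the talk describes; it's just flipped so that bigger means more unusual.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Made-up stand-in for historical loan applications and their outcomes.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)              # scoring
detector = IsolationForest(n_estimators=128, random_state=0).fit(X)   # competence

def predict_with_competence(rows):
    """Return (prediction, confidence, anomaly) per row. A confident
    prediction on a highly anomalous row is one you shouldn't trust."""
    proba = model.predict_proba(rows)
    prediction = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    anomaly = -detector.score_samples(rows)     # higher = more unusual
    return list(zip(prediction, confidence.round(2), anomaly.round(2)))

typical = X[:1]            # looks like the training data
unusual = X[:1] + 15.0     # far outside anything the model has seen
print(predict_with_competence(np.vstack([typical, unusual])))
```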
Okay, this one is so cool I have to tell you about it. Does anybody know Benford's law? Yes? Of course you do. This is really strange (actually, it's not that strange if you really sit down and think about it, but the first time you see it you go, wow): in real-life numeric data sets, the smaller digits occur disproportionately often as the leading significant digit. This is the distribution for a data set that has numbers in it: this is the frequency of numbers with 1 as the first digit, this is the frequency of 2, and this is the frequency of 9. It's a property of data sets that have numbers in them, period. Applications include accounting records; there's a real case of detecting fraud in accounting by using Benford's law, just taking the accounting transactions, looking at the leading-digit distribution, seeing that it doesn't match, and knowing those transactions were fraudulently created. Also electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers. You can't do this in our UI, unfortunately, you need to use the API, but it is pretty fun, and it's one way you could think about doing anomaly detection.

Then there's the univariate approach, which you've probably seen before if you've ever had a statistics class. You have some single variable you're thinking about, say heights or test scores; you assume the values are distributed normally and you compute a standard deviation, which is a measure of how spread out the numbers are, the square root of the variance (I'm going fast, I know). Depending on the number of instances, you choose some multiple of the standard deviation and call anything beyond that an anomaly: this student is scoring three standard deviations above the mean, so they're brilliant somehow; three standard deviations is huge. It looks something like this: here's my normal distribution, here's three standard deviations, and these points out here are the anomalies, the outliers in this case. That's the standard univariate approach, and it can be very effective. It can also fail, and the reason is that multivariate matters. Here we have a data set, and if I ask you where the outliers are, you say: that one and that one. Easy. But remember, real data might have a thousand features and a million rows, so it's much harder to visualize like this. If I take the distribution on this axis, it looks like this: that point is an anomaly, everything else is fine. If I take the distribution on the other axis, that other point is all by itself: it's an anomaly, everything else is fine. So looking at one variable at a time, with distributions, is working great; you agree? But now (and I grant that it's still super easy for you to tell where the anomaly is) if I look along this coordinate, everything looks perfectly fine, and if I look along that coordinate, everything looks perfectly fine. This data point is only unusual in two dimensions at the same time. It's a multivariate anomaly: there's no handy projection onto a single coordinate that will tell you it's an anomaly.

So let's go back to the human experts. Same pieces: which one of these is the most unusual? You have to pick one. I want votes, yell it out. The screw? The battery? Anybody else? The key? Okay. And here's my oldest daughter's take: she said these are skinny, those have corners, these are round, and that one is just wacky, a totally odd shape.
All right, so let's go back to the human experts, same pieces: which one of these is the most unusual? You have to pick one. I want votes, go ahead, yell it out, what do you think? The screw? What, the battery? Okay. Anybody else? The key, okay. All right, so here's my oldest. She said: these ones are skinny, those have corners, these are round, and that one's just wacky, it's just a totally odd shape. So we got a vote for the key. And of course, you know, this one is skinny, but it's also not smooth. So she said that one's the most unusual because it's skinny but not smooth, it has no corners, and it's not round. You see what she did: she made groups. These ones are round, these ones have corners, those are skinny, and this one somehow doesn't match any of those groups, and so it's the most unusual. That's what she said. And the key insight, that's the joke, is that the most unusual object is different in some way from every partition, however you slice up the features. The anomalies are the things that always stand out: I can look at them this way and that one's still weird, I can look at it that way and that one's still weird, every time I look at it, it's always weird in some dimension.

So, the human expert, same idea, I won't belabor it too much. We'll use length and width and number of surfaces, and we have to introduce a smoothness, because she talked about smoothness, and so we get things that look like this. Now we have a smoothness column here, and you see that the battery, for example, is smooth and the screw and the key are not, so we have at least two things that aren't smooth. And now what I want to do is take this data set and just split it, the same way we were just talking about: making groups and splitting things randomly. Let's think about it like a decision tree where I just split randomly: I pick any variable at random and split on it. We'll split on smoothness first; if we do that, the falses go over here and the trues go over there. Then let's split on something else, and I'm picking this totally randomly: number of surfaces is six. We split again and look, over here we're done, we have the key and the screw. And then we just keep going, and keep going, and keep going, until everything is all by itself, isolated. That's the key: isolation. So now we've just done random splits every time until everything is by itself.

And the idea — oh, sorry, I should have told you the idea first, but that's fine, I'm trying to go fast — the basic idea here is that we do this again and again: we grow a random decision tree, totally random, until everything in the sample is its own leaf. And basically, if it takes a whole bunch of random pieces of information to tell two things apart, then they're really similar: they're similar in all these random ways. I took random splits and they stayed together, another random split and they stayed together; it took so many random splits to finally tell them apart, so they're somehow really similar. But these ones up here, it only took one or two pieces of information to tell them apart from everything else, just totally at random: one, two. Now, of course, to do this fairly, what you're looking at is the depth in the tree: if a point is isolated up here near the top, it's anomalous; if it's down here, it's pretty normal. And you just do this over and over and over, so you build, for example, 128 of these weird, wacky random trees, and you look at how deep each point is in each one, and that's how it works. I'm not going to show you that one, don't think about it, and I'm not going to show you that one either. I should at least show you the anomaly detector in action, though... oh, I did, all right, for diabetes, that's true.
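Here is a toy version of that isolation idea, to make the mechanism concrete: pick a random feature, pick a random cut point, keep only the side the point falls on, and count how many splits it takes to leave the point alone, averaged over many trees. It is deliberately simplified (the published isolation forest algorithm also subsamples the data and normalizes path lengths), and the object measurements are invented.

```python
# Toy isolation-depth sketch: shallower average depth => more anomalous.
import random

def isolation_depth(point, data, depth=0, max_depth=20):
    """Number of random splits needed before `point` is alone (capped at max_depth)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    f = random.randrange(len(point))              # random feature
    lo = min(row[f] for row in data)
    hi = max(row[f] for row in data)
    if lo == hi:                                  # cannot split on this feature
        return depth
    cut = random.uniform(lo, hi)                  # random cut point
    same_side = [row for row in data if (row[f] < cut) == (point[f] < cut)]
    return isolation_depth(point, same_side, depth + 1, max_depth)

def average_depth(point, data, n_trees=128):
    """Average isolation depth over many random trees."""
    return sum(isolation_depth(point, data) for _ in range(n_trees)) / n_trees

# Invented measurements: length, width, number of surfaces, smooth (1) or not (0).
objects = {
    "coin A":  [2.0, 2.0, 1, 1],
    "coin B":  [2.1, 1.9, 1, 1],
    "eraser":  [6.0, 2.5, 6, 1],
    "battery": [5.0, 2.0, 3, 1],
    "screw":   [5.0, 0.5, 2, 0],
    "key":     [5.5, 1.0, 9, 0],
}
data = list(objects.values())
for name, row in objects.items():
    print(name, round(average_depth(row, data), 2))
```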
Okay, I'll give you one more example; this one's a bit of a brain-bender, and then we're done. Let's think about a one-class classifier. This is another cool application of the anomaly detector. Let's say you place an advertisement in the local newspaper, and you collect demographic information about all the responders: people write back to you and say "I really want that refrigerator," or they come into your store, so you know who they are, and you collect their information, how old they are, how much money you think they make. So you've got all this data. Now you want to market in a new locality: you want to go into a new city and put your ad out again, and to optimize your mailing cost — we're doing direct letters this time instead of a newspaper ad, sorry — we need to predict who's going to respond. We don't want to send a letter to every single person in the entire city; we want to use the same demographic information to figure out who's most likely to respond and send letters only to them. We want to be very targeted.

Now here's the problem: if I asked you to build this model, you cannot currently distinguish people who are not interested in the product from people who never saw the ad. Do you see how that's a problem? The data was collected from people who responded to a newspaper ad. Lots of people would maybe have loved to buy that refrigerator, or whatever it is, but just didn't read the newspaper that day. So I can't label those people as not wanting my nice refrigerator; I only know that they either don't want it or they didn't know about it. I only have one class: I only know the people who were interested, that's it, I know one thing. This is actually a fairly common problem.

What you can do is train an anomaly detector on this data. You take all your respondents — remember, our data set is effectively labeled "these are all people who responded" — and you have the demographic information for the new city, so we know things about people who didn't respond, but we don't know why they didn't respond; that's the problem. You build an anomaly detector on the respondents and then pick the households with the lowest scores. If a household has a low anomaly score, then it is similar to enough of the people who responded positively: if the demographic information matches, they're going to have a low anomaly score. All the people with high anomaly scores don't match, and are therefore less likely to respond. So I can be sneaky and use an anomaly detector like a predictive model for one class. It's a cheat.
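A rough sketch of that trick, assuming scikit-learn's IsolationForest as the anomaly detector; the demographic columns and numbers are invented for illustration.

```python
# One-class "classifier" via an anomaly detector: train only on known responders,
# then mail the new-city households that look most like them.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Demographics (age, income) of people who responded to the original ad:
# the only class we actually have.
responders = np.column_stack([
    rng.normal(35, 6, 500),          # age
    rng.normal(45_000, 8_000, 500),  # income
])

# Every household in the new city; we cannot tell in advance who would respond.
new_city = np.column_stack([
    rng.uniform(18, 90, 5000),
    rng.uniform(15_000, 150_000, 5000),
])

detector = IsolationForest(n_estimators=128, random_state=0).fit(responders)

# score_samples is HIGHER for households that look like known responders
# (i.e. a low anomaly score in the talk's terms), so we mail the top scorers.
scores = detector.score_samples(new_city)
n_letters = 500
mail_to = np.argsort(scores)[-n_letters:]
print("household indices to mail:", mail_to[:10])
```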
Okay, so anomaly detection is a process that finds unusual instances, and I showed you some techniques, or at least whipped through them: the univariate standard deviation, Benford's law, and the isolation forest — that's what the algorithm is called, by the way. You can control the number of trees, and there are a few other parameters, but they're not super important; basically a one-click anomaly detector is all you need most of the time, and it works pretty well. In fact, I should tell you — and now I'm running over — part of the reason we implemented the isolation forest is that our chief scientist was working on a DARPA-funded project to identify anomalous data (maybe I won't tell you what data), and he was trying different anomaly detectors, building a few of his own and trying different ones from the literature, and when he tried this one it was fantastic, and he basically sent us an email saying, okay, you have to implement this right away, and we said, okay, we'll do that. It's actually a relatively new algorithm, by which I mean from within the last decade or so, and yeah, it's pretty fantastic. You can use it to filter data to improve your models, to find mistakes, fraud, and intruders, to know when to retrain, and maybe to use as a one-class classifier.

My final word on this: as I said at the beginning, these unsupervised learning techniques require more finesse and interpretation, a little bit more practice and playing around, so don't be discouraged the first time you try them; it just takes a little bit of playing to understand what you're seeing. Okay, now I'm done. Are there any questions before we continue with the next session?

Audience question: could you explain a little more how the distances are computed when you cluster categorical values, like colors?

Right, how you tell the distance between categorical values. You're right, I kind of just blew through that, and part of the reason is that if you're using BigML you don't have to think about it, we do it for you automatically. But the basic idea — let me zoom right back to it so I can show you the distance for a categorical — is that you just define a distance. You need some kind of distance, and there's no natural one: you can't really order red, green, and blue; they're all somehow equivalently distant from each other. So you use that fact to create a new distance function. You basically say: if it's a cat and you see cat again, the distance is zero, that's the same thing; but if it's different, you call that a distance of one. So red and green are different, that's a distance of one; red and blue, that's a distance of one too; green and blue, that's a distance of one. If they're not the same it's a one, if they're the same it's a zero. And then, when you have multiple categorical fields, you make a new function that combines these: you compute the Euclidean distance over the vector. In this example, cat versus dog is a distance of one, their favorite toy is a laser versus a squeaky toy, that's a distance of one, but they both like red, so that's a distance of zero, and you compute the Euclidean distance of those integer values, so that's a distance of the square root of two. You basically just map them to zeros and ones.

It's a little more complicated for text vectors, because they're sparse, and that's where we use cosine similarity. You're looking at a vector of potentially thousands of features, so you take the cosine of the two vectors: it's one if they're collinear and zero if they're orthogonal, and since you only have positive vectors here, it stays in that range. Then you actually use something called the cosine distance, which is just one minus the cosine similarity. So if they're the same vector, if they're collinear, you have one minus one, which is zero — and that makes sense, if two vectors are exactly the same you call the distance zero — and if they're completely orthogonal you call it a distance of one, and anything in between is just something in between. It's a very quick way to compute a distance function on a very sparse vector. But all of this happens transparently: when you put a text field in there, we create this text vector inside the data set, and if you're clustering, we do this cosine similarity for you.
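Those two distance functions are small enough to write out directly; the pet records and term-count vectors below are invented examples.

```python
# 0/1 distance for categorical fields (combined with Euclidean distance) and
# cosine distance for sparse text vectors.
import math

def categorical_distance(a, b):
    """Euclidean distance where each categorical field contributes 0 if equal, 1 if not."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

pet_a = ("cat", "laser", "red")
pet_b = ("dog", "squeaky", "red")
print(categorical_distance(pet_a, pet_b))   # sqrt(1 + 1 + 0) = sqrt(2)

# Tiny term-count "text vectors" (real ones have thousands of entries).
doc1 = [2, 0, 1, 0]
doc2 = [2, 0, 1, 0]
doc3 = [0, 3, 0, 1]
print(cosine_distance(doc1, doc2))          # 0.0 (identical)
print(cosine_distance(doc1, doc3))          # 1.0 (orthogonal)
```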
Yep, another question?

Audience question: this is more of a general question. Do you know the books Automating Inequality and Weapons of Math Destruction? They argue that using artificial intelligence can increase inequality: for instance, with medical insurance, if you know which people are more likely to get ill, you charge them more, so the people who most need it can least afford it. So finding outliers, or credit scoring, or these other kinds of things can give artificial intelligence a bad image. Could you introduce some kind of ethical parameter that raises a flag when this happens? How would you deal with that?

Sure. I mean, you could totally try, but algorithmically I don't really see a clear way to do it; you almost have to do it by policy. You have to make a standard that says: if you're going to have a model assigning these scores or doing these computations, you have to tell us something about how you did it; you're going to have to reveal a bit about how you computed it. Which is the fundamental problem people are having with predictive models in general: if I give you a neural network that decides whether or not you can have a loan, and I say you can't have a loan, and you ask why not, and I say "because the weight on this node is a three," that's a very unsatisfying answer to everyone. So yes, it has to be a bit more by policy, I think.

Audience question: you introduced and explained k-means as a way of clustering, but as far as I understand there are a number of different ways of doing clustering. I'm wondering whether you're going to build those other approaches, whether you've got a view on them, and more particularly how one would choose between the various approaches to clustering.

Yeah, so by and large, when people are generically talking about clustering, k-means is really fantastic, and the way we've implemented it is super easy to use, so it works for lots of things. But then there are lots of things where it doesn't work at all. We do projects with companies that want help with specific machine learning applications, and in one in particular we needed to look at routes that people were driving and cluster the destinations and the origins and things like that, and for that we had to use DBSCAN — well, a variant of DBSCAN — which is another way to cluster that works a little differently; it's density-based clustering. So yeah, there are lots of these things, and we're market driven, you know: when enough people say "why don't you have DBSCAN?", then we bring it in and everybody gets to enjoy it. But that's one we've already used, so you'll probably see it in the platform at some point, I would guess. So yes, you'll see more, for sure.
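For a concrete feel for the difference, here is a small sketch using scikit-learn's KMeans and DBSCAN on a shape that k-means handles poorly; the parameter values are arbitrary examples, not recommendations.

```python
# Centroid-based (k-means) versus density-based (DBSCAN) clustering.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: well separated by density, not by centroids.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

print("k-means centroids:\n", kmeans.cluster_centers_)
print("k-means cluster sizes:", np.bincount(kmeans.labels_))
# DBSCAN picks the number of clusters itself and labels noise points as -1.
print("DBSCAN clusters found:", len(set(dbscan.labels_) - {-1}))
```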
Audience question: is there any metric you can use to compare the performance of the clusterings found by these different algorithms?

Oh, there's no common metric, no, because realistically, when you talk about clustering, you're being very specific about what you want it to do in the first place. I mean, the way I introduced k-means was: let's find points that are close together in Euclidean distance. That's so specific that k-means does exactly that, and there's not a lot of value in thinking about other ways to do exactly that. If you want to say, I want to find the points that are closest together on the surface of a sphere, then that changes the problem, and you need to approach it in a different way. So no: if what you want to do is assign points by similarity in Euclidean space, then k-means is fantastic; it does exactly that.

[Music]
Info
Channel: bigmlcom
Views: 1,287
Rating: 5 out of 5
Keywords: clusters, clustering, cluster analysis, anomalies, outliers, anomaly detection, unsupervised learning, techniques, Machine Learning, BigML, MLSEV, technology
Id: QhhQE9FmC14
Length: 66min 9sec (3969 seconds)
Published: Thu Apr 30 2020