Anomaly Detection: Algorithms, Explanations, Applications

Captions
>> So, it's an honor to have Tom Dietterich with us today. I think I remember the first moment I ran into Tom. I was passing a seminar room at Stanford, in Margaret Jacks Hall; I looked in and then walked in, and there was some sort of very interesting debate going on, on some machine learning topic. I think Rich's folks were there, and so it's just a memory that I've had, that first glimmer of the energy of Tom Dietterich. I've come to know him and love his work over the years. Tom mentioned during our fireside chat, the video we did last night together where we were talking a little bit about interests, that I think we share one thing: what we both call "Research Attention Deficit Disorder." We get very curious and excited about many topics, and it's amazing to me how deep Tom goes on each of the ones he attacks over the years. Tom did his undergraduate work at Oberlin, and then did his master's at UIUC. I think you mentioned you were following at the time the line of where you could find the best computer. >> Right. >> Sticking phones into those 300 baud sockets and so on in those days. And then he did his graduate work at Stanford University, following I guess in [inaudible] after hanging out with Michalski at UIUC, which got him into the machine learning world. He has worked on many different areas, I think following all the recent trends and leading some of them. Remember, you and I were at the first UAI conference in 1985 together, and I remember you wearing a pair of shorts and asking hard questions in front of the room back in those days. I cannot forget any of these interesting interactions we've had over the years, Tom. Going through the probabilistic graphical model phase, his work has always touched on applications. There was quite a lot of work during the CALO project on desktop events, trying to infer the intentions and goals of users from desktop streams of evidence. More recently, he has been writing, like several of us here, about the challenge of AI in the open world, issues about the robustness and understandability of these algorithms. He was a leader in this whole area that's now called Computational Sustainability, working with Carla Gomes at Cornell on defining the area and somehow convincing the National Science Foundation to fund it deeply, looking at applications of optimization and other areas of machine intelligence to hard challenges involving the environment, species, and the management of planetary resources, sort of resonating with our AI effort now at Microsoft. Today, he'll be talking about a piece of this interesting area on robustness, and I think it's an important aspect: anomaly detection, algorithms, explanations, and applications. So, Tom, thanks very much for being with us. >> Thank you. Is this working yet? >> Yes. >> Can you hear me? >> Yes. >> Okay. So of course, everything that I'm talking about today was done by someone else. This is a collaboration with two of my fellow faculty in Computer Science, Alan Fern and Weng-Keen Wong. Weng-Keen is currently rotating at NSF as a program manager, but he'll be back in the Fall. And then two faculty from statistics, Sharmodeep Bhattacharyya and Debashis Mondal. At least one of them did their thesis at UW, but I don't remember which one. And then several graduate students; as their contributions come up, I'll be pointing those out in the talk.
So, I have a sort of long agenda here; this is more than I would usually try to cover in one talk, but some of the items are fairly straightforward. What I want to do is start by defining this question of anomaly detection, what we mean by that, because it's not a terribly well-defined term. Then I'll briefly review a big benchmarking effort we did. Mostly in the knowledge discovery and databases community, there have been many algorithms published in the anomaly detection area, I think largely because they have many business problems in fraud detection and management, and this is a big tool there. But a lot of those papers just publish an algorithm, a mechanism, without any kind of theoretical foundation. And so we wanted to understand them, and then we've done a bit of work on a probably-approximately-correct type of machine learning theory for what we call rare pattern anomaly detection. Then I'll talk about scenarios like fraud detection, where you have a ranked list of anomalous transactions or users or whatever, and now you have an analyst in the loop. So you want to show the analyst, say, your number one candidate and explain to them why you think it's anomalous. Obviously, it's not sufficient to dump a 180-element vector on them and say, we think this is an anomalous user. So I'll talk about some very simple feature-based explanations. And then, if we've got the analyst in the loop, we can also get feedback from them and use that to re-rank our candidates as we're presenting them. That turns out, at least in our benchmarking, to be really powerful. And then I'll wrap up with two applications. One is a big sustainability application involving deploying 20,000 weather stations in sub-Saharan Africa and then trying to automatically detect when stations break and sensors break and so on. The other is a sort of open-world AI problem, which is the problem of open category classification in, say, multi-class classification. So, defining anomalies. We're going to assume that our data are just feature vectors in d-dimensional real space, and that we have a dataset that is a mixture of what we'll call the nominal points, so that we don't have to call them the normal points, and the others, which we'll call the anomalies. Our assumption is that the anomalies are generated by a different generative process than the nominal points. But typically our data will be unlabeled; we'll talk about that, I guess, on the next slide, which is that there are really three settings. In some cases, people have tried to formulate this as a supervised learning problem, so we get some labeled data, but in most of these anomaly detection applications, the anomalies are quite rare. Fortunately, 50 percent of our users are not trying to defraud us, although maybe it seems like that at times. And so we have severe class imbalance if we approach things from that perspective. Another possibility is that we are training on clean data that has no anomalies in it, but then maybe at test time we are going to get contaminated data and we need to identify those contaminants; that's what happens in the open category case. And then another situation is where our training data is contaminated with some kinds of anomalies and we want to figure out which points those might be. So, we'll be looking at those. I think a very important consideration here is an assumption I call the WDAD assumption.
That's the assumption that the anomalies are drawn from a well-defined probability distribution. Just to back up, if you're formulating things in the supervised learning way, you are essentially assuming that the anomalies are coming from some well-defined distribution, right, and that you can learn about it. But if we're in an adversarial setting, like in the fraud or cyber security kinds of settings, then that really is not a valid assumption to make, right? In fact, our adversaries are constantly trying to defeat whatever distributional assumptions we've made. Another case, the weather network case, seems adversarial in the sense that the number of ways something can be anomalous is sort of an unbounded set. I guess technically speaking it's still a well-defined probability distribution, but you don't have nearly enough data to have sampled the whole thing, so you really can't model it. And another case has come up in a little collaboration I've had with Kiri Wagstaff at the Jet Propulsion Laboratory. Their user is usually a scientist, and they're looking at data coming from some sensor; she has worked on some spectrometers that are on the Mars Science Laboratory. What the expert wants is to find the interesting data points. Now, the first few interesting data points are actually ways that the instrument fails, and they need to understand those. But then they start being interested in unusual spectra that might signal some interesting compounds. But then, once they've seen those, they say, well, don't show me any more of those, show me real anomalies. So it's a time-changing notion. I think we need to be sensitive to when it's appropriate to make this assumption. So there are two main strategies for unsupervised anomaly detection. It depends a lot on this parameter that we call Alpha, which is the fraction of the training points or test points that are anomalies. If Alpha is reasonably large, then we can think about fitting something like a two-component mixture model and actually trying to model that anomaly distribution, if we're willing to make this well-defined distribution assumption. There are various conditions for that to work. Alternatively, if the proportion of anomalies is very small, as it typically is in these fraud cases, then the main strategy is to treat it as a problem of outlier detection. So we're going to try to model the distribution of the nominal points, the ordinary users, and then treat the outliers as our candidate anomalies. The advantage there is that we don't have to make any assumption about the anomaly distribution. But of course, if the anomalies are not outliers, then we will miss some of them. Also, if the nominal distribution itself has very heavy tails, so that it just naturally has a lot of outliers, this won't work either. As I mentioned, in the KDD literature there must be 20 or 30 different algorithms that have been published; there's a huge number of papers there. A problem with many of these is that, unlike in a lot of other parts of machine learning and computer vision and natural language processing, there's not a large collection of standard benchmark datasets on which you could do a thorough evaluation of these algorithms. Often, there's some sort of corporate application with a proprietary dataset that they can't share.
The most popular benchmark dataset is this computer intrusion dataset that was collected as part of a DARPA project back in the late 90s, and it turns out to be extremely easy. So, we strongly believe in benchmarking and the need to have a large and constantly growing collection of anomaly benchmarks. My student Andrew Emmott conducted a big benchmarking study. Our methodology was to repurpose existing supervised learning benchmarks, either classification or regression problems. We chose 19 datasets from the UC Irvine repository, which were all the datasets that satisfied a set of criteria we had: we wanted real-valued features, we wanted enough data, we wanted enough dimension. Then we would choose one or more of the classes to be the anomaly classes, the source of our anomalies, and the rest to be the nominals. This ensures that our anomaly points are being created by some process that's different from the one creating the nominals. Then we systematically varied things like Alpha, the relative frequency of the anomalies; the difficulty of the points, like how deeply embedded they are inside our nominal distribution versus how much they are outliers; we added redundant or irrelevant features to see how much that could defeat the algorithms; and we also looked at how clustered the anomaly points are. Because to a large extent, these anomaly detection algorithms are all doing some form of approximate density estimation. You're essentially trying to model the probability density of the nominal points and then score points according to density; the very low density points are your anomalies. Now, if the anomalies are all tightly clustered, then they form a little peak of density, they don't look like they're low-density anymore, and so you'll miss them. So, we were interested in that. We created 20 replicates for each configuration of these manipulated values, and that gave us about 25,000 different benchmark datasets. So then we compared a set of eight different algorithms. I won't go through all of them. But basically, there are some that are doing density estimation; some that we call the quantile-based methods, of which the one-class Support Vector Machine is probably the best known, and it tries to find a boundary around your data that contains, say, one minus Delta of your data for some quantile Delta that you choose; there are nearest-neighbor-type distance methods; and then there are these projection methods. I want to talk about these because two of these algorithms turn out to be very attractive for many reasons. The first algorithm, and the one that actually performed the best in our benchmarking exercise, is something called the Isolation Forest. Those of you who know me know I love decision trees and anything with random forests, boosting, all these things, so I had a natural affinity for this algorithm. Although when this paper first came out, I said, "This is such a weird idea." But the basic idea is that we're going to construct a completely random binary tree. Again, recall that our data are not labeled, so we just choose one of the attributes at random. Then we choose a splitting threshold uniformly at random out of the observed range of the data. Then we do that, splitting the data recursively, in some sense overfitting radically, until every data point is in its own leaf in the tree. In the paper, actually, they have ways of avoiding having to do that; you can stop the splitting at some point.
The idea is that we'll build an ensemble of these, and the statistic we're interested in is actually the depth of a point in the tree, the depth at which a point becomes completely isolated. Let's say that my one data point x_i is the only data point in this leaf node here; then we would say x_i is isolated at depth three. The intuition is that, in some sense, we're just throwing random separators at the data, like darts at a dartboard. How many random separations do we have to do before a data point is separated from all the rest of the data? The intuition is, well, if you're an outlier, it's going to take only a few random shots to get you isolated. Whereas if you're deeply embedded in a tight cluster, I may have to split, split, split, split forever to get down to isolating you. So, for the anomaly score that we get, d-bar is going to be the average isolation depth over the, say, 100 trees in some ensemble. One nice thing in this paper is that they calculate the expected isolation depth from a little bit of theory. So then they can normalize this so that it lies on a zero-one scale, where anomalies are given a score close to one, and the nominal points typically have scores less than 0.5 or around 0.5. The other algorithm I want to talk about. Yes. >> Does that favor a particular basis of the features? If you have a linear transform of the features, you get a different basis. Is this invariant? For example, maybe the point is not obviously an outlier in one basis, but if you transform it, for example even just rotate the sphere, then it's fuzzy around that. I don't know if you can tell just by looking at this in one basis? >> It's true. So, of course, we've also looked at randomly rotating the data before we chop it up. It helps a little bit, but the randomness increases the variance of the isolation depth, so then we have to make the ensemble bigger. So it hasn't been enough of a gain to be worthwhile to us, because the memory starts to be an issue here. >> Why does this problem arise frequently enough in nature, or in practice, that you should actually care about this? >> Well, in nature, I don't know. In our benchmark datasets and in our fraud detection applications, the features are generally meaningful. We're not doing this on images or things like that, where pixel 25 has no particular meaning at all. So, yeah, I certainly wouldn't use this out of the box on images, that would be crazy. Maybe I would use it on the penultimate layer of a CNN or something. I'm not even convinced that would be a good idea, but yeah. >> Maybe something like a relationship between variables, like say x squared plus y plus z squared, is the meaningful variable for the outlier. >> But usually, just as with decision trees, these are basically looking at higher-order interactions as you go deeper. These things pretty much follow the data distribution. We only put splits where we have data, so they do a pretty good job. There was an interesting paper by Nina Mishra, I think, and colleagues at ICML a couple of years ago, where they analyzed a slight variant of this algorithm and showed that the isolation depth is related to an Lp norm on the distance between the outlier and the data. So in this case, basically an L1 norm. You can sort of imagine how this would be sort of like an L1 distance. If you look at these different algorithms, I think a lot of them are essentially doing something that's very similar.
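Before moving on, here is a minimal sketch of the Isolation Forest scoring just described: completely random splits, the average isolation depth d-bar over the ensemble, and the normalization by the expected depth c(n) from the original Liu et al. paper, giving a score 2^(-d-bar/c(n)) in [0, 1]. This is an illustrative reconstruction, not the speaker's code; the subsample size, tree count, and depth cap are arbitrary defaults.

```python
import math
import random

def expected_depth(n):
    """Expected isolation depth c(n) for n points (Liu et al.'s normalizer)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximate harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth=0, max_depth=10):
    """Split on a random attribute at a uniformly random threshold, recursively."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = random.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = random.uniform(lo, hi)
    return {"dim": dim, "split": split,
            "left": build_tree([p for p in points if p[dim] < split], depth + 1, max_depth),
            "right": build_tree([p for p in points if p[dim] >= split], depth + 1, max_depth)}

def isolation_depth(tree, x, depth=0):
    """Depth at which x is isolated; unexpanded leaves add the expected depth of their size."""
    if "size" in tree:
        return depth + expected_depth(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return isolation_depth(child, x, depth + 1)

def isolation_forest_score(train, x, n_trees=100, subsample=256):
    """Average isolation depth d-bar over the ensemble, mapped to a 0-1 anomaly score."""
    n = min(subsample, len(train))
    d_bar = sum(isolation_depth(build_tree(random.sample(train, n)), x)
                for _ in range(n_trees)) / n_trees
    return 2.0 ** (-d_bar / expected_depth(n))  # near 1 for outliers, around 0.5 for nominals
```

The `subsample` parameter reflects the practice Tom returns to later in the talk: training each tree on a small random subset rather than on the whole dataset.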
I mean, the nearest-neighbor algorithms are calculating some distance, and the Isolation Forest is definitely calculating something like a distance also. Clearly, these things are using a kernel distance, and these are using something like a Gaussian kernel in the original space. So yeah, of course, you could apply PCA to try to rectify the data in some sense. The other method that I want to mention, or I should say, are there any other questions? Yes. >> Do any of your benchmark datasets include behavioral data or sequential data that was generated by capturing traces of someone's behavior, like a network intruder's behavior? >> No. But when we applied this for DARPA, we were in this DARPA program called ADAMS. There, we were looking at an organization that had 5,000 employees. We had actual desktop spyware, monitoring software, installed on their desktops; they were working for some organization in the military-industrial complex, which is all I'm allowed to say. So we could go down and really look at every email message they sent, and even the content and all that stuff. But we summarized that at a daily time step and turned it into a vector of how many emails they sent, how many websites they visited, how many of this and that and the other thing. Then we also did some normalization to try to deal with both asking how unusual this day was compared to the typical day of this user, and how unusual this user was on this day compared to everybody else in the company on that day. Because otherwise, we had these things where some new web link would be sent out to the whole company and everybody would go visit it, and nobody had ever visited that link before, and we would get all excited if we were only looking at one user. So we had to do some pretty serious feature engineering to turn this into something that was almost IID. But then, all these algorithms are purely for that sort of feature vector case. In my weather data, we're again just having to figure out how we can engineer the features to take what is very much a spatio-temporal time series and turn it into little database views that are approximately exchangeable. Okay, well, the second algorithm I want to mention is even simpler; it's called LODA. Given a set of data points, what it does is create a set of capital M, say again 100, sparse random projections. So it randomly chooses square root of d of the dimensions, where d is the dimension of the data, defines a random projection onto one dimension, and then fits a one-dimensional density model. I've drawn it as a kernel density estimate, but they actually use an adaptive histogram estimator. Then what they do is just calculate the average of the minus log density of a query point, so the average surprise. We can see, if we took a data point like this one right here, it would have a pretty high probability in this projection. But of course, if we got some other projection, then it would be mapped to a low-density tail. So, this is an extremely simple idea, but it also works surprisingly well, presumably because there are not super-high-order interactions in our data. So looking at three or four dimensions at a time works pretty well. So then, how do we do our analysis? I point you to the arXiv papers for the details. But we basically chose a couple of metrics: either the area under the ROC curve, which we put through a logit transform, or the log of the lift, which is related to the area under the precision-recall curve.
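Returning to LODA for a moment, here is a rough illustrative sketch of the idea just described: sparse random projections plus one-dimensional density models, scored by the average surprise. The LODA paper uses adaptive-width histograms; this sketch substitutes fixed-bin NumPy histograms to stay short, and the number of projections and bins are arbitrary choices, not values from the talk.

```python
import numpy as np

def fit_loda(X, n_projections=100, n_bins=50):
    """Fit LODA-style sparse random projections with one-dimensional histogram densities."""
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))               # sqrt(d) nonzero dimensions per projection
    models = []
    for _ in range(n_projections):
        w = np.zeros(d)
        dims = np.random.choice(d, size=k, replace=False)
        w[dims] = np.random.randn(k)          # sparse random projection vector
        z = X @ w                             # project the training data onto one dimension
        hist, edges = np.histogram(z, bins=n_bins, density=True)
        models.append((w, hist, edges))
    return models

def loda_score(models, x, eps=1e-12):
    """Average surprise (negative log density) of query x across all projections."""
    surprises = []
    for w, hist, edges in models:
        z = float(x @ w)
        if edges[0] <= z <= edges[-1]:
            i = min(np.searchsorted(edges, z, side="right") - 1, len(hist) - 1)
            density = hist[i]
        else:
            density = 0.0                     # outside the observed range of this projection
        surprises.append(-np.log(density + eps))
    return float(np.mean(surprises))          # higher means more anomalous
```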
Then we just try to set up a linear model to see, across all of these benchmarks, how much of the variation in the metric could be explained by the relative frequency of the anomalies, the difficulty of the points, the clusteredness, the amount of irrelevant features, what we call the mother set, that is, the domain from which the original dataset was drawn, and which algorithm we are using. Because I wanted to ask: a typical finding in machine learning is that the choice of problem is much more important than which algorithm you use, and we were curious whether that was true here or not. The answer is that it's pretty much true here also. So, here are those five factors. Here is the change in the R squared, basically how much variation can be explained in these two metrics. So for area under the ROC curve, if we take a model and delete the dataset feature, how much R squared do we lose, or how much can we assign to the dataset? And that's the most important variable for explaining the observed AUC. Next is the relative frequency. So, if you have a dataset where the anomalies are very rare, we can do very well because we have very little contamination. As the anomalies get more common, these techniques fail, because really only one of them makes any explicit attempt to be robust and use a robust loss, and that's the technique called robust kernel density estimation. All the others are just plunging forward fitting some kind of a model. So, the choice of algorithm is really only the third most important factor, and really not a whole lot more important than the others. But with that caveat, here are the different algorithms that we tested, sorted by decreasing change in the metric compared to a control baseline. What we see is that the Isolation Forest was actually performing better than the other methods, and then there's sort of a five-way tie for second place. Surprisingly, the one-class Support Vector Machine and Support Vector Data Description really performed the worst on these problems, which was quite a shock. Because I think if you asked a random machine learning person, "If I had a one-class problem and I needed to fit it, what would you use?", they'd all say, "Well, gee, how about Schölkopf's algorithm?" We're a little puzzled about this. Of course, it's possible that we just didn't tune these properly. Although we tried a lot of things, we wanted to have a fully automated tuning procedure, so it may be that we didn't do that well. But there have been some papers recently in the literature on auto-tuning those, and we were trying to use the current best practices. My own hypothesis here is that the metrics we're evaluating on are really related to how well you can rank the data points, whereas these methods really just set it up as: are you inside or outside of a particular quantile, a particular level set of the density. So, it's not clear that they've really been trained to do well on these metrics. Maybe that's why they don't do as well as some of these more distance-based or density-based approaches. We also find in general that the Isolation Forest is a bit more robust to irrelevant features, and that it's also a bit more robust to clustered anomaly points. So that may be why it has this slight advantage over the other methods. Okay. Well, let's move on now to thinking about whether we can have some kind of theory to explain what's going on.
Because one of the things we were quite surprised at is that when we look at the learning curves, these algorithms all learn quite quickly. Of course, if we look at what existing sample complexity theory we have for this kind of one-class or density estimation problem, it's often quite discouraging. If you look at full density estimation, you typically need a number of examples that's exponential in the dimension of the data, and that doesn't match our experience. For the quantile methods, there are polynomial sample complexity results. So, our observations are maybe more consistent with something polynomial. So, my colleague Alan Fern came up with this theory we call rare pattern anomaly detection, and it works as follows. We imagine we're going to define, and this is a very Valiant-style pattern-learning kind of thing, a space of patterns or regions. We'll define a pattern as a Boolean function that takes a point in d-dimensional space and maps it to either zero or one. So it might be a half-plane or an axis-parallel hyper-rectangle, some kind of region. We'll let capital H be the set of all the patterns in our pattern space. Then the idea is to define a notion of a rare pattern and a common pattern. To do this, we need to be a little careful about measures. So we're going to assume some baseline measure. If we're in, say, a hyper-box in d-dimensional space, we might use just the Lebesgue measure, the uniform measure over that. But if we're in unbounded real space, we need to use something like a standard multivariate Gaussian as our baseline measure. We'll let mu(h) be the measure of that pattern, pattern h. Then let p be the probability density of the nominal data points, also defined over R^d, and we'll let p(h) be the probability assigned to pattern h. So, we'll say a pattern h is tau-rare if the sort of normalized probability density is less than or equal to some threshold tau, and otherwise we'll say that pattern is tau-common. Then our definition of what we want is: we're going to say a data point, a query point, is tau-rare if there exists a tau-rare pattern that that data point belongs to. So that data point belongs to pattern h, and h is tau-rare. It means there's some way of looking at the data, some bin in the data that this query point x belongs to, that is very unusual under the nominal distribution. So then our goal, an ideal anomaly detection algorithm, would output all of the tau-rare query points in a test set and not output any of the tau-common points; that would be the ideal. I guess what's interesting about this model is that instead of trying to estimate the entire density, or even trying to estimate a level set of the density, we're just choosing some set of regions in the space and trying to figure out roughly what their probability is. I think this is why, if we don't have too many of those regions, we can get uniformly good estimates over most of them from a relatively small sample. So, we define an algorithm to be probably approximately correct for RPAD if, for any probability density p and any threshold tau, it draws a polynomial-size sample and then, with probability one minus Delta, it detects all the tau-rare points and does not output any of the (tau plus Epsilon)-common points. So we have to introduce a sort of elbow-room parameter to allow the algorithm to make some mistakes.
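For reference, the definitions in this passage written out in notation (my reconstruction from the talk, not a quote from the paper):

```latex
\[
h : \mathbb{R}^d \to \{0,1\}, \qquad h \in \mathcal{H}
\qquad \text{(a pattern: a half-space, an axis-parallel hyper-rectangle, etc.)}
\]
\[
h \text{ is } \tau\text{-rare} \;\iff\; \frac{p(h)}{\mu(h)} \le \tau,
\qquad \text{otherwise } h \text{ is } \tau\text{-common}
\]
\[
x \text{ is } \tau\text{-rare} \;\iff\; \exists\, h \in \mathcal{H} \text{ with } h(x) = 1 \text{ and } h \text{ is } \tau\text{-rare}
\]
```

Here p(h) is the probability of pattern h under the nominal density and mu(h) is its baseline measure (Lebesgue measure on a bounded box, or standard Gaussian measure on unbounded space).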
So if a point falls between tau-rare and (tau plus Epsilon)-rare, the algorithm is allowed to do anything it wants there. What we end up with is an algorithm such as the following. We draw a polynomial-size sample, and we'll talk about what that is in a moment; then we just estimate p-hat, the fraction of the sample that belongs to each of our patterns, and then we can calculate the normalized rareness. Then we just say a query point is declared to be an anomaly if one of these tau-rare patterns is made true by that query point. We can prove a sample complexity result that says the number of samples we need is on the order of one over Epsilon squared times the log of the number of patterns plus the log of one over Delta. So, a very typical PAC result. It's a little unfortunate that this is Epsilon squared here; that means if we want Epsilon to be 0.01, then we're going to need 10,000, so this will be a big number, a big constant. Then we can get a VC-dimension version of this, where the VC dimension plugs in here, times the log of one over Epsilon squared. So, there are lots of different practical algorithms you can design with this procedure: half-spaces and axis-aligned hyper-rectangles, and in our paper, we worked out the VC dimension of all of these. In particular, we designed a version of the Isolation Forest that exactly matches this theory. To do that, the pattern space has to be chosen a priori in this theory; it can't be adaptive to the data. So to deal with that, we use the usual trick of taking our training data and dividing it in half, using half the data to grow an isolation tree to some fixed depth k, which fixes our pattern space, and then on the second half of the data, we can estimate the probability of all of these bins. So, here are some examples on some datasets, where the plot on the left is for the actual Isolation Forest algorithm grown to a fixed depth k, and this is our theoretical version of it, also at depth k. The vertical axis is the area under the ROC curve, and I apologize that the axes are different in the two plots. Then these different lines are for different depths of tree. If we just look in this case, the depth-one tree actually works the best. We see that in this case, the pattern-min method actually does a little bit better than the Isolation Forest on the area under the ROC curve. For another dataset, Particle, the Isolation Forest does much better, because notice this goes from 0.71 down to 0.66, and this goes from 0.66 down to 0.52. So all of these curves are basically below all of these, so the Isolation Forest was much better. Then here's a dataset, Shuttle, where RPAD, the theoretically inspired algorithm, is actually doing quite a bit better than the Isolation Forest. This time, I did get the axes aligned. So I think qualitatively, well, the shapes of the curves aren't always the same, and there's sometimes some funny business going on with very small samples. It is particularly weird that we do very well with very small samples, and there's an interesting phenomenon going on here that we don't fully understand. When you have a training sample and it includes some amount of contamination, a standard practice with the Isolation Forest is to train on subsets of the data, not on the entire dataset, and the idea is that you're kind of thinning it out. It's a sort of robustification strategy: by training on smaller sub-samples, the absolute number of contaminating points is reduced.
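For reference, the sample-complexity bound stated near the start of this passage, written out (a reconstruction of the talk's wording, not the paper's exact statement or constants):

```latex
\[
N \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(\log\lvert\mathcal{H}\rvert \;+\; \log\frac{1}{\delta}\right)\right)
\qquad\text{(finite pattern space)}
\]
\[
N \;=\; O\!\left(\frac{1}{\epsilon^{2}}\left(d_{\mathrm{VC}}(\mathcal{H})\,\log\frac{1}{\epsilon^{2}} \;+\; \log\frac{1}{\delta}\right)\right)
\qquad\text{(VC-dimension version, as described)}
\]
```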
The fraction isn't reduced, but it seems that if you have a tight cluster of points, you're reducing the local density of that cluster. I think this helps the algorithms do a better job, but we don't have a good theoretical understanding of it. Empirically, though, even the original authors of the Isolation Forest algorithm say that you should train on small subsets of the data rather than the entire dataset, and that this often helps. For those of us in machine learning it's like, "What, you're not using all your data?" But of course, we're building an ensemble, so across the whole ensemble, we're using all of our data. The smaller the subsets, the bigger the ensemble needs to be, because you also have to reduce the variance that you've introduced by choosing these random subsets. So there's a paradox here, where the small samples are actually performing well. It's not observed all the time, but it's observed enough of the time that we think it's a real phenomenon, and we're trying to find a nice theoretical understanding of it. Okay. So the conclusion from this is that we feel this rare pattern anomaly detection theory captures a lot of the behavior of the Isolation Forest or of LODA, because these methods are solving a simpler problem than full density estimation or even estimating level sets of the density. So, I guess maybe I should pause and ask, are there any questions about that? Yes. >> How does this algorithm compare on high-dimensional data, very high-dimensional data? >> Well, I would guess that on very high-dimensional data, you need to use a very large ensemble of trees. Let's see, what's the right way to talk about this? If you have a distribution of the data, you want to have these rare patterns kind of surrounding the data. Obviously, one intuition is that in a high-dimensional space, you're going to need a lot of those patterns surrounding the cloud of data. So, the number of patterns you might need to really tile the entire region might grow exponentially. Our result says the sample size is polynomial in the number of patterns, but the number of patterns that you need might be exponential. So, that's a good question. >> Have you tried this on a bag-of-words? >> We have not tried this on a bag-of-words thing, for example, yeah. Now, one might hope, and I'll do the standard machine learning thing and say, "Oh, but the data really live on a low-dimensional manifold," so all we have to do is identify that manifold. Well, we know that tree-type algorithms will in some sense follow that manifold, so maybe we'll still do okay. I'm waving my hands. Yes. >> So, it seems like it still requires knowing the pattern space. But then, if you're dealing with anomaly detection, many successful intrusions happen exactly because there is some sort of behavioral pattern that you didn't take into account, and so the adversary was able to exploit it. So, is there a way to extend or fix this theory to at least give you some indication that there's something in the data for which you maybe don't have a pattern, so that you might be missing an anomaly? >> Well, if we think about what, for instance, the Isolation Forest does, it actually partitions the entire input space into patterns. So, there is always a pattern for every possible observation. Many of those patterns will be essentially empty. In fact, I skipped over a little detail of the Isolation Forest.
When you do a split at a node and you split the data, there are little empty spaces between the observed range of the data in the children and the observed range in the parent. A query point can actually fall into that empty space in between the observed ranges. I'm being very inarticulate here about this. But if I have a set of data points along a line and I choose an arbitrary split, there's going to be some space between adjacent instances, and in effect the algorithm allocates two additional patterns for that empty space. Because if a query point comes down, it's considered to be isolated if it falls into this space, inside the range of the parent and outside the range of the child. That turns out to be really important. If you don't do this right, then the performance of the algorithm suffers quite a lot. So, I guess my first point would be that if we use some of the data to build the tree structure, and if the tree structure is really following the structure of the data, then we will have tiled the space nicely. Every leaf node, by construction, is going to be rare, because it will have at most one point in it. So we're going to have a lot of regions that have essentially zero points or one point in them, and we've also completely tiled the space. So, if there is an unusual behavior, it will basically fall into one of these empty spaces and will be detected. >> Then how do you tell between interesting anomalous behaviors and uninteresting anomalous behaviors? Because essentially, you won't be able to do that. >> Well, yeah, because we're only doing outlier detection here, and we have no labels. We can't tell the difference between an outlier that's part of the natural variation of the distribution and an outlier that's due to some unusual behavior caused by something different, like a fraudster or something, yeah. In fact, of course, a fraudster is going to try to mimic the nominal distribution, and so we might want to try to use some essential behavior, things that the fraudulent user is not able to manipulate, to help us detect their behavior. But all I'll say is that in the competitions inside this DARPA program, we were number one in several months using these kinds of techniques. So, maybe that's a tribute to our feature engineering skills, or maybe the red team made it too easy, I don't know. But these methods were working pretty well. We were competing against an IBM team that actually has a product in this space, so I don't know. Anything else? Okay, well, let me move on now to assuming that we can find these anomalies, how can we explain them to the user? Our basic idea is the following. We were really troubled by what constitutes a good explanation, and how we could measure that in a rigorous way without doing human subject studies. So, here is the thing that we came up with. We imagine that the goal of our anomaly detection system is to persuade the analyst that a candidate anomaly is a real anomaly. The idea is we're going to expose one feature at a time to the analyst, with an appropriate visualization tool, in the worst case until we've presented all of the different features, until we've either convinced the analyst that it really is an anomaly or the analyst says, no, it isn't. We sort of view the analyst's job as really being to decide whether to, say, launch an internal investigation, like a fraud investigation, or not. So, this is the kind of model here.
We call this a sequential feature explanation. We had a paper at UAI two years ago on this. In pictures, the vertical axis here is the expert's or analyst's assessment of the probability that the data point is a nominal, not an anomaly, given the features they've seen so far. We'll say that notionally it starts out pretty close to one, say one minus Epsilon, where Epsilon is the background probability of being an anomaly. Our sequential feature explanation is some permutation of the features; say feature number three is the one we think is most important. So we start presenting these features to the expert. If we're being persuasive, the expert's posterior belief that this is a nominal point starts to drop, and hopefully goes to zero. What we do is choose some threshold, and our metric is something we call the minimum feature prefix, MFP, which in this case would be four: it took the first four features in our ordering to drive the expert's assessment down below, say, 0.25, whatever that threshold is. So this MFP is our metric. Across a whole dataset of anomalies, we'll look at the average of that as our performance metric. If our anomaly detector is working well, or if we have a good explanation algorithm, it should produce short explanations. So this is sort of the minimal kind of thing we might do, and obviously it's only appropriate for problems where the features are meaningful. In this case, our technique relied on another one of our anomaly detectors, which is an ensemble of Gaussian mixture models, the EGMM algorithm. We had originally developed this just as a control. We apply PCA, and then in the PCA space, we fit an ensemble of Gaussian mixture models, where we randomly vary the number of mixture components and the starting initialization, and we discard mixture components that didn't fit the data very well. For the ones that survive, our anomaly scores are again the average surprise across those models, so it's the product of the density estimates, I guess. Then we use a simple greedy algorithm. The score of a particular feature is how much the log likelihood drops when we marginalize it out, or, I've got this backwards, there should probably be a minus sign here: it's how much the surprise decreases when we compare the marginalized distribution to the original distribution, right there. The point will always have a higher probability density after we've marginalized out that one feature than it had under the full distribution. So, we score each feature, and the one that led to the biggest change in the density, we say that's the feature with the biggest impact, and then we remove it from the set of candidate features, add it to our answer, and just do that in a greedy fashion. Again, we used a bunch of our datasets; we only used seven of the UCI domains instead of 19 in this study. Instead of using real human analysts, we trained regularized random forests on the supervised version of the data, so they can see all the data. Then the question was whether we would convince them: after they saw, say, the first four features, had their predicted probability that the point belongs to the nominal class rather than the anomaly class dropped below some threshold? So, we compared against a random ordering of the features and against an Oracle that actually knows the labels of the data. What we saw was that we always did better than random, and sometimes we do almost as well as the Oracle.
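To make the greedy marginalization scoring described above concrete, here is a small sketch. It substitutes a single multivariate Gaussian for the PCA-plus-EGMM ensemble used in the talk, and simply orders features by how much removing (marginalizing out) each one reduces the surprise -log p(x); the SciPy usage and variable names are my own illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def greedy_feature_explanation(mean, cov, x):
    """Greedily order features by how much marginalizing out each remaining one
    reduces the surprise -log p(x) under a fitted Gaussian density.
    mean, cov: parameters of the fitted density; x: the candidate anomaly (NumPy arrays)."""
    def surprise(dims):
        dims = np.asarray(dims)
        return -multivariate_normal.logpdf(x[dims], mean[dims], cov[np.ix_(dims, dims)])

    remaining = list(range(len(x)))
    order = []
    while len(remaining) > 1:
        base = surprise(remaining)
        # The feature whose removal reduces the surprise the most is the one most
        # responsible for the anomaly, so it is shown to the analyst first.
        drops = {f: base - surprise([g for g in remaining if g != f]) for f in remaining}
        best = max(drops, key=drops.get)
        order.append(best)
        remaining.remove(best)
    return order + remaining
```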
So, this is the random ordering, this is the Oracle in blue, and this is our greedy algorithm. Sometimes we don't do as well. I guess one thing that's kind of interesting is that for most of these Irvine-type domains, on average, the Oracle only needs to show the analyst one or at most two features to give a convincing explanation. So, that's kind of interesting. This gives us a kind of metric for how hard a dataset is. If we look at the KDD Cup 1999 dataset, for instance, it's always the case that you need only one feature; even our algorithm only needs to use one feature to explain an anomaly. Whereas if we look at our ADAMS insider threat database, you need about seven on average, so it's a much more challenging problem. Well, if we're going to be talking to an analyst and convincing them, we can also get feedback from the analyst. So we also wanted to look at whether we could take the feedback from the analyst and feed it back into our anomaly detection process to re-rank the candidates that we have. We had a paper at ICDM a year and a half ago showing that this could be a potentially big win. What we do is require one parameter, which is a quantile; it basically corresponds to how many cases an analyst can look at in a given time period, like a day or a month, whatever the time period is that we're analyzing. So the idea is that the analyst can look at, say, 60 cases in a month, and we want to optimize the 60 cases we present. We start by presenting to the analyst the highest-scoring anomaly we have, and then the analyst says, "Yes, that's a case I'm going to open an investigation on," or "No, it isn't," and we feed that back. What we've tried to do is learn to reweight. In this case, we're working with the LODA anomaly detector, which, remember, creates a bunch of one-dimensional random projections. So we learn to reweight those projections to try to rank all of the known anomalies above quantile tau, and all of the known nominals below quantile tau. To do this, we used a modification of a method from Stephen Boyd and famous colleagues called Accuracy at the Top, which was published at NIPS several years ago now. So, we're re-ranking based on feedback from the expert, and we call this active anomaly detection. I guess the only interesting thing is that if the anomalies are fairly rare, then we will have a significant false alarm rate. So it could very well be that of the first three, or four, five, ten cases we show to the analyst, none of them are actual fraud cases. What can we do? Well, one of the advantages of having this threshold tau is that even with no positive examples, we still learn to push down all of those non-anomalies below the threshold tau. So this seems to help us learn even without any positives, though we had to make some minor modifications. So, we compared against a baseline where we do no learning; we just take our initial anomaly scores and present the cases in that order. We looked at randomly ordering them, and then there's our method. Then there are two other methods in the literature. One is called AI2, which I don't think has ever been published except as a CSAIL tech report, so it's probably a class project or something, although it works quite well in some cases. Then there's something called SSAD, which was in a JAIR paper; it's a method from Görnitz et al., called Marginalize and Cluster, I think.
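Here is a heavily simplified sketch of the feedback loop just described: reweight the LODA projections so that labeled anomalies score above the current top-tau quantile and labeled nominals score below it. This is an online hinge-style surrogate I wrote for illustration, not the actual Accuracy-at-the-Top optimization from the paper; the learning rate, epoch count, and normalization are arbitrary choices.

```python
import numpy as np

def reweight_projections(S, feedback, tau=0.03, lr=0.01, n_epochs=10):
    """Reweight LODA projections from analyst feedback: push labeled anomalies above
    the score at the top-tau quantile and labeled nominals below it.
    S: per-projection surprises, shape (n_points, n_projections).
    feedback: list of (point_index, is_anomaly) pairs from the analyst."""
    n_points, n_proj = S.shape
    w = np.ones(n_proj) / n_proj                 # start from LODA's uniform weighting
    for _ in range(n_epochs):
        scores = S @ w
        q_tau = np.quantile(scores, 1.0 - tau)   # score of the instance sitting at quantile tau
        for i, is_anomaly in feedback:
            if is_anomaly and scores[i] < q_tau:        # anomaly ranked too low: push it up
                w += lr * S[i]
            elif not is_anomaly and scores[i] > q_tau:  # nominal ranked too high: push it down
                w -= lr * S[i]
        w = np.maximum(w, 0.0)
        w /= w.sum() + 1e-12                     # keep a valid convex combination of projections
    return w
```

In the loop described above, the analyst's yes/no judgments would populate `feedback` one case at a time, and the re-ranked scores `S @ w` would pick the next case to show.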
We thought we had an idea for improving it using our Accuracy at the Top method, but as you'll see, it was not an improvement; it's included just for completeness. So, in these plots, along the horizontal axis is how many cases we have presented to the analyst, and the vertical axis is how many of them turned out to be true anomalies. Again, all this is done with a simulated analyst, and so on. We can see that our baseline method is this blue curve. So, if we have a budget of 60 cases we can look at, we find about, I don't know, 25 or 26 or something. Whereas if we can incorporate our feedback, we get 45. So, a huge advantage from having the analyst in the loop. However, this is the KDD Cup 1999 dataset, which is extremely easy, so let's move on to some other ones. This is the Abalone dataset. Here, early on, we get some benefit of our method over our baseline. No feedback is the blue curve, and ours is the red curve. Then this is the SSAD method, which is doing just about as well as ours. But if we say, well, suppose we only had time to look at 20 cases, then we'd still be seeing more than a factor of two improvement. Eventually, obviously, if you go out far enough on this, there will be no benefit, because eventually we'll have shown everything to the analyst. Here's another one, for Thyroid. In this case, again, compared to our baseline, we're doing pretty well. In this case, the AI2 method is doing almost as well as we are. But again, the benefit is potentially almost a factor of two in terms of the number of anomalies we can detect. For Mammography, it's a similar story. So in general, for all of these, we're seeing that incorporating the analyst feedback can be really powerful, at least for this LODA algorithm. But now, let me speculate about why this might be overly optimistic. We have to remember that these datasets we've constructed here are constructed from supervised learning datasets. What's going on when we reweight these different projections is that we're essentially reweighting the original features and putting more weight on some than on others. So, it could be that because our anomalies are coming from classes in the classification problem, and they were presumably determined to be classes because they were believed to be separable using the features, we were essentially, in a sort of sneaky way, learning a classifier and taking advantage of that. So, it's not clear that in a real fraud detection case we would get this same factor of two improvement. However, we have a paper under review right now at KDD, where we're doing reweighting for the Isolation Forest using a sort of online learning strategy. We have applied this to an advanced persistent threat cybersecurity case, which is not a supervised learning case, and we're still seeing a substantial benefit, not a factor of two, but a substantial benefit from integrating the feedback. Yes. >> Would it help to add label noise, to answer that question? >> Label noise? >> So you have your classes. >> Right. >> If you put in some label noise, it's going to start mixing the classes, so they wouldn't be separable anymore. >> Oh, you mean when we construct the benchmarks. That's interesting. That's interesting. We should maybe look at that. That would be a good test, at least for this particular issue. Yeah. I guess another thing is, we could create benchmarks where our nominal points come from, say, just one class and the anomalies are from a mixture of all the other classes.
Or maybe we should use lots of classes on each side, so that there's probably not a simple decision boundary that separates the nominals from the anomalies, because there will be very multi-modal distributions on both sides. Yeah. >> So the feedback you have here is very example-driven, right, with the analyst judging things. It seems like you could also make it feature-driven, or, going back to your pattern terminology, pattern-driven. In fraud cases, people are often interested in patterns like a small gas station purchase followed by a big purchase, right? And then the anomalies are evidence that this pattern exists. Have you thought about trying to order things in other ways, where people might be able to give more sample-efficient feedback over an entire feature, as to whether or not they believe something anomalous is happening? >> We have not. We did try to do some sort of model-based work back in the insider threat dataset, where we had some theories of insider threats and we designed features specifically to look for those threats, like someone copying a large number of files onto a thumb drive. But you have in mind maybe more of a knowledge discovery setting, where the analyst looks at our anomalies. Or maybe, as we've been discussing today, you could do some cluster analysis of these anomalies, the way you guys are clustering mistakes, and say, "Oh, I see sort of a pattern in this, and this looks completely bogus to me; this looks like just some side effect of some instrument, so I don't want to see any more cases like this," or, "This looks interesting to me, so everything that looks like this, let me label that whole cluster." Yeah. >> Okay. So how you select the examples could then be a kind of set-based criterion, right? If you want to exclude a whole feature and you just randomly select examples, chances are you won't exclude that feature, whereas if you start selecting them in conjunction, you can do that. >> Yes. We did look at whether selecting the biggest anomaly is the right thing to do. And Weng-Keen, you may or may not know, when he was a graduate student, was working on rare category discovery, and there you do have some diversity tricks. So we could think of this as also, could we elicit from the expert, again consistent with some of the things I've been hearing about here, not only that this is an anomaly, but "I'm going to call that a fraud anomaly, because this corresponds to a kind of use case I know; this is the check-that-the-credit-card-is-working-before-you-use-it pattern." No, we haven't thought about that, but it's a great idea, obviously. Yes. >> Can you still reward diversity? Because diversity will definitely change the efficacy of a query versus an anomaly. So, it seems like a quick trick that could completely change your metric, or is that not true? >> No, we would really have to also have something in our metric that rewards diversity. Well, as you say, if we could essentially present a group of proposed anomalies and say, we think these points are all anomalous for the same reason, are we right? And get feedback on the whole group at once, that would be great. Then for the next query, we would obviously use a different group. But we might have to balance them, I know, in real situations. I remember years ago, I visited Bell Labs, and Darrell [inaudible] was working on payphone fraud for, well, it wasn't even AT&T at the time, and they had some statistical model and they were looking for weird behavior.
But they also ranked them by how valuable they were in terms of the financial impact of those frauds. So, that was it. Okay, well, I've got to move along because we're really behind. Let's talk about these two applications. The first one is data cleaning in weather networks. I'm part of a group that's working on something called the Trans-African Hydro-Meteorological Observatory, and it was founded by two folks. One is a colleague of mine, John Selker, who's a hydrologist, and the other is a guy named Nick van de Giesen, who's at the Delft University of Technology, also a hydrologist. They were at some World Meteorological Organization meeting at which this plot was put up, showing where the weather stations are around the world that reliably report data to the World Meteorological Organization. Those purple-blue dots are the good weather stations. We can see that in most of Latin America and Africa, the weather infrastructure is just very poor. So, in the context of agriculture in Africa, most agriculture is rain-fed; there's not a lot of irrigation. They noted that when you have very poor sensing, you can't buy crop insurance, because the insurance companies don't know what risk they would be taking; they don't know how to price the insurance. So that leads the farmers to plant their plants very far apart, so that each plant, under the worst-case precipitation conditions, will still get enough water to do well. There have been field trials showing that if you do offer crop insurance, then the farmers plant much more aggressively, plant their plants much closer together, and get a much higher agricultural productivity as a result. So, this is one of these examples where, if we could provide better data, it could lead to a crop insurance industry, which could in turn lead to higher agricultural productivity. John Selker is not a modest man. He wants to have Africa leapfrog over the whole rest of the world and become the best-sensed continent. Of course, the price of sensors is dropping rapidly, and we can now build weather stations that are very cheap. So, we're trying to deploy 20,000 such stations across Africa and then create various data products. I like to say, "Well, weather stations are really cool, but for some reason, people want to put them out in the weather and expose them to the rain and all kinds of problems, and then they break." So, this is a kind of Internet of Things project, and I like to say the Internet of Things is going to be the Internet of broken things. So, how can we automatically detect which stations have failed, which sensors have failed? We might approach this as a sort of medical diagnosis problem: if we could say, "Well, these sensors tend to get one of five different diseases," can we learn to recognize the symptoms? But this is one of those cases where we feel like there seems to be an unbounded number of ways these weather systems can fail. So, where there are known symptoms, we can write rules for them or build probabilistic models, but we want to use anomaly detection to detect new kinds of failures. I think in the interest of time I won't go through all the technical stuff here. But I'll just say that we are combining anomaly detection scores in a dynamic Bayesian network, where we reason about the state of each sensor.
Then we're performing MAP inference to try to say, given the observations that we've turned into anomaly scores, not just flag the data with anomaly flags, but actually do some root cause analysis, and say that the reason you got these 2,000 alarms is that one particular thermometer failed for some time interval. I guess I should say, maybe the most interesting or weirdest move here is this: a typical problem when you try to do diagnosis on this is that you want to model the sensor readings as a function of the state of the sensors. But these sensor distributions can be very messy, and so we replace the actual distribution of sensor readings by the anomaly score assigned to those sensor readings. This has the nice virtue of normalizing everything to be between zero and one and having a more predictable distribution. At least that's our hypothesis, but we don't have definitive results to prove it. I'll say that we're currently working a lot on comparing sensor readings at multiple stations. Then the other problem, which is more related to the open world, is open category classification. Given that I have training data for, say, K classes that I've collected, and I've trained a multi-class classifier for them, but my test data may contain queries corresponding to new classes, how can I detect them? We call them alien queries, the queries that belong to new kinds of objects. Our approach is, given our training data, we train both our multi-class classifier and also an anomaly detector. Then when an input query comes in, we look at the anomaly score, and if it's greater than some threshold, we reject that point and say, "We think this might belong to a new category, so we're not going to classify it." Otherwise, we feed it through our classifier and make our prediction. For me, this was motivated by another ecological problem. For about a decade, I was working with some folks at the EPA and on campus on freshwater stream health monitoring. Every year in the United States, the EPA does randomized sampling of freshwater streams, and they use a kick-net, which is what these guys are doing here, to collect insect larvae that live on the substrate of the streams, and then they bring those specimens into the lab. Currently, they manually identify them to genus or species, but this is very time consuming, and so we thought, "Well, we'll build a robotic device that can singulate the specimens, as they say, and then photograph them, and then we'll use computer vision to tell them apart." We did this back in the mid 2000s, when everything was bag-of-SIFT and Support Vector Machines as the classification technology. But we got up into the low 90s percent correct, over ultimately 54 categories. So we were all excited about that, and then we started to look at actually deploying it, and we realized that although our collaborators had collected specimens and we had these beautiful training data for all the specimens, the test samples were going to contain other stuff. It turns out there are more than 29 species, more than 54 species, out there; it's estimated there's something like 1,200 species of freshwater macroinvertebrates, as they call them. So this led us to think about how we could deal with this problem. Of course, more recently in computer vision, this has become a popular problem.
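The routing logic described above is simple enough to state in a few lines; here is an illustrative sketch (the function and argument names are placeholders, not from any particular library):

```python
def classify_or_reject(x, anomaly_score, classifier_predict, threshold):
    """Open category routing as described: run the anomaly detector first,
    reject likely aliens, and only classify queries that look like the known classes."""
    if anomaly_score(x) > threshold:
        return None                      # treat as an alien: possibly a new category
    return classifier_predict(x)         # otherwise fall through to the K-class classifier
```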
So, there are several techniques out there, but as far as we know, no one has looked at the question of whether we can provide any kind of theoretical guarantees. We are looking at a setup now where we have two training sets. One is our clean training data: our collaborators went out and collected it for us, it's all nicely labeled, and we know there are no contaminating samples in it. But in addition, we're going to require an unlabeled collection of samples that are representative of our test queries. So here, we are making a well-defined anomaly distribution assumption — that we can get a sample from the distribution of anomalies. In fact, even to talk about what it means to have a PAC-style guarantee on detecting, say, 99 percent of all the aliens, I'm already assuming it's a sensible statement that there is such a distribution.

So, we're going to assume that this contaminated sample, which we'll call S_m for "mixture," is a mixture where, again, the proportion of aliens is Alpha. That is, its density is (1 − Alpha) times the clean distribution p_0 plus Alpha times the anomaly distribution p_a. Now, suppose this is the CDF of the anomaly scores of the anomaly points. Say we wanted to detect 95 percent of the aliens or anomalies; then we want to know this quantile — the anomaly score such that 95 percent of the probability mass is greater than it. So we'd like to estimate this threshold tau at 0.05, or 0.01, or whatever we want. We'll call this q, the 0.05 quantile.

We note that since this is a mixture distribution and we're in a one-dimensional world, the same relationship holds for the cumulative distribution functions: the CDF of the anomaly scores for the mixture distribution is a convex combination of the nominal CDF and the anomaly CDF, F_m(s) = (1 − Alpha) F_0(s) + Alpha F_a(s). So if you do a little algebra, the anomaly CDF can be calculated as F_a(s) = (F_m(s) − (1 − Alpha) F_0(s)) / Alpha. We can estimate F_m from data, and we can estimate F_0 from data. If someone will just tell us Alpha, or give us a reasonable upper bound on Alpha, then we can use that to estimate this anomaly CDF.

So, the picture looks something like this. This might be the CDF of the clean data, the black one is the mixture data, and maybe this is our anomaly CDF. We're going to set our threshold somewhere around in here, maybe. We'll get all the anomalies, but we will have some number of false alarms too. Of course, this all depends on how good the anomaly detector is and how well it can distinguish the aliens from the known classes. When we look in practice, this is the empirical CDF of the clean data, this is the empirical CDF of the mixture data, and of course we're essentially subtracting them, so unfortunately the estimated CDF of the anomalies can go negative. It's pretty sad. But if we have a larger sample, it starts to look reasonable. It still can cross this threshold a few times, so we look at the largest value of the anomaly score that crosses this 0.05 line and we set our threshold there. That's what this pseudocode says. Then we're able to get a guarantee: if the number of samples in these two datasets is equal to n, then the required sample size is something like this.
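Since the pseudocode itself isn't reproduced here, the following is a minimal NumPy sketch of my reading of that threshold-selection step: build the empirical CDFs of the anomaly scores on the clean and mixture samples, plug them into the formula above to get an estimated alien CDF, and take the largest candidate score at which that estimate is still at or below q. The function names and the grid of candidate thresholds are my own choices, not from the paper.

```python
import numpy as np

def empirical_cdf(scores, grid):
    """Fraction of `scores` that are <= each value in `grid`."""
    scores = np.sort(np.asarray(scores))
    return np.searchsorted(scores, grid, side="right") / len(scores)

def estimate_alien_threshold(clean_scores, mixture_scores, alpha, q=0.05):
    """Pick an anomaly-score threshold intended to catch about (1 - q)
    of the aliens, given contamination fraction `alpha` (or an upper bound).

    Uses F_a(s) ~ (F_m(s) - (1 - alpha) * F_0(s)) / alpha, where F_0 and F_m
    are the empirical CDFs of the clean and mixture anomaly scores."""
    grid = np.sort(np.concatenate([clean_scores, mixture_scores]))
    F0 = empirical_cdf(clean_scores, grid)
    Fm = empirical_cdf(mixture_scores, grid)
    Fa = (Fm - (1.0 - alpha) * F0) / alpha   # noisy; can dip below 0 or exceed 1
    ok = np.nonzero(Fa <= q)[0]
    # Largest candidate score where the estimated alien CDF is still <= q;
    # fall back to the smallest score if the estimate never drops that low.
    return grid[ok[-1]] if ok.size else grid[0]
```

Queries whose anomaly score exceeds the returned threshold would then be rejected as possible aliens, as in the reject-or-classify sketch earlier.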
We have a one-over-Epsilon-squared term, we have something that's sort of a one-over-Alpha-squared term, and then we have this, which is a funny thing that comes from the fact that we're estimating two different cumulative distribution functions and there are concentration results for empirical CDFs. Then, with probability one minus Delta, the alien detection rate will be at least one minus (q plus Epsilon). So once again, we need some elbow room in the form of an Epsilon. Anyway, that's what we can do. Experimentally — I didn't put the results here — we have a paper under review for ICML where we also show experimental results for this. Of course, it's a loose bound, but it is a bound, and it does work in practice. We have results on some image datasets and some benchmark datasets, and using Isolation Forest as our anomaly detector, we get false alarm rates somewhere in the 10 to 20 percent range for reasonable sample sizes. So we're very pleased with that.

So, to finish. For outlier detection, I talked about the different unsupervised and mixed settings. Our benchmarking suggests that the Isolation Forest algorithm is quite nice and, surprisingly, that these percentile or quantile methods are not doing so well. I should say in passing that these things obviously don't work at all well by themselves for image data — we've kind of discussed this already — or any kind of signal or time-series data. So, for next steps, I'm very intrigued by things like GANs, autoencoders, and other manifold learning and density estimation techniques coming out of the deep learning literature, and that's the next thing I want to try. There have been some very nice papers at the last couple of NIPS about this, and I think it might give us much better anomaly detectors than we can get using these simple i.i.d. feature-based methods. Then we talked about the RPAD theory, which I think accounts for their behavior reasonably well. Then we talked about presenting the most important features as a simple ordered feature explanation, and incorporating expert feedback using this Accuracy at the Top strategy. Finally, guarantees for Open Category Detection. I apologize for running long; thanks to my funders, and I'll open it up for questions.

>> Thanks a lot. >> Thank you very much. Yeah. >> So, at the beginning of the talk, you mentioned that there was a fundamental assumption about the nominal distribution and the anomaly distribution? >> Well, the assumption I wanted to call out is: is it safe in my application to assume that the anomalies come from a well-defined distribution? >> I'm curious, can you sketch how to weaken such an assumption to allow the fraudster also in the loop — not just the analyst in the loop — but an arms race between the anomaly detector and the people who are trying to game the thing? >> Well, I don't know how to do that in a theoretical framework. It seems to me that we end up having to think about it more like a game — a minimax kind of adversarial learning setup instead — and try to get some guarantees that way, so that would be quite different from this. Of course, there is some interesting work in adversarial learning for supervised learning, and everybody in deep nets right now is trying to deal with adversarial examples in deep learning.
Obviously, some of the strategies being explored — like using GANs and trying to map queries onto the data manifold to strip out these adversarial attacks — are very closely related to the same idea: could we detect the adversarial queries as outliers that fall outside of our data distribution? So, I think anomaly detection will again be part of the answer, but this theory for open category detection will not be part of the answer there; we need some other, more adversarial approach.

>> And then just a clarification: was the DARPA ADAMS thing run as an arms race between red and blue teams, or was it? >> Yeah, so there was a red team that was actually based on people from CERT, the computer security guys. They have this whole white paper on different kinds of insider threats in organizations that they've collected both from DoD and from industry. My favorite one is what they called the insider startup: a sub-team decides, "The company doesn't appreciate us; we're going to take the source tree with us and go form a startup." But there were other things — someone who's passed over for promotion, survivor guilt — so they have this whole set of scenarios, a lot of them very psychological. Then they hired a scriptwriter to write scripts about what kinds of emails and text messages these people would be exchanging in addition to their desktop data, they acted all of it out under the desktop monitoring software, and then overlaid it on the behavior of real employees in the company. I had access to this in a very controlled setting, and we had to see if we could find those red-team inserts. But unfortunately, we don't really know how hard they were, and of course we don't have access to the data anymore. It would have been interesting, once we had the ground-truth labels, to go and see: could we train a supervised learner, like a nearest-neighbor classifier, and figure this out quickly? We don't know.

>> In the weather station example, I'm curious. Mostly, the anomaly detection algorithms you talked about seemed to be optimizing for low false alarm rates and high detection rates. But in some systems, depending on how you're reacting to the anomaly, you might be able to tolerate lots of false alarms, depending on what's important and how you're recovering from a problem. I was curious: in the weather station case, what happens when you detect an anomaly? >> Well, in all existing weather networks that I'm aware of, they have a lot of flags and then they rely on some human eyeballs to look at them all. For instance, the Oklahoma Mesonet is the best weather network in the United States. They have a big budget, and they have four full-time meteorologists for 130 stations. Scaling that up to 20,000 stations in Africa would need a heck of a lot of meteorologists. I just don't think that strategy scales to the Internet of Things, and that's why I think we need to bring much more intelligence to bear. Also, most of the existing weather networks basically use hand-coded expert systems for writing these problem-detection rules, so they are completely ad hoc. So I think this is, in some sense, low-hanging fruit for us, because by bringing probabilistic inference on top of the alarming, we should be able to help a great deal: we may still have lots of alarms, but we can say these 1,000 alarms are really just due to one failure, and create a trouble ticket for it.
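To make that last point concrete, here is a small sketch — my own illustration, not code from the project — of collapsing individual alarms into one trouble ticket per suspected sensor, assuming some upstream root-cause step has already attributed each alarm to a sensor.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TroubleTicket:
    sensor_id: str
    first_alarm: float   # timestamp of the earliest alarm blamed on this sensor
    last_alarm: float    # timestamp of the latest alarm
    n_alarms: int

def alarms_to_tickets(alarms, min_alarms=10):
    """Collapse (timestamp, sensor_id) alarms into one ticket per sensor.

    `sensor_id` is whatever the root-cause analysis blamed for each alarm;
    sensors with fewer than `min_alarms` alarms are ignored as likely noise."""
    by_sensor = defaultdict(list)
    for timestamp, sensor_id in alarms:
        by_sensor[sensor_id].append(timestamp)
    return [
        TroubleTicket(sensor_id, min(times), max(times), len(times))
        for sensor_id, times in by_sensor.items()
        if len(times) >= min_alarms
    ]
```

A real system would presumably merge intervals, attach station metadata, and feed an assignment workflow, but the point is simply that thousands of raw alarms can collapse into a handful of actionable tickets.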
Well, we're well on the way to building a system where the weather network manager logs in every day and gets a report of the new failures we found in the last 24 hours. Then he can very easily create trouble tickets and assign them to the field engineers, who will then be scheduled to go out and do the repairs. In our dreams, that would be integrated with an inventory system, and so on. But that's how they're set up.

>> But don't you have to supply a lot more knowledge than usual about how sensors fail — essentially error models? In process control, as you're aware, experts have very rich models about how sensors do weird things, and they filter them in certain ways. [inaudible] >> In a mature system, presumably, we ultimately find these recurring anomalies, bring them into the lab, and say, "Okay, what's going on here?" We've already had this happen. We didn't discover it with our code, but there was a manufacturing defect or design defect in the station, and so all the generation-two stations are having to be replaced. So over time, you will accumulate a library of failure models with very specialized detectors for them. But we want this anomaly infrastructure as a fallback, because there's always something new. We haven't had any of our stations attacked by an elephant yet, but I'm sure it'll happen. In Oklahoma, they have unbelievable stories about cows getting into the things. What is the symptom of a weather station where some ants have decided to lay eggs all over the circuit board? It's hard to have a disease model for that.

>> I was just wondering about the online case of anomaly detection. You seem to assume here in the problem statement that the nominal distribution is stable over time. >> Absolutely, yes. >> What happens if the nominal distribution starts drifting over time, because the data-generating process changes or whatever other issues arise? >> Then somehow we're going to have to track that distribution. Yeah, we haven't confronted that case here. Although in the weather data case, it's not stationary, but it is very predictable — there's an annual cycle, a daily cycle, lots of periodic components that you can model with traditional time-series techniques and pull out of the signal. Then, in your dreams, what's left is uncorrelated white noise. >> [inaudible]. >> Maybe that's part of the answer, but we actually spent a whole year trying to do good feature engineering and decided it wasn't working for us, because it's just too much work to do all that modeling. When you make modeling errors, that introduces artifacts, and then we start making mistakes because of those. So we're looking now at nearby weather stations that should be experiencing the same disturbances — the same annual and daily cycles — and comparing them: how well can we predict one using the other, what's their joint distribution, and can we use that to do the anomaly detection?

>> So you retrain the models, or is there some drift between predictions? >> Yes. Actually, we need to retrain the models, really as a sort of sliding-window thing, because again, the nature of the correlations changes. In wintertime, for instance — precipitation is our most important variable, our most problematic sensor, and it has the smallest spatial scale in terms of spatial smoothness — if we think about the Pacific Northwest, it's very smooth: one very large storm.
It affects the whole Pacific Northwest, so if it's raining in one place, it should be raining at another station, and we can use that correlation. But in the summertime, or in the late spring when we get these showers, the stations become extremely uncorrelated, so then we need to, in a sense, switch models and use a model that corresponds to a showery day with lots of little cells. Really, I think we need to be tracking that, because it really is happening day by day with the weather: the first day of a rain event might have this large scale, and then by the third day it's broken up into these little uncorrelated things. So I think we actually need to track that kind of larger spatial phenomenon. Speaking of larger spatial phenomena, another thing is that if you have a spatially clustered set of alarms all ringing at once, it's probably a real weather event — an unusual weather event — and not a bunch of broken stations. That knowledge isn't anywhere in our system yet, but the meteorologists definitely know it. Of course, there are other data sources, like satellites and radar, that we can use as additional sources of data for this. >> Okay. With that, we probably should stop. Thanks very much. >> Thank you.
Info
Channel: Microsoft Research
Views: 61,878
Keywords: microsoft research
Id: 12Xq9OLdQwQ
Length: 86min 56sec (5216 seconds)
Published: Thu Apr 05 2018