Anomaly Detection 101 - Elizabeth (Betsy) Nichols Ph.D.

Captions
Okay, well, I hope your enthusiasm lasts. We'll see. So, I am Betsy. I work at a company called Netuitive. We're a 12-year-old company, and the very first product the company introduced was effectively an anomaly detection engine for managing data centers. So this talk really comes from two perspectives: the first is, obviously, how do we apply anomaly detection to DevOps, and the second is really lessons learned.

So first, a word from our sponsors, who are our customers. While Netuitive brought software and some math to the table, our customers bring deep domain expertise, a sense of adventure, and a willingness to collaborate. As a result of the collaboration we've learned a lot, and that's really a lot of the content of this talk. So the stories you are about to hear are true; however, the names have been changed to protect the innocent.

First of all, let's start with a definition. Anomaly detection is basically finding patterns that do not conform to expected behavior. That's a fairly abstract definition, but it's useful. In this first picture, probably most of you can see the anomaly. The clue here is color, right?

All right, let's move on to another example. Here we really have two clues, right? We have color, which is a sort of static property, which is interesting to DevOps but probably not as interesting as more dynamic properties. And in this particular case the dynamic property is the direction in which the little fishy is swimming.

Here's another example. Presumably you can all recognize this anomaly: the little red arrow is next to one of the dots, and the differentiating characteristic here is really distance. The nearest neighbor of that dot with the red arrow is much further away than the nearest neighbor of any of the other ones. So anomaly detection is really all about looking at clues and trying to compare what the attributes are across all the possible places that could be
having an anomaly.

Here's another case, a before-and-after kind of case, where you're looking for differences. It's a little harder to see, intentionally. There's one here. I don't know how many of you saw that one. Pardon, did somebody say something? Yeah, all right. But what's interesting about this particular picture is that it's got lots of anomalies, and that's what happens a lot in DevOps: anomalies don't occur as singletons, they occur in groups.

Okay, last example. This is a more DevOps-esque example. Can you see the anomalies here? This is actual data from a credit card clearing company, and what it's showing is the count of successfully authenticated credit card purchases. The bands are bands of normalcy, which we'll talk about later. What's interesting here is that there's a pattern, right? The number of credit card transactions that occur at night is basically zero, then they go up during the day to a sort of peak, and then they come down in a kind of inverted parabola. You can see the anomalies, maybe. These are tough ones to catch, maybe not by the human eye, but they're tough to catch via software and via algorithms. We'll talk more about that.

So anomalies come in a lot of shapes and sizes. They're based on a lot of different types of criteria, and so they're complicated. It sort of takes a village to create an effective implementation of anomaly detection. This particular diagram is trying to capture that. On the one hand you have research that is being performed primarily in academia, maybe in the labs of big companies, and the research turns out models and algorithms that are mathematically sound; however, for the most part they're generic. Then at the other end of the spectrum you have the people in the trenches fighting fires, who are trying to take advantage of anomaly detection in order to do better, to improve the business.
And finally, very importantly and often overlooked, you have entities that are putting this all together: they're matching the models and algorithms with the specific detection requirements that exist in the trenches, and they're wrapping it all into a system where there's ingestion of data, applying the various algorithms, visualizing the results, raising alarms, whatever they're doing. What's interesting about DevOps is that quite often now you have companies that are actually sitting in both places. A typical company that's offering a cloud-based service is not only managing that service, building a system to do that using anomalies, but they're also living in the trenches as well, you might say eating their own dog food. That's pretty interesting, and it's good news: it's making for better systems.

So just very quickly, these are some of the research areas; we'll talk about the top two for the most part. The application areas here, there are lots of them; this is just a few. The good news is that anomaly detection has been deployed in these types of applications with huge success; there are companies that really depend on it. The bad news is that if you try to search around and find how people have been using anomaly detection and fancier algorithms in DevOps, it's a thin set of information. But that's going to change.

So the rest of the talk is about applying anomaly detection to DevOps. Let's start from the top and look at what the DevOps requirements are. I've divided them into three areas. We'll first talk about what data is available to detect the anomalies, then what types of anomalies you want to detect, and finally what kinds of techniques you can use to detect the anomalies, and where they succeed and where they fail.

First, data. Three areas again; the rule of threes, I guess, rules here. We'll talk about metrics, attributes, and collections. Very briefly, metrics are
probably the most fundamental data structure in DevOps. A time series is essentially a sequentially ordered set of values or observations that all have timestamps, and they're typically visualized like this. The interesting thing about them is that they can be very long. If you're taking, you know, sub-minute data for a year, that's a long time series, and yet you may need to do that if you want to detect anomalies year over year over year. In the case of the credit card company, they care about Black Friday, and that happens once a year, so you may have to keep years of data. The second thing about metrics that's really interesting is that they can be really plentiful. It's not unusual to see a data center that's not huge that has perhaps a hundred thousand metrics, maybe two hundred thousand, maybe a million.

On top of this, what's very important for anomaly detection is to lay some sort of context over the time series. As you can see in this picture, context is very important. So what sort of forms does this context take? Well, static attributes are sort of familiar; no new news here. In particular, for a time series you might want to specify whether it's represented in bytes, or whether it's one of my KPIs.

Another type of context is what's called, in the BI or business intelligence world, slowly changing dimensions. What's interesting about these is that they don't change as fast as a time series, which may change every second or every minute, and they're not constant either; they're somewhere in between. This is really important in elastic computing, in a software-defined data center, where big things change: the number of hosts in a cluster, say, or the number of replicated microservices that exist in a particular architecture. These are all slowly changing dimensions, and they have an impact on when something is an anomaly and when it's not.

The last type of context is collections, and this is an example.
It's an entity, as it would be known in the relational database world; in our world we call it an element. An element can be virtual or it can be physical. This is an example of an element that's a method call in a Java application, and it has a number of very dynamic attributes, represented as time series, that describe its state as it moves through time.

Another type of collection is driven by the notion of a relationship. Relationships can be via containment, such as members in a load-balanced cluster, or they can be dictated by membership in a workflow. One such workflow would be this example here. This is a collection that's showing interactions of various things occurring in a workflow. Here it's at a business level, a very high level. You have the four big functional areas of an organization, and down here in the lower right you have operations. Operations has servers and things like that that it's managing, and that affects latency. Latency affects maybe the page load of a website, then customers can be happy or unhappy as a result of that, and eventually you get profit and revenues. This is another relationship, and anomalies along that path are obviously important to be able to detect.

Okay, that takes care of data. So we have these kinds of data available to us to detect anomalies. Now let's talk about what types of anomalies we'd like to detect. There are three (what a coincidence): point, contextual, and collective. So, one at a time.

A point anomaly is an observation that's unusual when compared to all other observations that are available to you. A good example: a simple time series. You look at all the values that you've seen of this time series across time and, guess what, one stands out as the outlier. That's a pretty easy anomaly to detect. Well, I just lied; not so much. A lot of open source agents are configured not to report interval data but to report cumulative data, and all I did was take the data that was on a previous
slide and map it to a cumulative graph, and now this anomaly is cleverly hidden. Even if you zoom in, it's not that obvious, really. So even simple point anomalies have some nuances.

The next type of anomaly is called a contextual anomaly. This is an observation that's unusual within a certain context but not in other contexts. Here's an example: there's a before and then an after, and in between there's a sudden drop that looks like an anomaly, and various different types of algorithms can be used to detect such an anomaly. But in another context, where you have a sort of seasonal pattern, where you have low at night and high during the day, this is not an anomaly at all; actually, a sudden drop is very common. Here's another type of contextual anomaly: the little red dot. This is a case where the value is okay at night but not okay during the day. So there's a contextual aspect to anomaly detection that's very important.

The last type is a collective anomaly. That occurs when a collection of related data instances is anomalous with respect to the entire set. So what you're thinking about is a set of things that are changing along in sort of lockstep, and then suddenly one of them goes haywire. Let's look at one of those. Here's an example where we're looking at a very small collection of two metrics. The metrics are tracking requests per second over time and revenue per second over time, and each of these X's on this graph is a pair: a value of the number of requests and the corresponding value of revenue at the same time. If you plot these all, this is called a scatter plot. And when you plot these all and you use a model like linear regression, which maybe you remember from college or high school or something, then basically you can use this
information, which is information about the interaction of these two things, to predict where you would expect the other one to be, given the value of the first one. So you've got a prior, so to speak, in Bayesian terminology, which is the requests per second, and then you have a dependent variable, which is revenue per second. Let's assume this is a website and you're making purchases on it. Then you can reasonably expect revenue to fall within a band, given some level of confidence which you can specify; the level of confidence will influence the width of the expected band, the band of normalcy, you might say. This gives you a way to recognize anomalies, because if you see something like this, where the requests are again in the same area, the same place on the x-axis, and yet revenues are odd, then that's a good candidate for being an anomaly.

So those are the three types of anomalies. I got these types from academia; they're always full of taxonomies and all. But it's useful, because different algorithms can be used for the different types. The key differentiator between these types is really the amount of context that is needed in order to detect the anomaly.

Okay, moving right along, we go to techniques. There are two general classes of techniques that I'll cover today: statistical and deterministic. Deterministic first. Deterministic is, in my opinion, the unfortunate state of the art in DevOps. We make use of deterministic anomaly detection techniques, and they get us partway there, but not all the way there.

So let's look at some examples. First is dashboards. That's a critical feature in any product: you want to be able to visualize the data, and the human eye is great at detecting patterns. So something like this is really successful, as you can see from the little thumbs-up in there, right? And that will be a theme throughout the rest of this. So
basically you've got a dashboard, it's beautiful, you can look at it, and maybe you can see things from that. Anybody that I've worked with in DevOps relies on these things. The problem is that they don't really scale. The sad truth is that as the number of metrics that you want to graph increases, the likelihood you're going to miss something increases as well.

Okay, so that's dashboards. Let's look at static thresholds. This is a great metric for a static threshold, right? You can just put the static threshold in there, and if it goes over that threshold, boom, you've got an anomaly, and maybe you raise an alarm. The good thing about static thresholds is that they're easy to set. Sort of; I may be lying there. If you have to set them for a lot of different metrics, it gets to be a big manual problem, so there's a scalability problem as well. And if you set them wrong, you can have dire consequences: you can get floods of alarms and all that sort of thing. But they're useful, they have their place, and in combination with other things they can really be helpful, as we'll see.

For a static threshold, a sudden change is a real problem to detect. You could conclude that lower is bad, in which case in the first half of this graph you'd be getting a flood of alarms. This is what's called the crying wolf effect, where you're being plagued by false statements: everything's fine, but your anomaly detection algorithm is saying this sucks. Same thing if the lower one is taken as normal and the upper one is viewed as bad: you're going to get alarms on the other half.

Okay, moving right along, here's another metric. This is the monotonic metric that I showed you before, and the threshold's just not going to work here. So what do you do about that? Well, we'll talk about some strategies later, but right now basically
what you end up with is a head-in-the-sand effect, where you're missing a real anomaly. It looks like your head's somewhere.

Okay, the last deterministic technique I'll talk about is transformations. These are usually simple functions that are used to convert the incoming data to something that's more usable. I'll look at one in particular: there's a delta function, which is essentially taking the difference between two successive observations. If you apply that to the monotonically increasing metric, then you get something where the anomaly is pretty obvious. The problem is that sometimes that doesn't work that well, because the changes are short-lived and spurious, and that gives you too much noise.

Another case of applying these functions is that you can map a metric to a frequency histogram. What a frequency histogram does is count the number of times you've seen each observation, or each interval of observations, and what you get is a vertical set of bars. If you look for the short bars, those are the rare observations. So this is a way of setting thresholds without having to actually pick the threshold. It's sort of interesting. Unfortunately, it doesn't work with the type of contextual anomaly that I showed you before: that's a perfectly valid value, and so it's hidden in a high bar, right? And again you get the head-in-the-sand effect.

Okay, so let's move on to statistical anomaly detection techniques. This is where I think the state of the art is heading in DevOps, so you're going to see a lot more use of this type of technique in the various tools. We'll talk about two techniques: correlation and machine learning. What's true of both of these is that they share the same common assumption, that past is prologue, right? Another way of
saying that, in a sort of data science vernacular, is that they're probabilistically stable.

Okay, looking first at correlation models. You can see some scatter plots here, along with the value of a correlation statistic that you can compute. It's called the Pearson product-moment correlation; we'll show you the equation later. The idea is that these scatter plots can be mapped to a number, and the number ranges between minus 1 and 1. Minus 1 and 1 are sort of the perfect scores. A score of 1 means that as one metric moves, the other one moves pretty much the same way: it moves up, and it moves the same amount. In the case of minus 1, when one goes up the other goes down, but always pretty consistently the same amount.

So let's look at the equation. I decided to put it up for two reasons, I guess. The first and most important reason is that it's actually computationally a very easy calculation. You can easily do it in real time; with just a little finesse you can manage any state requirements, observation over observation. So it's a really easy statistic to implement, and it can be useful. The second reason I put it up is that this is a presentation about anomaly detection, and we should probably have some Greek letters in it.

Okay, so let's look at an example of applying this anomaly detection technique. Here's a case where you have two metrics and they're almost perfectly correlated. So it's successful at seeing that these are almost perfectly correlated, and if one goes haywire, like here, then the correlation coefficient actually drops quite a bit. So it's a decent detection algorithm. But like all things, there are problems, and here's a problem. What I have is a sort of night-day seasonal pattern, but what happened is that in a day right here, suddenly both metrics
simultaneously plunged. What's happening is that you're going to miss that, because the correlation is 0.92. It's too high; it's not showing that anything's really changed, right? Now, the reason that this particular algorithm failed is that it's looking at the wrong thing, and that's one reason anomaly detection can go wrong more than it should. Look at it this way: what's important in this case is not the correlation of the behaviors of these two metrics; what's important is their values. So this is a case where maybe the threshold is almost better, which is a little counterintuitive.

Okay, let's look at the last technique, which is statistical machine learning. This is becoming increasingly popular, I think, in DevOps. Let's look at how it works. There are always phases. There's a learning phase, where essentially you throw in test data, or you have a period where real data is being thrown at it. It's a black box that consists of very complicated math and very complicated algorithms that are, from a software development perspective and a systems engineering perspective, quite difficult to implement, especially if you're running in real time. But you may not run in real time; you may run offline, in which case there's a little bit more latitude. The idea is that the test metrics come in and parameters come out, which identify not only what sort of characteristics you're looking for but what settings they should have so that you can detect anomalies. Then you move into the detection phase, where you have both the metrics coming in and the parameters coming in, and the anomalies get detected. And if you're really sophisticated, you can do this inline. A DevOps environment being a fairly dynamic system that changes over time, you can actually do inline corrective analysis and change the parameters to reflect the new normal, that kind of thing, right? So let's just see how this would work.
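
The two phases described above (metrics in, parameters out, then metrics plus parameters in, anomalies out) can be sketched in a few lines of Python. This is only an illustration, not the speaker's or Netuitive's algorithm: the function names `learn` and `detect`, the per-hour-of-day mean and standard deviation as the learned "parameters", the `k` and `min_sigma` settings, and the training numbers are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn(samples):
    """Learning phase: turn (hour_of_day, value) pairs into parameters.

    Returns {hour: (mean, standard deviation)}, one band per hour of day.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def detect(params, samples, k=3.0, min_sigma=1e-9):
    """Detection phase: flag values more than k standard deviations from
    the learned mean for their hour. Hours never seen in learning are skipped."""
    anomalies = []
    for hour, value in samples:
        if hour not in params:
            continue
        mu, sigma = params[hour]
        if abs(value - mu) > k * max(sigma, min_sigma):
            anomalies.append((hour, value))
    return anomalies


# Made-up training data: near-zero traffic at 03:00, about 100 at 14:00.
training = [(3, 0), (3, 1), (3, 0), (3, 1),
            (14, 100), (14, 110), (14, 90), (14, 100)]
params = learn(training)

# A value of 1 is normal at night but anomalous during the day.
print(detect(params, [(3, 1), (14, 1), (14, 105)]))  # [(14, 1)]
```

Keying the parameters by hour of day is one simple way to give the detector the night-versus-day context that a single static threshold lacks; a real system would likely learn far richer parameters.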
Here again is a picture of the credit card clearing metric, that's how many cards were cleared. This is a really nice visualization of some really heavy math, so that's good; that's a success point. What it's capturing is the raw values, and it's also capturing what we term bands of normalcy, the idea being that you expect the metric to have a value within a band, right?

The first band is the green one, and that one's based on a univariate model. What that means is that the model is looking at just one metric, and it's using the information coming from that single metric to decide what its standard deviation is over time; the width of the band is then determined by the standard deviation. The much narrower band is the purple band, and that band is created via a multivariate model. In this case we're using information from n different metrics, where n could be a very large number, and we're determining the standard deviation of one of the metrics knowing what the values are for all the other metrics. And you do that for all n metrics. You'll see that the two bands are quite different in size, and the reason is that the purple band has a lot more information to work with. Not only does it have the metric's own data, but it has all of the other metrics' data that are participating in that model, and over and above that it has their interactions. So a multivariate model has a chance to be much, much more accurate.

Am I running out of time? Oh, okay. Learning at work: you can see the bands converge. Isn't that nice? Here's a case where we actually detect the anomalies; they're all outside of the band. Here's a case where we're detecting a new normal. Let's say we have an EC2 server and we've changed its type. We decided to save money, so we had a huge server and then we downsized it. This could be CPU utilization, and
what we're doing is adjusting to the new CPU utilization, which will be higher because we have a smaller guy. What happens in this case is that when the transition occurs you get anomalies, but then it settles out.

Okay, here's the [inaudible]. Okay, I only took 28 minutes? Okay, 21. Don't look. You only get 25 minutes? All right, I'm sorry; then that's it. I'm going to give you one last thing, if that's okay. You can add context; it works. But here's the moral to the story. Okay.
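
The "new normal" behavior described above (anomalies at the transition, then the band settles at the new level) can be illustrated with a univariate band that keeps learning as data arrives. The sketch below is an assumption-laden stand-in, not the model shown in the talk: it uses an exponentially weighted mean and variance, and the class name `AdaptiveBand`, the parameters `alpha`, `k`, and `warmup`, and the CPU utilization numbers are all hypothetical.

```python
import math


class AdaptiveBand:
    """Univariate band of normalcy: mean +/- k standard deviations, where
    mean and variance are exponentially weighted so the band can drift."""

    def __init__(self, alpha=0.2, k=3.0, warmup=5):
        self.alpha = alpha    # weight given to each new observation
        self.k = k            # band half-width in standard deviations
        self.warmup = warmup  # observations to absorb before flagging
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Return True if x is outside the current band, then learn from x."""
        self.n += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        diff = x - self.mean
        outside = (self.n > self.warmup
                   and abs(diff) > self.k * math.sqrt(self.var))
        # Incremental exponentially weighted mean and variance update.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return outside


# Made-up CPU utilization: about 21% on the large server, then about 61%
# after downsizing to a smaller instance type.
series = [20, 22] * 10 + [60, 62] * 20
band = AdaptiveBand()
flags = [i for i, x in enumerate(series) if band.update(x)]
print(flags)
```

With these made-up numbers the first sample after the resize falls outside the band; the band then widens, re-centers on the higher utilization, and later samples stop being flagged. A smaller `alpha` would make the band adapt more slowly and raise alarms for longer.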
Info
Channel: DevOpsDays Silicon Valley
Views: 30,879
Keywords: anomaly detection, devops, devopsdays, Silicon Valley (Region)
Id: 5vrY4RbeWkM
Length: 29min 37sec (1777 seconds)
Published: Sat Nov 14 2015