Real-Time Sensor Anomaly Detection with Scikit-Learn and the Azure Stack - Ari Bornstein

Captions
So today I'm going to be talking about real-time sensor anomaly detection with Azure and the scikit-learn stack. There's going to be some data science involved, some pre-processing, some machine learning, but most of the emphasis is really going to be on how people like you, who are very deep in the data-science world, take your models and put them into production without having to manage all the infrastructure and headaches yourself.

Before I get started, a little bit about myself. My name is Ari Bornstein; as I was introduced, I work for Microsoft on an open-source team. We engage with different startups and companies in the Israeli ecosystem, partner with them on interesting projects, and then open-source the learnings and share them with development communities and developers like you. Today I'm going to be talking about a recent engagement related to municipal wastewater management.

So before I get started, I'm going to give a lecture on municipal wastewater management. Yeah, I know — that's probably your reaction when you think of municipal wastewater management, and I don't blame you: it's boring. So why do we really care? We care when it all goes wrong. When leaks happen, when pipes burst, when sewage is in the streets, it can cost municipalities millions if not billions of dollars in damage, and they want to make sure in advance that they can build robust infrastructure that fits the risks associated with flooding and overflow in wastewater management.

Believe it or not, most municipalities have a general idea of how much water is going to enter their system at a given time. However, the two leading causes of overflow are the concepts of inflow and infiltration. Inflow is when somebody — and this happens more often than you'd think in Israel — runs a little pipe from their building into the sewer line and illegally introduces water that's not supposed to be in the system. Infiltration, which happens a lot more in countries like the United States where you have older infrastructure, is when groundwater seeps through leaks and enters the water-management system in unexpected ways, and that's very, very hard to detect. What's really challenging about both inflow and infiltration is that for municipalities — especially ones that span wide verticals, from industry to residential areas to you name it — it can be very hard to predict where it's happening, when it's going to happen, and the effect it's going to have on the system.

Now, there are some traditional rules of thumb for gauging this that were put together in the 1970s. Essentially, you put what are called flow meters and rain gauges along the sewer pipes across a city — one about every two hundred feet, or whatever that is in metres; sorry, I'm still adjusting to the metric system, but you can imagine one every so many metres. Then you measure during the dry season, you measure during the rainy season, you take the difference, and you assume that's a good estimate. The reality is that it's not that great an estimate — it's very high level — and there are two traditional challenges with it.
The first is that when these systems were built, the traditional sensors — the rain gauges and meters — had to be read manually. You have to send a utility worker to go and check all the sensors, write all the readings into a little spreadsheet, and then every nine months, ten months, a year, run an analysis report and try to figure out where to do infrastructure work over the next year. That was challenge number one. Challenge number two is that the traditional algorithms for predicting infiltration and inflow were developed based on what are called life-cycle assessment studies in the 1970s. What did that entail? Researchers would come maybe once a week, or once every couple of days, stand in the sewer with a yardstick and take discrete measurements. Then they fit those points, collected over a period of maybe five or six years, with standard linear regression or some of the general statistical algorithms of the 1970s, and generalized that curve for the rest of the world.

You can imagine that with big data, in the world we live in, this has started to change. IoT has enabled real-time monitoring, which enables more accurate predictions — but the traditional models were not a good fit for that. I'm not going to speak about actually building more accurate models here, because there are companies that do that. For instance, Microsoft recently partnered with a company called Carl Data Solutions. They essentially upgrade all these flow sensors across municipalities and then aggregate the data into a management tool, which can then be used for predictive analysis. But they recently noticed they had an issue. Many of these sensors, especially because they're submerged in water, tend to fail — and when these sensors fail they don't just go offline, as you would imagine; they keep sending data back, and that data becomes really hard to distinguish from the outliers you're actually trying to detect with an outlier-detection model for inflow and infiltration. When they were building their standard algorithms they would take a standard outlier approach — anomaly-detection algorithms that flag outliers — but now they have two different types of outliers, one caused by sensor error and one caused by what they're actually trying to measure, and they need a way to differentiate them. That's where we come in, and that's what I'm talking about today: how we solved this — I'm going to show a very basic model, not the actual production model — and how we actually put it into production.

So, a very high-level solution architecture. We start with Azure Machine Learning. I don't know if anyone's familiar with this, but there are two layers to it. At the top layer there's ML Studio, which is not necessarily made for in-depth data science but for experimentation — I'll show a quick demo of that in a bit. What it really enables you to do is take your models and put them into production behind a production endpoint without having to manage all of that infrastructure yourself, and there's a marketplace that allows you to monetize your APIs as well. What's also really nice about it is that it not only integrates with other Azure services, it integrates directly with Jupyter notebooks. There's a Jupyter notebook service, so you don't have to maintain your own notebook servers, and it's all in one centralized location that everybody can prototype from.
At the next layer we use something called Event Hubs. Event Hubs is similar to — maybe you're familiar with Apache Storm or Apache Kafka for stream processing — it's similar to those, in that it lets you pull in data from millions of devices in a scalable way. Then we use what's called Stream Analytics to process this data in real time; we can do real-time processing and real-time classification if we want, across all of our devices, on the data as it comes in. And then, where it gets interesting, we have a visualization suite through Power BI that lets us see everything in real time.

Now, I know what some of you are probably thinking: "OK, this is great, a lot of marketing slides, a cool solution — enough slides, let's actually get into some code." I'm a developer too, I promise. Before I get started, I'll just point to one resource: all of this engagement is documented on our GitHub organization, CatalystCode. As I said earlier, everything we code in these engagements we share with the community, so it's all there, and we also have a blog, Real Life Code, that you can look at if you want to follow up afterwards — that's a great resource. In addition, Microsoft has a standard API for anomaly detection; I'm just pointing it out for documentation purposes. So let's get started.

This right here is Azure Machine Learning Studio. Again, as I said, it's a nice place where, if I'm new to data science, I can set up a project, upload different datasets, and run basic experiments. I have a very basic one here: it takes some of the real data I have from a real pipe. I can visualize this data — it's a little blurry, but you can see we have a lot of datetimes and a lot of flow values — and then we can very simply run a series of modules like Time Series Anomaly Detection and do the basic outlier detection. That's similar to the approach they were using before, which again was problematic for them because they couldn't differentiate between the different outliers: what was caused by a sensor and what was caused by actual inflow and infiltration. So that's one way of prototyping, but since we're all Python people here, I'm not going to bore you with that.

Instead, we also provide what's called the Azure Notebooks service. Let me resize this — my device is a 4K device, so I don't know how well it scales to this screen. Is this easier to see? All right, perfect. What's really nice about this is that instead of having to spin up my own Jupyter server, I have one centralized place that all my data scientists can prototype from. To get started I import a bunch of things I need: from azureml, from scikit-learn, and a couple of other libraries like numpy. I've been told to preface this: before I gave this presentation I had a practice run-through with Brett, one of the core contributors to Python, and the first thing he noticed — which some of you might notice too — is that this uses Python 2.7. He asked why. I told him it's because I don't like parentheses around my quotes in print statements, which he did not like; but the real reason is the API we use for production — Python 3.6 support is coming, but right now the production deployment targets 2.7. So I just wanted to preface that; I know there's a lot of friction in the community around people saying I should move on.

All right. To get started, the first thing I do is import a workspace. Essentially — you saw earlier that I uploaded some data in the portal — this gives me access to that portal. Then I have a couple of keys and constants here; again, all of this is going to be removed at the end of the day, so you can take pictures if you want, but they won't be very useful to you. Then we pull in our data: using the workspace we read two channels, one for the raw data and one for the edited data, turn them into pandas DataFrames, and set the index to the time value so that the data behaves as a time series. Now you can see what this data looks like: I have my velocity and my velocity_edited columns.
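A minimal sketch of that loading step, assuming the classic azureml client library that Azure Notebooks shipped at the time; the workspace ID, token, dataset names, and column names here are placeholders, not the ones shown in the talk:

```python
import pandas as pd
from azureml import Workspace  # classic Azure ML Studio client: pip install azureml

# Placeholder credentials -- the real keys shown in the talk were revoked afterwards.
ws = Workspace(workspace_id="<workspace-id>",
               authorization_token="<auth-token>",
               endpoint="https://studioapi.azureml.net")

# Hypothetical dataset names: one raw sensor channel, one analyst-curated channel.
raw = ws.datasets["velocity_raw.csv"].to_dataframe()
edited = ws.datasets["velocity_edited.csv"].to_dataframe()

# Index both frames on the timestamp so they behave as time series.
for df in (raw, edited):
    df["time"] = pd.to_datetime(df["time"])
    df.set_index("time", inplace=True)

# One frame with the raw and the curated velocity side by side.
frame = raw[["velocity"]].join(
    edited[["velocity"]].rename(columns={"velocity": "velocity_edited"}))
print(frame.head())
```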
So what is velocity_edited? As I said before, it's really hard for these people to differentiate between anomalies caused by inflow and infiltration and anomalies caused by sensor error. How do they do it today? They hire analysts, who have one of the worst jobs in the world: they manually sift through every channel and remove — or modify, tinker with — all the data where they know there's a sensor error, and try to fit some sort of distribution to it. So based on that curated channel, we know where the errors are.

Now we can take this data and do two things. First, we generate n-windows. The reason is that when we do binary classification on time-series data, the classification tends to be discrete: it doesn't assume the points come one after another, and it doesn't take the history of the data into account. One way around that is to take, for each point, the values from the last four time increments and use them as input features. So for each input row we have the current value plus the four that precede it, and when we make a prediction we're looking at the last four values to classify the next one. Then we tag anomalies — again, really simple: we take the difference between the curated data and the raw data, and if the change is more than a very small amount, we know it's an anomaly. You can visualize what that looks like: we can see whether a reading is an anomaly, how big the deviation is, the last four readings, and the current reading — and we can count that there are 495 anomalies in the data.
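A minimal sketch of those two steps — the lag-window features and the anomaly tags — assuming the `frame` built above with velocity and velocity_edited columns; the window of four follows the talk, but the threshold value is an assumption:

```python
WINDOW = 4        # use the four previous readings as features
THRESHOLD = 0.01  # "a very small amount" -- the real cutoff isn't given in the talk

def make_windows(df, window=WINDOW, threshold=THRESHOLD):
    """Turn the time-indexed frame into a supervised-learning table."""
    out = df.copy()
    # Lag features: the velocity at t-1 .. t-window.
    for lag in range(1, window + 1):
        out["velocity_t-{}".format(lag)] = out["velocity"].shift(lag)
    # Anomaly label: the analysts edited the value, so raw and curated differ.
    out["anomaly"] = (out["velocity"] - out["velocity_edited"]).abs() > threshold
    # Drop the first rows, which have no full four-reading history.
    return out.dropna()

windowed = make_windows(frame)
print(windowed["anomaly"].sum(), "anomalies tagged")  # ~495 in the talk's sample
```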
The next thing I would say is that we should visualize the data. In my opinion, one of the things that often gets overlooked when you're doing data science — though I'm sure not by the people in this room — is actually playing with your data before you build the model. I work with a lot of customers and partners across many different countries, and a lot of people take their data and think mainly about the algorithms: they try boosted trees, or now, with the deep-learning craze, they throw LSTMs at it or try some crazy convolution. But they don't actually sit down to see what the trends are, play with the data, visualize it, and come up with a solid approach. I find the algorithms are critical but secondary, because it really depends on the amount of ambiguity in the system: if you can phrase your problem — especially in a classification sense — as something that's relatively linearly separable, then you've done most of the work for the model, which makes things much, much easier.

So, to do that, we're going to visualize the data: we'll plot some discrete points from one given day and look at some sample patterns. The first thing I'll do is plot a sample daily pattern with no anomalies, to check that this data makes sense. You'll notice some interesting trends right away. The data starts at midnight, and you can see that overnight the flow goes down, and then slowly, around six o'clock, it starts going up. Why might that be? Well, at night people are sleeping and the factories are shut down, so there's less water coming into the wastewater-management system; as the city wakes up, early construction pushes water consumption up — along with people showering and heading out — and then it regulates for the rest of the day. If we expanded this, which I won't do right now, you'd see that this is pretty much a standard cycle that repeats most of the time.

The next thing we'll do is look at what happens during anomalies, and this is where it gets much more interesting, because there are a couple of sensor assumptions you'd make if you didn't visualize this data. Number one: when you're dealing with sensors — and this was brought up to me yesterday when I was speaking with one of the speakers — shouldn't it be really easy? If you see the flow go from a really high value down to zero, or 0.2, over a span of milliseconds, that should be indicative of a sensor anomaly. While that's true, and you can actually see it sometimes, the readings are often more gradual — and sometimes you have weather events like flash floods that, surprisingly, produce a signal that matches the same pattern, just not over a period of milliseconds. What's interesting is that it's still linearly separable, but it challenges some of the assumptions we would have made before.

The next thing we're going to do is compare a couple of different models — I'm going to take three approaches — and first we define some metrics. We'll use scikit-learn's metrics: the confusion matrix and the classification report, which gives us the F1 score, precision, recall — all the standard classification metrics.
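A small evaluation helper along those lines; the function name and the class labels are illustrative, not from the notebook:

```python
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(y_true, y_pred, label=""):
    """Print the confusion matrix and per-class precision/recall/F1.

    Reading the per-class rows, rather than only the averaged totals,
    is what exposes the bias discussed next: a model can post a high
    overall F-score while still mislabelling a lot of regular flow.
    """
    print("=== {} ===".format(label))
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred,
                                target_names=["regular", "anomaly"]))
```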
The first approach is the naive one: we use Microsoft's anomaly detection API just to find the outliers. You could use any service here — the Twitter anomaly-detection package, for example; I'm not here to sell Microsoft, but because I work there it's free for me — so we used the anomaly detection package and built a very simple model off it to see how it relates to the data we have. You can see here that we take our pre-processed values, send them to the service as JSON, and get our results back, and it's actually pretty good. But you'll notice something interesting, and this is where metrics are really, really important. If I were an analyst, I'd look at the bottom, see the averaged totals, and say: I have a model with a 97% F-score — we're good, let's ship. But if you actually look at what this model is doing, it's pretty good at recognizing where the anomalies are, yet it's biased: it takes a lot of regular flow that isn't anomalous and calls it anomalous, and in our case those are probably the inflow and infiltration events that we don't want removed. So that method worked, but not really how we wanted it to, and it's important to evaluate your metrics carefully.

The second model is very, very simple: we take a random forest and use it as a binary classifier. You'll see it probably overfits, but we get much, much better results in terms of the distribution between tagged anomalies and regular velocity. One of the challenges here, though — and I can see from some of the looks in the room that you're right — is that this probably won't scale very well, because it's fit to the one sample of data we have. (How am I doing on time? Perfect, I think we'll hit that mark to the letter.)

The other approach we tried, just for high-level experimentation, is a hybrid classification approach. In this case we take all the outliers from the outlier-detection model, build a model on just those few discrete points, and hope that it linearly separates the sensor errors from the real anomalies; then we scale it to the rest of our data to see whether it actually generalizes. This one works much, much better. Again, it's probably overfitting on this dataset — this is more of a toy sample compared to the actual production data Carl Data is using — but you'll see that the distribution of anomalies versus regular velocity across the same data holds up a lot better. So that's our model.
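A sketch of what those second and third approaches might look like in scikit-learn, reusing the windowed frame and the evaluate helper from the earlier sketches. The hybrid step here uses IsolationForest as a stand-in for the first-pass outlier detector, since the talk used an external anomaly-detection API for that part; the hyperparameters are illustrative.

```python
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

feature_cols = [c for c in windowed.columns
                if c.startswith("velocity") and c != "velocity_edited"]
X = windowed[feature_cols].values
y = windowed["anomaly"].values

# Keep the time ordering when splitting so we don't train on the future.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)

# Approach 2: a plain random forest binary classifier over every point.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
evaluate(y_test, rf.predict(X_test), "random forest")

# Approach 3 (hybrid): flag outliers first, then train a classifier on just
# those points to separate sensor errors from real inflow/infiltration events.
detector = IsolationForest(contamination=0.05, random_state=0)
flagged = detector.fit_predict(X_train) == -1          # -1 marks an outlier
hybrid = RandomForestClassifier(n_estimators=100, random_state=0)
hybrid.fit(X_train[flagged], y_train[flagged])
evaluate(y_test, hybrid.predict(X_test), "hybrid")
```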
Now we have that model, and that's great, but a model by itself isn't very useful if I can't put it into production and actually use it. So now I'm going to show you how we can do that. The first thing I'm going to show you is the Azure ML client API. Essentially, it allows you to take your workspace and, with a couple of very simple decorators for each of your inputs and parameters, wrap a function — and when I click run, it creates a service for that function. I haven't run anything else in the notebook yet, but I can fix that; let's just quickly run through it. All right, perfect. I'm just going to run that second model, because it takes the least time to train. Perfect.

Now, when I publish this, I'm essentially putting the model into production, and you can see I now have these endpoints I can consume. If I go back to ML Studio, I can go to Web Services and consume this just like any other API, which is really cool — for people who are not Python people, you can even use it from Excel. But for everyone here: I can test it, I can run it, and what's really cool is that there's documentation that shows me exactly what to call. It's already pre-formatted for Python at the bottom, so I just click Python, copy and paste it into my script, and now I can run this model that I trained, in the cloud, in production. That's step one.

Now that I've done that, we'll test two things. First, something that's clearly an anomaly: a flow where the last four values were zero and it jumps up to ten — and we see it's flagged as an anomaly. Then we test it on normal flow, where the values are all within a standard range, and we see that works too.

The next thing we're going to do is use Event Hubs. To do this, you go to your Azure portal — I've taken the liberty of creating one here already. (How much more time do I have? Perfect, that's the perfect amount of time.) I go to my resource group, and what you do is create what's called a service bus; a service bus allows you to manage event hubs or IoT hubs for different things. Within that I just generate keys, copy them, and paste all that information in here — so I have the name of the service bus I created, the name of the key, and the key value. Again, this is all going to be deprecated at the end of the day, so feel free to take a picture now, but it won't work tomorrow. Then I initialize an API client for that service bus and create a new event hub in it, which is again one line of code, which is really nice. Then I declare this function — send to event hub. What does it do? This is what I would put on the device: it takes a timestamp, the current velocity, and whether it's an anomaly, and sends that to the event hub. And we can run this in real time to simulate what it would be like for a device to be sending information to the event hub. So now we have this device sending a reading at each time step, and if we go to the event hub's overview, we can actually see the data coming in.
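A rough sketch of that device-side sending function, assuming the legacy azure-servicebus package (roughly version 0.21) that was current at the time — the class and method names are recalled from that SDK and should be checked against the newer azure-eventhub library for current code; the namespace, key, and hub names are placeholders:

```python
import json
import time
from azure.servicebus import ServiceBusService  # legacy SDK: pip install azure-servicebus==0.21.1

# Placeholder credentials -- the real ones from the talk were revoked afterwards.
sbs = ServiceBusService(service_namespace="<service-bus-name>",
                        shared_access_key_name="<key-name>",
                        shared_access_key_value="<key-value>")

HUB = "flowdata"                       # hypothetical event hub name
sbs.create_event_hub(HUB)              # the "one line of code" that creates the hub

def send_event_hub(timestamp, velocity, is_anomaly):
    """What the device-side code would do: push one reading to the hub."""
    payload = {"time": timestamp, "velocity": velocity, "anomaly": is_anomaly}
    sbs.send_event(HUB, json.dumps(payload))

# Simulate a device streaming readings to the event hub, one per second.
for i, (ts, row) in enumerate(windowed.iterrows()):
    send_event_hub(str(ts), float(row["velocity"]), bool(row["anomaly"]))
    time.sleep(1)
    if i > 60:                         # stop the simulation after a minute
        break
```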
Then I go back to my resource group and use Stream Analytics. What Stream Analytics lets me do is process this data in real time, and all I have to do is create a query. It's a very simple query — one second, let me make it bigger so you can see it — and what it says is: as the data comes into my event hub, take it and send it to Power BI for visualization. And what's really cool is that I can now go in and visualize this data in real time. If I'm an analyst, I don't need to know anything about what's going on behind the scenes: I can watch my flow, and if there's an anomaly — and this is happening in real time — I can see those anomalies and differentiate them from inflow and infiltration, and then Carl Data can use that to build more predictive models. So again, thanks so much. I guess I'll open it up for Q&A, and if you have any questions afterwards, feel free to find me. [Applause]
Info
Channel: PyCon Israel
Views: 3,892
Rating: 4.7837839 out of 5
Keywords: pyconil, python
Id: Q61glJUcpmc
Length: 27min 56sec (1676 seconds)
Published: Mon Aug 14 2017