Scaling Machine Learning on Industrial Time Series with Cloud Bigtable and AutoML (Cloud Next '18)

Captions
[MUSIC PLAYING]

GEIR ENGDAHL: My name is Geir Engdahl. I am CTO and co-founder of Cognite. With me I have Carter Page. And we're going to talk about how Cognite is using Google Cloud technologies to enable machine learning on industrial data. In particular, we'll talk about time series data and Bigtable as two of the key technologies and problems that we are solving.

So a little bit about Cognite. We're a young company, less than two years old, just crossed 100 employees. We're working with asset-intensive industries. That basically means large industrial companies that have big machinery that costs lots of money-- a lot in the oil and gas vertical, also in shipping. And our mission is to liberate industrial data from silos and piece that data together to form a model of industrial reality, so that humans and machines can make better decisions and take better actions. So it's a model that's both real time and historic: it has the present state of what's going on, and it has all the previous data too, which is important if you want to try to predict the future.

So what exactly is an industrial reality model? Well, I'll try to show a little bit. It depends on how you view it; there are many angles from which to look at this model. This is a typical operator view. If you're in the control room in an industrial plant, this is very close to what you would typically see. This is data now streaming in live from the North Sea, with about a 1-to-2-second delay. It is data concerning a single tank on an oil platform-- outside, or inside actually. The tank is called 20VA0002. And a typical oil platform will contain anywhere from 10,000 to 100,000 sensors, time series, like this. So here you have a handful, but it's just a tiny piece of a huge machine.

So that's the real-time view of what's happening right now. You also want to see, as an analyst, what has happened in the past. Each one of these squiggly lines represents about 1 gigabyte of data. This is one year of data. I really like this chart. It's my new hobby to play with it. It's kind of like Google Earth: I can zoom in and view the data at any resolution. And given that this is about 10 gigabytes of data, you probably noticed by now that the Next Wi-Fi doesn't support downloading all that data this fast. So you need a back end that can quickly crunch the data and give you the data at the resolution that you want. And you can go all the way down to the raw data points, which will pop up when you zoom in enough.

And of course, you can view this in different ways. This is a view that humans tend to like. It's a three-dimensional view. We imported the entire CAD model of the oil platform, and we connected it with all the other data. So for instance, if we want to see the tank that we just viewed data from, which is so aptly named 20VA0002, you can see exactly where that is, what it looks like, and what it's connected to. So you can browse the data in three dimensions. Just to give you a little impression of what we mean by an industrial reality model: I want this model to be up to date and to contain both today's data and what happened in the past.

By the way, the charting library that I just showed you, we couldn't find one, so we had to make it. And we open-sourced it. So if you're interested, you can use that. It's not tightly coupled to the Cognite back end, so if you have any provider that can give you data at different resolutions, you can use it. It uses React and D3.
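To make the dynamic zooming concrete: a back end like the one described has to serve pre-aggregated roll-ups rather than raw points at wide zoom levels. Here is a minimal, illustrative sketch in Python of that kind of roll-up (fixed-width time buckets with min/max/mean/count); it is not Cognite's implementation, just the general idea.

```python
from collections import defaultdict

def downsample(points, bucket_ms):
    """Aggregate raw (timestamp_ms, value) points into fixed-width time buckets.

    Returns one (bucket_start, min, max, mean, count) tuple per bucket, so a
    client can render a wide time range without downloading every raw point.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_ms].append(value)
    return [
        (start, min(vals), max(vals), sum(vals) / len(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]

# Example: one day of 1 Hz readings rolled up into 1-minute buckets.
raw = [(i * 1000, float(i % 60)) for i in range(86_400)]
print(len(downsample(raw, bucket_ms=60_000)))  # 1,440 buckets instead of 86,400 points
```

A real service would precompute several bucket widths ahead of time and answer each query from the coarsest level that still matches the requested resolution.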
So the scale of data to be ingested is huge. And it's growing very fast. Cognite handles lots of different data types to build the model of industrial reality. That can be ERP data. It can be maintenance logs. It can be 3D models, like you saw. But if you look at the data by volume, 99.7% of the data that we have is time series data. So that's really where the huge data is. Even though the 3D model was large-- it was about 10 gigabytes-- the time series data dwarfs that. And it's exploding. And all of that time series data needs to go somewhere. It needs to be stored. It needs to be processed. It needs to be queryable. So how do we do that? What is under the hood?

So when we started out building Cognite, we started with a few principles. One of them is impact. It kind of seems strange to say this, but you have to write it down for it to matter. It's easy as a technologist to build technology for technology's sake because it's cool. And I've been guilty of that in the past. We've been lucky to have large, demanding customers very early on to guide us in finding out what the real use cases are to create real value. And then there's speed. You want to show something as fast as possible so you can iterate and get feedback. Put those two together and there's a consequence: you want to use managed services wherever you can, especially for anything that is stateful. Because running a stateful storage service that has to scale up and down, that has to have backups, have redundancy, have all the logs for who accesses the data, et cetera-- all of that stuff is just painful to implement. And it's going to slow you down. And it's not something that-- yeah, it's already been done.

So we recognized very early that we needed a time series database. And our hypothesis was that we could get this as a managed service too, that there would be something out of the box with API support for this. Our requirements were that it would be robust and durable, meaning we don't drop data. No data points should be dropped. It would have to support a huge volume of reads and writes-- writes in particular. You always get new data in. Low latency, so we can show the real-time version of what's going on. You want to see data at any time scale-- the zoomed-out view and the zoomed-in view that I showed you. You want to be able to efficiently backfill. So if you're onboarding a new customer, and that customer has a million data points per second being generated, and you can handle 2 million, then backfilling is going to take a long time. If they have a year of data for you to backfill, it's going to take another year before you're done with that, because you're going to spend 1 million per second of your capacity on the new data and only 1 million per second on the old. And you want to be able to efficiently map over data in order-- so sequential reads, for training models, for instance.

So we experimented with OpenTSDB at the beginning. It's a great piece of software. The cool thing is you can use Bigtable, which is a managed storage back end-- so you can run OpenTSDB with Bigtable as the back end. Which is very nice, but it had a few shortcomings. For instance, it's not durable. That means if you send it a data point, it will acknowledge that it got the data point before it's actually written, which means you can potentially lose data if you're scaling it up and down.
And it essentially uses the front-fill path for batch backfills, which made backfilling very inefficient. And there were a few other things as well. So we chose to build our own time series logic on top of Bigtable. Bigtable is a fully managed service, which, as you know, we really wanted. It supports a huge number of reads and writes per node. It's been tried and tested on very large user-facing distributed systems at Google. And it has this property which most time series databases-- most key-value stores-- don't have, which is that you can scan forward efficiently. The keys are stored in order. A lot of key-value stores will hash the keys so that you get the load distributed evenly. But Bigtable doesn't do that. So that means you don't have to jump around when you're reading sequential data. The flip side of that is that you can run into situations where you get hotspotting. So you need to write your code around that. But for us, it's a price that's been worth paying. So I'll hand it over to Carter Page here, who is senior engineering manager for Cloud Bigtable.

CARTER PAGE: Thanks, Geir. So I'll talk a little bit about--

[APPLAUSE]

I'll talk a little bit about Cloud Bigtable-- how it's a good fit for IoT and why we see more customers coming to it. I do want to say I'm particularly excited to be presenting with Cognite. I think the stuff that they're doing is really neat. I think his point about doing impactful things, doing a comprehensive story of IoT, is very exciting. The idea of not just connecting those devices and getting the data, but once you've got literally tens of thousands of devices-- way more than a single human could actually monitor-- thinking about how you extract data, react to that, and manage very high-risk asset situations. And he's going to get into some really cool stuff after this.

But let me talk a little bit about Cloud Bigtable and how it is a good fit for these types of use cases. A quick show of hands, just to get a sense of the audience: who is familiar with distributed databases like Cassandra, HBase, things like that? OK. All right, so this is not going to be rocket science to most people. The main thing, particularly for large IoT use cases where people are looking at collecting massive amounts of metrics, is being able to handle this really large-scale traffic. A couple of years ago, for example, we did a load test with a company where we basically simulated the entire US trade markets all being processed together, and we processed 25 billion records in about an hour. And that's possible just due to the scale of Cloud Bigtable and how it works. We were peaking at about 34 gigabytes per second and about 34 million operations per second on Bigtable, on a single instance. And the reason this works is because Bigtable was built for very high scalability. You essentially get linear characteristics way out on the curve.

So Bigtable was initially designed by Google as a backing store for our crawler, and so it was built to keep a copy of the World Wide Web. It's expanded internally; it's being used for a lot of other products as well. And so we have put about 14 years of engineering into finding new upper limits and breaking those. In every distributed system, eventually you have this straight line that goes out, and eventually it flattens out. Everything eventually hits a bottleneck.
Or you might hit something where it just-- if you've got an HBase cluster with 1,000 machines, you're going to hit probabilistic machine failures, and there's an overhead for your operations and things like that. So we will eventually flatten out, but pretty far out, and it would take a lot of work for you to get up to the scale where you would notice. The reason the linear scaling is important, from a business perspective, is that it gives you predictable cost of revenue. So when you're thinking about building a system and you're like, I've got a terabyte of data, and now this has to go to 10 terabytes, a petabyte, 10 petabytes-- usually if you're building this on top of your own home-managed Cassandra system, you are going to have to rethink each time you hit one of these new tiers: all right, now, how am I going to deal with this? I've got a lot more machines. I'm going to need a new on-call rotation. I'm going to need new strategies to be able to deal with this. Here it's just a matter of cranking up your nodes. And the number of nodes you need is proportional to the throughput that you need.

To give a quick overview of how Cloud Bigtable works, you have clients that basically talk to a single front-end endpoint that load-balances to the nodes. So you don't need to think about address mapping or talking to individual nodes themselves. I'll take that layer away and talk a little bit about what's going on underneath the covers. The data itself is being stored durably in an underlying file system called Colossus. And the Bigtable servers themselves are actually not storing any data. They are taking the responsibility for serving the data. And every row is assigned to only one node. So your entire keyspace is basically balanced across these different nodes. And having a single entity responsible for an individual row allows for atomicity of operations on it and allows for read-your-own-writes.

The advantage of having the file system and the serving nodes dissociated is that it allows us to do some clever things in terms of being able to rebalance workloads very aggressively. So you may have a customer that has changed their underlying workload, which then impacts how you are using Bigtable. Or you may have diurnal patterns. You may have different things that are coming on and going off during the middle of the day, which changes which tablet servers or nodes are getting more or less activity. And what we'll do is we'll actually identify these changes in patterns, and we will just reassign areas of data to different nodes. So this allows a couple of things. One is it allows you to not have to worry too much about per-node hotspotting. You can hotspot individual rows, which is a problem. But in terms of getting unlucky and having one server that's hotter than the rest, we'll balance that out. It also means higher utilization. By keeping things balanced, you're not having to provision for the hottest node. We're trying to keep all the nodes fairly well balanced. And that actually means a cheaper service relative to running this on an HBase cluster or Cassandra cluster.

In addition to rebalancing, you can resize up and down fairly trivially. We have some customers who might have ingestion workloads which only need a few nodes. And they might run a batch at the end of the month or the end of the week. And they might want to scale that up to, say, 300 nodes. And you can do that fairly instantaneously.
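As a hedged illustration of that kind of programmatic resizing, here is what it might look like with the google-cloud-bigtable Python admin client; the project, instance, and cluster IDs are placeholders, and this is a sketch rather than anything shown in the talk.

```python
from google.cloud import bigtable

# Admin client; the project, instance, and cluster IDs below are placeholders.
client = bigtable.Client(project="my-project", admin=True)
cluster = client.instance("my-instance").cluster("my-cluster")

cluster.reload()              # fetch the current cluster configuration
cluster.serve_nodes = 300     # scale up before the weekly batch job
cluster.update().result()     # update() returns a long-running operation

# ... run the batch workload ...

cluster.serve_nodes = 3       # scale back down once the batch is done
cluster.update().result()
```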
If you've got a really large data set, it may take 10 to 20 minutes for the data to rebalance over the nodes you just added. But it can be a good way to make those batch jobs you run once a week run really fast. When you're done, you scale it back down again.

The basic data model is key-value, but it has more dimensions to it. You have a single index, which is your row key. And then your data is stored in columns. The columns are a tuple of, basically, a column qualifier and a column family. The column family is defined in your schema. And the column qualifier is defined at insertion time. The table is sparse, so any column family/column qualifier tuples that you don't fill in for a given row don't count against your actual space. The database is also three-dimensional: under each of those cells is an arbitrary number of versions. So you can keep them there pretty much indefinitely, as long as your row doesn't get beyond a couple of hundred megs. Or you can install a garbage collection policy-- say, wipe out any data that's over a week old, or just keep the last five versions, or something like that.

Wednesday, yesterday, we announced that we have replication now. So between two Bigtable clusters in a region, we will replicate the data between them. This has a few advantages. The first is it expands your failure domain. So you're no longer [INAUDIBLE] into the failure domain of a single zone. You've got two zones as the failure domains. So that gives you higher availability. And another advantage that people use, particularly because the replication is asynchronous, is workload isolation. So some customers may have their critical low-latency serving loads on one cluster and then maybe do batch workloads on another. And by doing it on the other one, they're not interfering with each other, essentially. The effective result of this, on the high-availability side, is we get an extra 9 onto our SLA. We have a high-availability application profile. I won't totally get into those right now, but you can go read up on them. You have application profiles that define how you want your traffic to be routed. And if you use the high-availability one, you'll get automatic zero-touch failover. So if there's any problem in the middle of the night, you don't have to get paged. It'll just fail over for you.

Being a big data tool and a database that's designed for terabytes-- petabytes, actually-- we need really powerful tools to let you take the most advantage of your database. And so we've got deep integrations with BigQuery, Dataproc, and Dataflow. With BigQuery, the queries are not as fast as if you're going against native BigQuery storage, because BigQuery is an offline store, and Bigtable has certain optimizations to be online, to get single-digit-millisecond latency, which makes BigQuery queries over Bigtable run a little slower. But it's nice because you can do ad hoc queries on your data without having to write a MapReduce job. If you do want to write a MapReduce job, you can use Cloud Dataproc, or you can use our internal replacement for MapReduce, which is housed inside of Dataflow. And then also this week, we're announcing that we have a deep first-order integration with TensorFlow, which just got put onto GitHub. So people can start playing with that. So I'll hand it back to Geir here.

GEIR ENGDAHL: Thank you. So with that background, I'll go into a little bit of detail on how Cognite is using Cloud Bigtable to store its data. And then we'll move on into machine learning.
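Before getting into the Cognite-specific schema, here is a hedged sketch of the generic data model Carter just described (a column family declared up front with garbage-collection rules, column qualifiers and versioned cells supplied at write time), using the google-cloud-bigtable Python client. The project, instance, table, and key names are placeholders, not anything shown in the talk.

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("sensor-data")

# Garbage-collect a cell if it falls beyond the last five versions or is older
# than a week (a union of the two example policies mentioned above).
gc_rule = column_family.GCRuleUnion(rules=[
    column_family.MaxVersionsGCRule(5),
    column_family.MaxAgeGCRule(datetime.timedelta(days=7)),
])
table.create(column_families={"ts": gc_rule})

# The column qualifier is chosen at insertion time, and each cell is
# versioned by its timestamp.
row = table.direct_row(b"sensor-20VA0002#2018-07-26T00")
row.set_cell("ts", b"values", b"\x01\x02\x03",
             timestamp=datetime.datetime.utcnow())
row.commit()
```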
And eventually you'll see what the windmill can do. So Carter talked about the data model of Bigtable. This is our basic data schema. The first thing that is very important is how you choose your row keys, because that's the only thing you can look up fast. Our row keys consist of a sensor ID plus a timestamp, or rather a time bucket. That means you can look up, very fast, the values for a particular sensor at a particular point in time. And you can also then scan through all the values of the sensor in order, which is nice if you want to train a model. And that's a very inexpensive operation. Inside of each row, we store more than one value. It's not just one value per row; we stuff a lot of data points inside each row, typically around 1,000 data points per row. So you have the timestamps, which are unique timestamps, and then you have the values, which are floating-point numbers. And of course it's binary stuff. It's just drawn out in readable numbers here, but it's all binary to save space. And actually, Bigtable does compression for you too. So if you stop ingesting new data into your Bigtable instance, you'll see that the total size of it goes down. Which can be scary at first if you don't know what's going on-- where's my data going? But it's actually a good thing.

Here's how we architected our data ingestion pipeline. Every step along this path is auto-scaling. Carter talked about how easy it is to scale Bigtable, and we have a service which looks at the load and then scales it up and down. It's pretty simple logic. So it starts with Cloud Load Balancing, and then an API node, which is a Kubernetes service, which handles authentication and authorization. Then it will put the data point onto a Pub/Sub queue. And then it will say to the client, we got your data, we're not going to lose it. Pub/Sub is another component that we use a lot. It's a very nice component in the way that it scales to whatever you ask for. If you look at the documentation for Pub/Sub, it will say the limit on the number of publish operations is unlimited, and subscriber operations unlimited. So that's kind of a bold statement. We haven't run into the limit there. Once it's on the queue, it gets picked up by a subscriber to that queue, which will package the data and write it to Bigtable. So that's where our time series writing logic lives. And once it's been written, a new job is put on the queue to deal with aggregates. Those are roll-ups that we use to be able to answer queries at any arbitrary time scale efficiently. So that's what we need in order to do the dynamic zooming that you saw at the beginning.

Your KPIs for this kind of pipeline are throughput and latency. That's typically what you'd look at. The data that comes in is queryable after 200 milliseconds at the 99th percentile, and we regularly handle 1 million data points per second. And I'm pretty sure it could do much more than that too. Querying is much simpler-- it's a synchronous operation, so it goes to the API node and then straight to Bigtable. One of the optimizations that you want to do if you want to transfer a lot of data here: typically API developers like to have JSON data, and for most applications, like dashboards, that makes sense. But if you're writing a Spark connector to your API, and you want to run machine learning on that and transfer lots of data, then the JSON serialization becomes an issue.
And so you want to use something like Protocol Buffers or another binary protocol for that. And it's not really the size that matters, because if you gzip the JSON, it's very small anyway, since it's very repetitive. It's the memory overhead of doing that serialization. So that's a nice optimization.

I want to talk a little bit about cleaning of industrial data, because it's something that's often overlooked. If you want to make a useful application in the industrial IoT space, it's not enough to have time series and AI. A lot of people are running around saying time series plus AI is going to solve everything. A very simple question: if you have 100,000 time series coming from an oil rig, and you want to make your predictive model for this one tank that we saw, which time series will you pick? And how will you pick those? Are you going to manually go over all the diagrams? It's going to take you a lot of time. So typically, we see these data science projects, and they are really about finding the right data. That's what 80% of the time is spent on. And then at the end of the project, you have a wrap-up where you try to model something. If you want to truly understand what's going on in the industrial world, you need to be able to get data from a lot of sources-- data like the metadata of the time series, the equipment information-- who made it, when was it replaced-- failure logs from the past, work orders, 3D models, and the process information-- how things are physically and logically connected. Which component is upstream of this one? And it's not enough to have all this data in one place. It needs to be connected. The hard part is connecting it. And the glue that holds this together is the object in the physical world. The unfortunate thing is that the same physical object has a different name depending on which system you ask. So we spent a lot of time on this contextualization, figuring out how we map the IDs from one system to a unique ID for each asset, for each physical object. So if you look at the cleaning pipeline there, there's this thing called an asset matcher, which we spent a lot of time developing, and which will assist experts and, in many cases, do automatic mapping of IDs from one system to another to be able to make this connected, contextualized model.

So you're probably wondering now what this windmill does, and why it's here. I'm just going to say a little bit more, and then I'll get to it. Predictive maintenance-- you've probably all heard about this. There is a great business case for predictive maintenance. We have seen cases where a single failure on a piece of subsea equipment costs $100 million to fix. So obviously you want to prevent that. But this is also why it is so hard to do: the failures are fairly rare. There is not a lot of labeled data. Imagine what it would cost to get enough labeled data to validate your model, let alone train it. So you're typically stuck with unsupervised approaches. And for anomaly detection, we've seen two classes of approaches. One is forecasting-based. It means that you take a set of sensors, you hide one of the values, and you try to predict it using the others. If your prediction is far from the actual value, then you flag that as an anomaly. The other approach is you take your sensor data, your set of sensors, you plot them in an n-dimensional space, and you see which points are close to each other.
They form clusters, and those clusters typically represent different operational states. So you'll have a cluster around the running state. You'll have a cluster around the idle state. You'll have clusters around powering up, powering down, maintenance. And if you have a new point, and it doesn't fit-- it's far from any of these cluster centers-- then that is an anomaly.

Let's look at this live. So this demo is as live as it gets. It has a lot of moving parts, literally. Everything that you see here is live. There is no pre-trained model. There is no pre-generated data. The data is going to be created right here, right now. And we'll train the model, and we'll see if it works. So are you excited?

AUDIENCE: Yeah.

GEIR ENGDAHL: OK.

[APPLAUSE]

Me too. So let's see if we get data from this wind turbine now. And it looks like we do. So this is a different view of the wind turbine. It's a 3D view into what's going on there. It has the sensor values. I can turn this knob. I can increase the speed. You'll see it will start to produce more energy. It's producing a lot of energy for a wind turbine this small. So let's go into Jupyter. I'm sure those of you who are working with data are familiar with this. We're going to interact with our Cognite API via our Python SDK, which is also open source. First we're going to just log in here. We're going to select the right data. And then we're going to plot it. This live plotter will show the analyst's view of this. And you see, if I adjust the speed, it will go down. If I take it up, it's going to go up a bit. And this, what you see on the screen, is going to be our training data. So I'm going to give it a little bit more time, so it's seeing a little bit of normal operation of this wind turbine.

We painfully brought this wind turbine here for you. It's 3D printed. It looks very homemade. We got it through airport security. I was taken aside by security here at Google Next. They were wondering what this base is, because the wind turbine part wasn't mounted on it. It has wires coming out of it, and it has this scary red light on it. So I've done a lot of explaining to bring this here.

Now I need to stop this plotting to move on. I will create an anomaly detector for this, and I will select a time range for it. This is pushing the training operation up into the cloud, so it doesn't actually happen on my computer. The SDK will just do an API call to train a model. And then we get a job ID that we can query for the status of the job. It's not a lot of data right now, so it took very little time to train. And now we can create another plotter. It's going to plot live data from this wind turbine again, but this time, it's also going to plot the output of the predictor, the anomaly detector. So how can we introduce an anomaly here? Well, I'm going to use brute force. I'm going to hold it back here. It takes a little bit of time for the data to appear in this Python thing. And now you'll see it's detected an anomaly. You see the red background there.

[APPLAUSE]

And it should go back to normal again once I've let it go. So there you have live anomaly detection. Now it's back to normal. Operators can use this to monitor-- if you want to monitor 100,000 time series, you can't do that manually. You can't put it up on a screen and have people watch it. But now operators can be alerted to strange conditions, and they can look into what's going on, and hopefully prevent the next $100 million failure before it happens.
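The cluster-distance approach Geir described before the demo can be sketched in a few lines. This is a generic illustration with scikit-learn, not Cognite's actual detector; the sensor data is synthetic, and the number of clusters and the threshold percentile are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sensor vectors captured during normal operation (rows are time steps,
# columns are sensors, e.g. rotor speed, power output, vibration).
rng = np.random.default_rng(0)
normal = rng.normal(size=(5000, 3))

# Fit clusters that stand in for operational states (running, idle, ...).
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(normal)

# Distance from each training point to its nearest cluster center; the
# anomaly threshold is a high percentile of those distances.
train_dist = model.transform(normal).min(axis=1)
threshold = np.percentile(train_dist, 99.5)

def is_anomaly(sample):
    """Flag a new sensor vector that is far from every operational-state cluster."""
    dist = model.transform(sample.reshape(1, -1)).min()
    return bool(dist > threshold)

print(is_anomaly(np.array([0.1, -0.2, 0.05])))  # near a cluster: False
print(is_anomaly(np.array([9.0, 9.0, -9.0])))   # far from all clusters: True
```

A production detector would also normalize the features and choose the number of clusters from the data, but the core idea matches what the demo shows: the distance to the nearest operational-state cluster is the anomaly score.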
So another very useful thing that we do-- once you've detected this anomaly, your first question is going to be something like, has this happened before? Or has something similar happened? So we implemented something called similarity search in the API, which is not really machine learning, but it is a very useful thing. And it is very computationally expensive to do. You can take a time period of a set of sensors and look for that pattern in a different set of sensors, or in the same one, and find similar portions.

So this talk also has "AutoML" in the title. So AutoML-- on oil rigs you have hydrocarbons. And if there is a leak and you have a spark, then it's potentially very dangerous. So they're always looking to see if there are faulty wires. Faulty wires can be very dangerous. Typically what they do is they have these regular inspections every six months or every year, where they go over everything. But these damaged wires don't appear gradually over time. They're typically the result of someone stepping on something, or something mechanical happening at a particular point in time. So what we have tried to do is build a model which can detect faulty wires, so that you can wear a camera on the helmet, or use CCTV, or other ways of gathering images, and in the background always look for this kind of failure. We're not trying to replace those inspections. We are trying to augment them and make them even better.

So we trained a model, using TensorFlow, to detect faulty wires and also tell us where in the image the failure is. This is a project that we spent three months on. And we were able to get to a particular accuracy: about 95.5% on precision and recall. And then in June, we got access to AutoML. So we figured we would upload the data set there and see how well that would do. I'll show what that looks like. Training a model with AutoML: if you can upload a bunch of images onto a web page and edit a CSV file, then you can do it. So we did that. And in five minutes, I was able to get a model with 96.1% and 96.2% precision and recall. So that was half a percentage point up. But then we didn't give up. We tweaked our model a little bit more, so we were able to get that one even higher. We got to 95% precision and 99% recall on our own model, which is better on one metric but not the other. But the big difference, I think, is that we spent three months doing it, and it took five minutes to train the AutoML model.

So let's see what that can do. On my way here, on the first day at Moscone, I found this wire outside, and it looked dangerous. So I ran the AutoML model on it. And it was not part of the training set. This is kind of a new hobby for me, to go around and look for faulty wires. But yeah, it really caught that quite well. And on my way to the rehearsal, which I did back in June, I also found this faulty traffic light wire. This is the button that you push to make the pedestrian light go green. And it also catches that really well. So I think it is possible to do this. And I think there is a class of problems that you can solve in industry-- like rust and leaks-- using camera feeds as a sensor, and that can be very powerful and instrumental in getting people out of harm's way. So I'll hand it over to Carter to wrap it up.

[APPLAUSE]

CARTER PAGE: Thanks, Geir. If we have any time-- yeah, we'll have a little time for questions after this. I'm sure Geir will be able to join us there. So, a quick review again.
Cognite basically built the system on top of GCP, focusing on the scalability and the throughput of Cloud Bigtable. And using the capabilities of AutoML, he was able to take a training exercise which took three months with TensorFlow and do it in five minutes with AutoML. Which is really exciting for companies that want to tap into this type of technology without having to hire an army of data scientists. Not everyone can hire a Geir for their company.

Also, one of the things-- actually, I don't know if it was clear when things were switching around when he did the windmill-- you saw some code, but he didn't actually program the training model for that. He just highlighted the part that he had done before and said, this is bad. And so they were able to put this out in the field. They don't need to have programmers. They can have engineers who know what looks bad on a graph highlight that, and then expand that out to tens of thousands of metrics and be able to quickly identify things that are going wrong. And that's really exciting, to be able to scale these things out to industry.

We have replication now in Cloud Bigtable, which is going to provide higher availability and some great features around workload isolation. And it provides a great basis for large-scale data analysis and machine learning.

If you want to learn more about the technology you saw up here, here is a handful of links. There is the Cloud Bigtable main product page at cloud.google.com/bigtable, and there's a bunch of material underneath that. And then we also have the various machine learning pages up there. You can learn about machine learning in general under the products page. And then there is a master page for AutoML. And at the bottom there is the TensorFlow integration with Bigtable that we launched this week. And so that's all we have.

[MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 6,350
Rating: 4.964602 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: HYvAPjukKic
Length: 44min 53sec (2693 seconds)
Published: Thu Jul 26 2018