Building an Open Source Streaming Analytics Stack with Kafka and Druid - Fangjin Yang, Imply

Captions
Alright, hey guys, I think we can get started here. First off I want to thank everyone for attending this session, right after lunch on the last day, so I appreciate you guys being here. My name is Fangjin, I'm a co-founder of a technology startup based in San Francisco called Imply, and also a committer on one of the open source projects I'm going to be talking about today, called Druid. So today I wanted to talk to you guys about this idea of building an open source stack to handle streaming data, which is a type of data you find very often when working with IoT devices. What I hope to cover during this talk is a little bit about the problems that you face with high volumes of event data, and then a couple of different technologies designed to solve various problems when dealing with high volumes of streaming data. Those pieces consist of a message delivery piece, a data processing piece, and a piece to serve queries.

So we start off with a problem, and I'm sure by now you guys are very familiar with it: connected devices are growing very, very rapidly. You can now tweet from your fridge, so the whole world can know what you're having for lunch or what you're having for snacks. That's kind of cool, but connected devices are everywhere, they're continuing to grow in popularity, and as a result of that, data is growing very, very quickly as well. I think a lot of the problems people talk about when talking about IoT and this entire space come down to the fact that there's a ton of data that gets generated, and this data is often very, very valuable, because when you extract insights from it there are important decisions or optimizations you can make. So the problems I want to focus on today are really just around collecting a massive stream of data and then making sense of it.

Now, any connected device generally emits a stream of events, and these events are often called messages or logs depending on how you read the literature, but they're really just bits of information describing what's happening at a particular period in time. When I look at events emitted by devices, I see them typically being composed of three components. There's a timestamp indicating when the event was created. There's a set of attributes around the event: these are properties that either describe the device, describe what's happening on the device, or describe something of interest. And then as part of this event there's a set of measurements, and these are the numbers of interest. When we get a bunch of these events, generally what we want to do is calculate a variety of statistics based on the measurements, and when we calculate these statistics we often want to group on the attributes or filter on the attributes. By doing so we obtain more insight into the data, and once we achieve that insight we can make decisions based on the findings.
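Just to make that shape concrete, here is a minimal sketch of what one of these events might look like as a plain data class; the field names are made up for illustration.

```java
// A hypothetical sensor event following the timestamp / attributes / measurements shape.
public class SensorEvent {
    public String timestamp;    // when the event was created, e.g. "2017-02-28T12:00:00Z"
    public String deviceId;     // attribute: which device emitted the event
    public String location;     // attribute: where the device is installed
    public double temperature;  // measurement: a number we aggregate over
    public double humidity;     // measurement: another number of interest
}
```

A question like "average temperature per location over the last hour" is then just a filter and group-by on the attributes plus an aggregation over a measurement.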
To give you an example of some of the analytics you can do off of a stream of data, I thought it would be useful to showcase a short example, and for the example I'm just going to be using a UI here; let me make this a little bit bigger. I have a couple of different streaming data sets here. There's one for edits appearing on Wikipedia, and there's your more standard IoT data set, which is air quality data that's been collected from various sensors. That one is actually kind of boring, so I'm going to pick one of these other data sets to demo instead.

Really, with device data or any activity stream, they tend to have a lot of commonalities, which is, as I previously mentioned, a set of attributes that describe an event and a set of measurements. What I'm trying to showcase here is edits on Wikipedia. Every time someone makes an edit on Wikipedia there's an event that gets generated, and I actually collected these events and put them into a UI just to showcase the workflows that you can do. So there are edits, there are attributes around the edit, like the page being edited, the time the edit occurred, the user doing the edit, and then there are your measurements, which are the number of edits, the number of characters added and deleted, and so on and so forth.

As I mentioned, most of the time what you want to do with this data is filter it in some way, or group on the attributes in some way, to get some interesting information. So right now we're looking at the last seven days: there were about 2.8 million edits on Wikipedia, with about 1.1 billion characters added and 37 million characters deleted. If we want to look at how these edits were trending across time, we can do that, and maybe we want to pick some arbitrary time range here. If we want to look at the top pages being edited, that might provide some interesting insight, and we see that there's a lot of different types of stuff you can edit on Wikipedia, so maybe we just want to filter on the articles being edited. In this arbitrary range in time that we picked, we see "Deaths in 2017" is pretty prominent up here, the Championship League, North Korean dictators, and DeMarcus Cousins from the NBA. Maybe we want to break this down and understand a little bit better who the top users making some of these edits are. We can do that for "Deaths in 2017", for example: this particular person has been doing a lot of edits for this arbitrary range in time that we selected. And maybe we want to go a little bit further in our analysis, like filtering out the users that are bots. So we can look at all the users who are not bots, and this one particular user here editing "Deaths in 2017" doesn't seem like a bot. Maybe we want to look at, for this particular user, what other pages they're editing over this span of time, and it seems like this is the only one they're editing.

So this is the type of workflow I'm trying to demonstrate: taking an event stream, breaking it down, grouping it, filtering it, trying to expose insight, trying to look at various metrics. This is something very common that you do with really any activity stream, and for the end user, most of the time they access this data either through some command line or through some sort of UI or application.

When we're going about trying to make sense of a lot of IoT data, I think there are a couple of different problems to solve. Obviously, when there's a very small amount of data, when there's a low volume of data, the problems are very easy, and you don't see a whole lot of talks describing various technologies to use. But with IoT data, with device data, and especially with the current growth of data, even problems that are seemingly trivial at a small scale become very difficult when you have a very large event stream. I think there are three main problems that people generally try to solve. The first is around event delivery: when some event gets generated by a device or as part of some activity stream, you have to deliver it from where it is created to someplace where it can be consumed and analyzed.
Just getting an event from one place to another at a very large scale can be a pretty difficult problem. The second problem that you face when dealing with large volumes of event streams is around processing the events. Raw data is oftentimes not very useful; there are a lot of imperfections in it and a lot of caveats to working with it. So to make that data useful, a little bit more consumable by analysts and users, you generally have to process the events: either cleaning them, adding business logic, or potentially transforming them in some way. And then the third problem is just taking this processed data and making it available, making it available for queries, making it available for applications, so people can analyze it and gain insights from it.

As I mentioned, I think each of these problems is very difficult, and it's very difficult to find a single solution that will solve all three of them. So for this talk I'm really going to talk about three separate systems and why I think each of them is particularly good at solving one of these problems. In the model I described, the way it looks right now is that you have data getting emitted from devices or activity streams, and at the end of the day you want to have applications or users making use of this data, and in between I've broken the problems down into three main pieces: the first is delivery, the second is processing, and the third is querying.

So first I'm going to talk about data delivery, and this is the problem of getting events from where they're produced to something that can consume them and do something further with them. On one side of the delivery system you have a set of data producers, and on the other side you have a set of data consumers. There are a couple of different problems associated with data delivery at scale; most of them are related to having high availability in the face of different types of failures, and also having a very fast and scalable system to deal with high volumes of events. Failures can occur for a variety of reasons. For example, if your network is out, how do you prevent events from being dropped? If the thing that's consuming the data completely fails, how do you ensure that events are still getting delivered into your systems? And what happens when you want to have multiple data consumers? For example, if you have some very critical data set, maybe you want many different systems to get access to that data. At scale these can all be pretty difficult problems.

Thankfully, there's a pretty good open source system called Apache Kafka which is very good at dealing with the data delivery problem, and nowadays Apache Kafka has really become the open source standard for this problem. If you've never heard of Kafka before, it was initially built and open sourced at LinkedIn, and since then the open source community has grown very rapidly, so there are many companies using it for vast amounts of activity streams. The way that Kafka works is that there's a notion of producers and consumers. Kafka comes with a producer library that you can embed as part of your application, or as part of the thing that produces data, and it has a consumer library that you can use to pull data from Kafka. In between there's a set of Kafka brokers, and the brokers store events in a distributed message log, or message queue. The idea is that producers write events to these distributed logs, and events are grouped logically as part of a topic; a topic is like a table in a database, it's some grouping of events.
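As a rough sketch of what the producer side can look like in code, assuming the standard Kafka Java client, with the broker address, topic name, and event payload made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeviceEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A device event: timestamp, attributes, and a measurement, encoded as JSON.
            String event = "{\"timestamp\":\"2017-02-28T12:00:00Z\","
                         + "\"deviceId\":\"sensor-42\",\"metric\":\"temperature\",\"value\":21.5}";
            // The topic ("device-events" here) groups related events, like a table groups rows.
            producer.send(new ProducerRecord<>("device-events", "sensor-42", event));
        }
    }
}
```

The producer only appends to the log; everything about who reads the data, and when, is the consumers' concern.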
Kafka is a distributed system, so it's designed to run across many, many servers. Producers write events, these events get spread across different partitions on different servers, and you can then have consumers pull the data from these partitions, read it, and do something interesting with it. What's kind of different about Kafka is that the consumers are responsible for maintaining information about which messages they've read; Kafka itself doesn't keep track of how far a consumer has read, so in that sense it's very low overhead to add many, many consumers of the data. All Kafka does, very simply, is buffer data and allow a bunch of different consumers to read that data.

So Kafka, as a summary: it's a high-throughput event delivery system. It provides at-least-once delivery guarantees, which means that if you produce an event and transmit it, it's guaranteed to be delivered; it might be delivered more than once, so you might get a duplicate event here or there, but it will be delivered. And I know the Kafka folks are working towards having exactly-once delivery, which is a very, very difficult problem with streams. It has a very simple, straightforward design: it's basically just a distributed log that you write events to, and its main purpose is to buffer incoming data so consumers have time to consume it. In this model you have a logical separation between the things that produce data and the things that consume it, so if the consumers all fail, the producers can still write data to this intermediate buffer, and when a consumer comes back online it can read messages from any particular offset in these different buffers. So Kafka, I think, is really great for getting events from one place to another.

The piece that makes a lot of sense after Kafka is a stream processing piece, and the purpose of stream processing is really to transform or modify raw data in such a way that it's more easily consumable by systems. Raw data oftentimes has many imperfections: it might have null values, it might have random IDs that you need to replace with human-readable strings; there are a lot of things you have to do to raw data before it's usable. What the stream processing systems are designed to do is take a stream of events and then transform it in some way. In the open source world there are actually many, many different types of stream processors, and they all have various trade-offs. I'm not going to go into details about how all of them work and will focus on the high-level idea instead, but the different stream processors out there include Spark Streaming, Apache Flink, Apache Storm, Apache Apex, Apache Samza, and Kafka Streams, and more are coming up every other day. The main challenges in processing a stream and transforming its data are that, like any system dealing with a high volume of data, the system needs to be highly available in the face of different types of failures, and it also needs to be pretty scalable, so it can handle massive event streams and do something interesting with them. The way that most stream processors work is that you transform data in a series of stages, because it's actually very difficult to have one big job that takes your raw data and makes it into something useful.
So instead you have many small jobs, and each of them modifies the stream in a different way. For the stream processing piece, I've used a couple of different types of stream processors, but the one I like the best architecturally is Apache Samza. Samza also was first developed and open sourced by LinkedIn, and it's probably a little bit less popular than Spark Streaming and many of the others, but I think architecturally it has some really nice properties. The way that Samza works is that you have an input stream of data, and this could be data in Kafka; Samza pulls that data and then applies a series of transforms on it, and these transformations are called tasks. It's up to the user to write the logic for the tasks, for the tasks to do something interesting with the stream. You can have the same stream go to multiple tasks, and have more than one task write to an output stream, and this output stream might be stored in a system like Kafka as well. And as opposed to one transform task, you might have a series of tasks that modify the stream in different ways; later on in the talk I'll give you an example of this for a real-world application. So the way Samza basically works is that it breaks up processing logic into a bunch of logical stages, or tasks, and what's really nice about Samza is that for each of the tasks that you write to process the data, you can tune different resource requirements. Certain tasks involving simple operations don't require a lot of resources, while tasks that do more complex transforms of the data might require much more, and in that sense I think Samza has really nice operational properties for transforming data.
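To make the task idea a little more concrete, here is a minimal sketch of a single stage written against Samza's low-level StreamTask API; the output topic, the field names, and the assumption that messages arrive as JSON-deserialized maps are all made up for illustration.

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/**
 * One small stage in a pipeline: read raw events, apply a simple transform,
 * and write the result to an output stream (here, another Kafka topic).
 */
public class CleanEventsTask implements StreamTask {
    // Hypothetical output topic for the cleaned stream.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "events-cleaned");

    @Override
    @SuppressWarnings("unchecked")
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

        // Example transforms: fill in a default for a missing value and
        // replace an opaque ID with something human readable.
        event.putIfAbsent("country", "unknown");
        Object deviceId = event.get("deviceId");
        if (deviceId != null) {
            event.put("deviceName", lookupName(deviceId.toString()));
        }

        collector.send(new OutgoingMessageEnvelope(OUTPUT, event));
    }

    private String lookupName(String id) {
        // Placeholder: in a real job this might consult a local store or config.
        return "device-" + id;
    }
}
```

Each such task would be wired to its input stream and serdes in the job's configuration, and a pipeline is just several of these small jobs chained together through Kafka topics.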
So the third and final piece I want to talk about here is: okay, you've taken your data from where it was created, you've delivered it, you've transformed it, you've done something interesting with it, and now you want to go about querying it. The query system probably has the most complex requirements, and it probably also has the largest number of different choices you can use. Usually what I see people want to do with a very large stream of data is issue interactive queries: if you're accessing this data through an application, if you're slicing and dicing the data, you generally want queries to complete very quickly. However, the data might be complex in ways that prevent queries from completing quickly. Complexity of data might mean very high cardinality dimensions, so you might have a dimension which has tens of thousands or even hundreds of millions of unique values, and sometimes you want to do operations across those values, and that can be very slow. As I showed earlier in my demo, oftentimes you want ad hoc analysis: when you're looking at a stream of data you might not always know what it is you're looking for; you might just be looking at a spike or a drop and trying to understand more about why that spike or drop occurred, trying to analyze the root causes of some pattern that you've seen. Another challenge is that a lot of traditional query systems, especially databases, tend to be designed more for batch loads and a little bit less for loading a massive stream of event or device data. And once again, of course, because we're dealing with very high volume event streams, high availability and scale are always challenges in the background.

But I do think that once you're able to solve some of these challenges, you can remove a lot of barriers to people understanding their data. The reason why I think sub-second queries are very important, and not everyone might feel this way, is that sub-second queries allow for iterative exploration of data: you look at one view, you might see something interesting, you look at another view based on what you saw previously, and it's a very iterative process of asking questions, getting answers, asking more questions, and trying to rapidly iterate to find the root cause of a situation.

To address some of these challenges, the third open source system I want to talk about here is Druid. Druid is a system that I work on, so that's why I'm probably going to plug it the most during this talk. Druid is a column-oriented data store, and what that means is that your data is stored in individually typed columns, so each column has a type associated with it, whether it's a string or a number and so on and so forth. Druid is very much designed for sub-second ad hoc queries, and it supports both exact and approximate algorithms; some of the approximate algorithms are there to make certain workloads complete quicker. Druid is designed to work with other streaming systems: it works directly with Kafka if you don't need to process your data before you visualize it, and it works with Samza and many other stream processors as well, so at the end of your stream processing job you can feed that data into Druid. Druid doesn't just deal with streams, it doesn't just deal with recent incoming data; it can also keep years of historical data. And there are a bunch of different companies that use Druid in production, both large and small.

Alright, moving on. At a very high-level glance of how Druid works, similar to the technical overviews of the other systems: Druid partitions data first based on time, and the time-partitioned shards are called segments. These segments are actually immutable, so Druid as a system is very much designed to deal with a constant stream of data. Druid maintains a global index of, basically, time interval to shards. Each query within Druid has a notion of time associated with it, so if you query for, say, a week's worth of data, that week's worth of data might correspond to a couple of different shards in Druid, and Druid maintains an index of how the different shards map to different time intervals. Within each shard the data is stored in a column-oriented fashion and then compressed, like some of the other column stores out there, and each shard also contains a bunch of different types of indexes for very fast filtering. So in the demo I showed, when you want to filter on whether or not editors are robots, or when you want to do very fast groupings: that demo is actually being powered by the stack I'm talking about right now.

One other nice property of Druid is that it supports different types of approximate algorithms. There are certain operations that you might want to do on an event stream that are very difficult to do exactly. For example, distinct counts: if you want to count the number of unique device IDs, or the number of unique users, having to store every single device ID or every single user ID is very expensive, especially if you have tens or hundreds of millions of users; it can get very expensive very quickly to store all that information just to do a distinct count. So there's a popular algorithm called HyperLogLog, which a lot of databases have nowadays, that allows for estimation of the distinct count without having to store every single unique ID. Druid also supports other approximation algorithms, like top-N, which is approximate ranking by a chosen measurement; it supports approximate histograms and quantiles; and it supports approximate set operations, so you can do union, intersection, difference, that sort of stuff, with Druid.
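As a sketch of what this looks like at query time, an approximate distinct count is just another aggregator in a Druid query body. The snippet below is a hypothetical native timeseries query, assuming a data source named wikipedia whose user column was ingested as a hyperUnique metric; it would be posted to Druid the same way as the fuller query example a bit further down.

```java
public class DistinctCountQuery {
    // Native Druid "timeseries" query: daily edit totals plus an approximate
    // (HyperLogLog-based) count of distinct users. All names are illustrative.
    static final String QUERY_JSON = "{"
        + "\"queryType\":\"timeseries\","
        + "\"dataSource\":\"wikipedia\","
        + "\"intervals\":[\"2017-02-21/2017-02-28\"],"
        + "\"granularity\":\"day\","
        + "\"aggregations\":["
        + "  {\"type\":\"longSum\",\"name\":\"edits\",\"fieldName\":\"count\"},"
        + "  {\"type\":\"hyperUnique\",\"name\":\"unique_users\",\"fieldName\":\"user_unique\"}"
        + "]}";
}
```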
The architecture of the system looks basically something like this. The first part is not that interesting for this talk; the streams are really what we're focused on, so imagine that the input here is the output of a stream processor. What Druid does with the stream is load it across a set of processes called indexers. These guys intake a stream, take that stream of data and create these Druid shards, which we call segments, and the segments are then loaded across a set of processes called historicals. The idea is that the indexers only deal with a very small window of incoming data; they might only deal with about an hour's worth of data, which they buffer up, creating these partitions and handing them off to the historicals, and the historicals deal with anything older than an hour, up to years of data.

[Audience question] Yeah, right. So, very simple systems. But, like in Samza for example, if I go back here, each of the tasks here can be containerized, and they can run in a system like Mesos or Kubernetes, so basically your processing logic all lives within a container. Does that make sense? [Audience question about column types in Druid] That has never been done. Most of the time the types are something like an integer versus a string versus something else; if it was a container format I would have to think a little bit more about whether that makes sense, but usually the types describe the schema of the data, what the underlying data is. If one of the types was something like MPEG, that might be okay, that's possible. [Audience follow-up] Yes, I mean, the workflows I described are a lot more around statistical analysis, so finding averages, means, aggregations, that sort of stuff, off of a stream of data.

But coming to the last bit of the architecture here: within Druid there's a third process called a broker, and these guys do query scatter-gather. Queries go to the broker, and the broker fans out queries either to the indexers or to the historicals; the indexers hold recent incoming data and the historicals hold the historical data. So the brokers have a merged view of both real-time and historical data, and that's what gets returned to a caller or to an application.

So when you put these three systems together, just as an overview, what you have is three separate open source systems that are handling three separate problems in dealing with streaming data. Data comes in, it goes to Kafka, and Kafka can deliver it to the processing piece, which is Samza. With Samza, at the very last stage of your data processing, you can tell Samza to send the data to Druid. Another way of doing it is that at the end of processing the data with Samza you write it back to Kafka, and then Druid can pull the data from Kafka as well; Druid is a system that's designed to stream data in from a lot of these other streaming systems.
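Once the data is in Druid, the queries behind a demo like the earlier one go through the broker. Here is a minimal sketch of issuing a native topN query over HTTP, along the lines of "top ten most-edited pages by non-bot users over a week"; the data source, column names, time interval, and broker address are all assumptions for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TopPagesQuery {
    public static void main(String[] args) throws Exception {
        // A native Druid "topN" query: top 10 most-edited pages over a week,
        // filtered to non-bot editors. Names and intervals are illustrative.
        String query = "{"
            + "\"queryType\":\"topN\","
            + "\"dataSource\":\"wikipedia\","
            + "\"intervals\":[\"2017-02-21/2017-02-28\"],"
            + "\"granularity\":\"all\","
            + "\"dimension\":\"page\","
            + "\"metric\":\"edits\","
            + "\"threshold\":10,"
            + "\"aggregations\":[{\"type\":\"longSum\",\"name\":\"edits\",\"fieldName\":\"count\"}],"
            + "\"filter\":{\"type\":\"selector\",\"dimension\":\"isRobot\",\"value\":\"false\"}"
            + "}";

        // Queries go to the broker, which scatters them to the indexers and
        // historicals and merges the results; 8082 is the broker's default port.
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:8082/druid/v2").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```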
So I thought it might be interesting to go through a real-world example to cover how this works in practice. For the data set here I actually did not use IoT data; I used data that's more commonly found in my field, which is advertising data, because half the companies in Silicon Valley basically make money through data that looks like this. With this data there are two data streams: there's an impression stream and there's a click stream. Impressions, for those of you who are familiar with advertising, are basically people viewing an ad, and clicks are people clicking an ad. What we want to do with this data, through the stack that we've built up, is create enhanced impressions, which means that for a given impression we want to know whether someone clicked on it or not.

There are a couple of different steps required to process the data and get it into a shape that's a little more usable. The first step is to join our impression stream with our click stream. With a traditional database you would be able to do this join at query time; however, because we're dealing with massive event streams, these might be billions or trillions of events, and trying to do that join at query time can be extremely expensive. So we're going to do this join at the stream processing level, and after that there are maybe a couple of additional steps to clean up the data and make it a little more user-friendly before we load it into our query system.

The idea is that this impression stream and this click stream are server log data generated at the servers. We first write this data into two separate topics in Kafka, so one topic is called impressions and the other topic is called clicks; this is coming from our ad servers, for example. We want to take this data, enhance it, make it into a single stream, and then make it visible in Druid. The raw data looks something like this: in your impression stream you have an ID of some ad and you have a publisher where that ad was published, and the stream is divided into a set of partitions. If you recall what I said about Kafka, Kafka partitions data across many different servers, so we have many different partitions that contain our impression stream and also our click stream. The events that we want to join are these two events: one is someone viewing an ad, and one is, some time later, someone clicking that ad.

This is where our stream processor comes in, and what our stream processor does is create a series of jobs to join these streams. The first stage is a task called the shuffle step, and the shuffle step reads data from the impressions and click streams. The idea of the shuffle step is to rework the data in the partitions so that the events we want to join end up in the same partition; this is what makes it possible for us to do the join later on. So we do a shuffle, after which the impression and the click that we want to join both sit in, say, partition 0, and at the end of the shuffle phase we write another stream to Kafka, and this is going to be the shuffle topic; so now we're at three topics. The purpose of the shuffle topic is that we're creating another job in Samza which is going to read from the shuffle topic, do something with it, and then write it back to Kafka to create a new topic, called the join topic.
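As a rough sketch of what that shuffle stage might look like as a Samza task, reusing the StreamTask pattern from earlier, and again assuming JSON-deserialized messages with made-up topic and field names, with the ad ID as the join key:

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/**
 * The "shuffle" stage: re-key impressions and clicks by ad ID so that the
 * events we want to join land in the same partition of the shuffle topic.
 */
public class ShuffleByAdIdTask implements StreamTask {
    private static final SystemStream SHUFFLE_TOPIC = new SystemStream("kafka", "shuffled");

    @Override
    @SuppressWarnings("unchecked")
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
        Object adId = event.get("adId");
        // Emit the event with the ad ID as the message key so that matching
        // impressions and clicks hash to the same partition downstream.
        collector.send(new OutgoingMessageEnvelope(SHUFFLE_TOPIC, adId, event));
    }
}
```

Because Kafka partitions messages by key, re-keying both streams on the ad ID means an impression and its matching click end up in the same partition of the shuffle topic, so the downstream join task only ever has to look within a single partition.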
What's happening is that, where previously we wanted to join these two events, we now do the actual join: we basically remove one of the events, in this case the click event, and add a new field to our impression event, and that field is "is_clicked". So the idea is that there's now a new field called is_clicked; if a join occurred we mark it as true, and if it did not occur we mark it as false, and the events that are joined are then written to Kafka under this new topic called joined. After that, we take this joined stream and we can do additional tasks, additional processing, on top of it. We might add additional business logic, like replacing nulls with a default value, or taking IDs and converting them into human-readable strings, and then that single stream is the thing that ultimately gets pushed into Druid, and the queries and applications all go through Druid.

So to summarize here: all the technologies I talked about are open source. Each of these projects has its own project webpage, you can just download any of them, and the three I talked about work out of the box with one another, so you can download these things, install them, play with them, load your own event streams, and try things out for yourself. What I hope you've gotten out of this talk is that managing IoT data requires dedicated components that are targeted toward solving very specific problems, and the three problems I talked about are data delivery, data processing, and a system for queries. I think Apache Kafka is a great system for event delivery, I think Apache Samza is a very useful system for stream processing, and I think Druid is one of the best-in-class systems for the interactive exploration of streams. Cool, that concludes my talk, and I'm happy to answer any questions at this time.

[Audience question about production use and machine learning] Yes, it is, definitely. I have actually seen applications of this stack, and stacks like it, run in production at a whole bunch of different types of companies, and I have seen applications with machine learning at some companies. I'm not sure if I can say their names or not, but one of the use cases I've heard about is behavioral analytics. Let's say you have customers using 20 different products and you want to start doing correlations across different data sources: understanding how one customer is using one product, another product, and a third product; are these all the same customers or not, and how are they using the different products my company offers? That's one application of machine learning that I've seen.

[Audience question, partly inaudible, about alerting: in the demo the analysis is mostly historical, looking at days of data, but in an IoT world you might want to drive alerts in near real time, for example show everything that happened in the last hour, or how many times a pattern occurred in the last 24 hours, and then go back and do the historical analysis.] Yeah, so with regard to that, and the application of machine learning to this: the stack I described is all for real-time data; the latency from when the event is produced to when it's explorable is on the order of milliseconds.
So it is all very low latency. The stack I described actually does both the streaming, real-time component and the historical component, and that alert piece is something I have seen before. How it usually works is that people try to automate spike detection or anomaly detection. There's always the basic thing of "alert me if X value exceeds Y threshold", but that's a lot less interesting than some of the applications I've seen, which take all the historical data and compare what's happening in the last, say, 90 seconds with that historical data, and if some factor is a significant amount above what's been seen historically, then immediately alert. There are interesting challenges there, because a lot of data patterns can be time-of-day dependent: the middle of the day can be very different from the end of the day, and the end of a quarter can be very different from the middle of a quarter. But that's why I think it's actually important to have this historical piece there, so you can look back at, say, the last five quarters and then say, is this actually an anomaly? I've seen pretty interesting work being done there in trying to automate anomaly detection.

[Audience question about devops] Yeah, it is being used for devops; both alerting and a lot of those sorts of solutions make a lot of sense in the devops world. One example of why I think the stack is pretty good there is that it can do very flexible ad hoc slice-and-dice analytics, and the application of that is: when you see a weird spike in your devops data, the most immediate question is what's causing that spike, and it might not be immediately obvious what's causing it until you break down your data and view it from a lot of different angles before you find the cause. Yeah, that's exactly it. Alright, cool, any other questions?

[Audience question about other example data sets] There is actually an example here from GitHub. There's US EPA data, which is your classic sensor data, and then I've been loading GitHub events as well, so you can start looking at, for example... these are all open source GitHub events, but let me see if my... oh man, my display got disconnected, let me try and plug it back in. Okay, so this is GitHub data I'm actually loading. What I'm showcasing right now is: over the last seven days, who are the top organizations that have been contributing to open source on GitHub? Microsoft and the Apache Software Foundation are very high up here. You can break this down by different types of repositories as well, so Kubernetes is here, and Google and Facebook are all pretty big contributors to GitHub. Taking an activity stream from GitHub and analyzing it and breaking it down is, I think, pretty interesting. So here you can see Microsoft over the last week, looking at VS Code, TypeScript, and so on; for Apache, the top Apache projects: here's Kafka, here's Spark, here's Flink. This one here is a stream processor, this is a batch and stream processor that does some other stuff, and this is your event delivery piece.

[Audience question about whether the UI or the processing logic is open source] The UI or the logic? So Samza is open source, it's under the Apache Software Foundation. The logic comes from the organization that's loading these things; they write it themselves. I think Samza and some of these other stream processors probably have some defaults that come out of the box, but oftentimes it's custom business logic that the organization writes. [Audience question about whether this demo runs on the same stack] Yeah, yes it is; but underneath, this is just an example UI I put on top.
Behind the scenes it's the same stack, and this shows some pretty interesting stuff that you can do with it. So if you want to look at the Apache Software Foundation, you can do that; if you want to look at, you know, Apache Kafka for example, and see who's contributing to Apache Kafka, I don't know who this person or the other people are, but this is an example of how you can start analyzing and slicing and dicing what people are doing on GitHub, and it's a good way of following what the popular open source projects are. Cool, any other questions? [Audience question about sharing the slides] I'm not sure I'll have a way to send them to people directly, but I will definitely upload them, and they have information on everything I talked about; it's all open source, you can download it, and as I mentioned, even hooking the components up to one another should all work out of the box. Cool, any other questions? Alright, thanks guys. [Applause]
Info
Channel: The Linux Foundation
Views: 8,818
Keywords: embedded linux conference, openiot summit, linux foundation, linux, embedded linux, internet of things
Id: 5nVEWee9fc4
Length: 41min 15sec (2475 seconds)
Published: Tue Feb 28 2017