Apache Druid 101

Captions
Okay, I'll just share my screen and get going. Hello everyone, and welcome to Apache Druid 101 at Data Con LA. I see the talks today were scheduled alphabetically; that was my sorting joke. I like to think I'm warming up the crowd, but maybe not.

What I want to do today is introduce Druid. There's a lot to know about Druid; it's a very deep subject, and there's just no way I could cover all of it in a 40-minute talk. But I can do enough to explain what Druid is, why I think it's so cool, and why you should go download it at druid.apache.org and get active in the Druid community, which is where I work. I work for Imply.

I come from a weird background; apparently no one in community work really comes from a community background. I worked in corporate IT, and I worked in healthcare IT. At one point in my healthcare IT career I decided I wanted to go get a master's degree in epidemiology, study infectious diseases, and do research, which actually came in handy years later, because now, all of a sudden, it's relevant again. Plus, my background in epidemiology gave me a grounding in a lot of the techniques that go into statistical analysis over big data. There was a juncture in my career in about the year 2000, which now makes me sound old, as if my long gray beard didn't convince you to begin with, when I worked at PC Mag. I ran half the test lab there, and we did fun stuff like testing programming languages, switches, routers, firewalls, things like that. That then led me into the world of marketing, startups, and evangelism, and I have a background in security as well. I always like to throw in one fun fact, which is that I cook competitive barbecue on the KCBS circuit. Yes, this is a thing: the Kansas City Barbeque Society. We're not pros, we're not even weekend warriors anymore, but it's fun and we go out and we cook.

Anyway, let's talk about what Druid is, what Druid is used for, why it was created, and then how it works, with a little on the internals.

This is an example of what we would call a modern app: the expectation someone might have when doing business intelligence, some kind of interactive OLAP, interactive analytical processing. Now, to be fair, this is a little video capture of some Wikipedia data that was streamed through Kafka into Druid. What you're seeing is a product that Imply makes called Pivot, which is our data visualization tool, but this is just an example; it's similar to the experience you would get out of Tableau or Looker or other tools, or even Superset. It's that kind of drag-and-drop, instant response on queries, in this case over hundreds of thousands of rows.

These were the initial intentions of Druid: to be high performance; to provide low-latency, sub-second queries (we talk about queries in milliseconds); to do sub-second real-time ingestion on streaming data as well as on batch data; and to provide analytics optimized for OLAP and for group-by queries, the kind of thing I like to call slice and dice.

So I took a look recently at all of the companies that are on the Apache Druid "Powered By" page, and I quickly keyed in their use cases.
This is how people are using Druid. Our origins are in digital advertising, which I'll get to in a minute, and then there are user events: anything that's clickstream. One of the points I'm going to get to is that things that are organized by timestamp are well suited for Druid use cases, and that is largely how the world is using Druid. APM, application performance monitoring, is a big one, and again, digital advertising requires a lot of real-time ingestion and very quick query processing.

So what are user events? (I always get worried I'm going to run out of time.) User events are clickstreams. Think about it like this: if you were trying to build your own Google Analytics, you might use something like Druid, ingest all the different activity streams, and then write queries and build dashboards off of Druid to access all those streams. Again, this is that slice and dice we talked about; you can all read the slide, so I'm not going to read it to you. One thing Druid does really well is groupings and building top-N lists, because these are built right into the Druid query engine.

Then there are network flows. I personally think that open source network flow analysis is fascinating, and that's because I've looked at network tools for a really long time, and it now seems remarkably easy. We've reached the point where you can take a stream off a network device to read your NetFlow, use Kafka, and ingest it into Druid, and there are actually a lot of people ingesting various flows into Druid for analysis. There are a bunch of companies in the security space doing this, and again, I find this super interesting. I wrote a little demo very quickly (wow, it seems so long ago; it was right before COVID started, on the last trip I went on) where you could take your home router and, after connecting it through Kafka in the cloud and then to Druid, immediately see what traffic was on it.

Then, in digital advertising, Druid is used quite often. Typically you're talking about streaming data sets, and very fast data sets, so Druid is used a lot in situations where bidding takes place and you need to reconcile multiple feeds very quickly. Digital advertising is a great fit for that: taking in feeds from all the different markets, providing them to your customers, and then being able to run reports. One thing Druid gives you is that your streaming data is queryable as it comes in, and you can also write the streaming data over time to deep storage, where it's queryable via historicals. So when you want to run reports, you get excellent performance combining streaming data and historical data in a single report, and that would be something like: show me what happened with my ads today, and two weeks ago, and six months ago.
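To make that concrete, here is a minimal sketch of what such a report query could look like in Druid SQL, posted over HTTP. The datasource name (ad_events) and its columns (campaign, clicks, user_id) are hypothetical stand-ins; the /druid/v2/sql endpoint, the __time column, and the SQL functions shown are standard Druid.

```python
# Hypothetical "ads today vs. two weeks ago" report against a Druid cluster.
# Assumes a datasource "ad_events" with a "campaign" dimension, a "clicks"
# metric, and a "user_id" column; port 8888 is the quickstart router.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

query = """
SELECT
  campaign,
  -- filtered aggregations slice the same scan into separate time windows
  SUM(clicks) FILTER (WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY)  AS clicks_today,
  SUM(clicks) FILTER (WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' DAY
                        AND __time < CURRENT_TIMESTAMP - INTERVAL '14' DAY)  AS clicks_two_weeks_ago,
  APPROX_COUNT_DISTINCT(user_id)                                             AS approx_unique_users
FROM ad_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '6' MONTH
GROUP BY campaign
ORDER BY clicks_today DESC
LIMIT 10
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
for row in resp.json():
    print(row)
```

The GROUP BY ... ORDER BY ... LIMIT shape at the end is exactly the kind of top-N query I mentioned being built into the engine, and the outer __time filter is what lets Druid narrow the work down to only the data in that six-month window.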
Okay, so let's look at the need for Druid: where did Druid come from? In the beginning (and this goes back to when I was learning how to build things), you would have all your different data sources and you would ETL them into something called a data warehouse, which may be a commercial product, and then you would expose the data in the warehouse through various interfaces to your analytics, reporting, and BI tools. That ended up not really working after a while, because there was just too much data, and people couldn't ETL it fast enough to get it into their data warehouses. To solve that problem, we started just taking the data from the different sources and dumping it into a data lake, then processing it and putting it into a data warehouse, where it could be accessed by analytics, reporting, and BI tools.

So this is where Druid comes in, in what is sort of a hybrid environment. You have your original data sources, and then you have a message bus, like Kafka or Kinesis; you can have your ETL process as well, and that all goes into whatever you're already using as your data warehouse. But what a lot of people have started to see is that data warehouses aren't fast enough for all of their analytics use cases. They're fast enough for some of them; for your monthly reports, who cares whether that runs in two seconds. But when you're talking about building dashboards over recent streaming data and providing those dashboards to users at scale, for example in the clickstream analytics role or the advertising use case, then you need the speed of something like Druid.

It's not for everything, and I think this is important, because once you understand how Druid fits in with everything else, you can understand how best to use it. When you need sub-second query response, when your data can be pre-aggregated, when you need high-concurrency reads, very fast scans, and that kind of slice-and-dice OLAP performance out of a dashboard, that's the kind of thing Druid is used for. I'll tell you, though, Druid is a bit of a hog when it comes to memory and CPU, and you can already tell that it's not right for everything you're using your data warehouse for. So I am in no way saying get rid of your data warehouse and move to Druid because it's faster; it's just not right for that, and you'd be wasting a lot of money on RAM and CPU. What Druid excels at is running against timestamps.

Let's see why. Druid was created at Metamarkets, which was part of that digital advertising world, and the timestamping was absolutely critical there. It was important to know everything about the ads that were placed; there's a whole bidding process, it all happens in fractions of a second, and there was a lot of volume and a lot of advertisers. They just couldn't find anything that would meet their requirements for ultra-fast response times over timestamped data. We're talking about millions of events per second, ingested both in batch and streaming, with high-dimensionality and high-cardinality data, which in that case was semi-structured. What they wanted to enable was fluid, drill-down, slice-and-dice reporting with thousands of users. And it's actually pretty amazing, because since Druid was open sourced, a whole bunch of companies have gone on to build some very high-scale, high-performance environments on it.

So how was Druid designed? What is Druid?
Druid takes the best of the search platform: this idea of real-time ingestion, a flexible schema, and the ability to do full-text search. Again, it's really not a search platform, but it's built on some components for searching and filtering, and that came from what we learned in the log search world. Druid is also time oriented, like a time series database. A timestamp is, well, not strictly necessary; you could use Druid without a timestamp, but it's not going to provide an optimal experience. Everything is built around ingesting and sorting data around the timestamp, and there are time functions as well. Like I said, though, if you just have search or time series, there's still a gap, and the gap is the ability to do very fast ingestion and to actually operate on the data very quickly as well.

So Druid basically combines the features from log search, time series, and BI software, and turns them into a column-oriented database that uses the fact that it's a time-series, column-oriented database to support fast scans and aggregations, as well as high concurrency. Druid is built as a whole bunch of independent processes. I remember reading the original Druid paper before I got involved in any of this and being very impressed that, roughly ten years ago, it had been thought out to separate everything into individual processes, so that every process can scale independently. Druid is capable of continuous real-time ingestion, and it can do that in parallel. We also recently added the ability to use SQL to query Druid, and most installations will see sub-second to a few-second query times. That's for real; it's super fast when the data is properly organized and ingested into Druid.

Druid builds indices for quick filtering at ingestion, and it does time-based partitioning first: it partitions data by time automatically, and then you can add secondary partitioning based on how you know your data. Then, when you're searching, you can filter on time, so you're only searching across the time periods you need. This makes it lightning fast; if I run a SQL query that searches for a string, I can limit that query to search for the string within a time period, and it's much, much faster.

The other thing Druid does is use approximate algorithms, things like approximate count-distinct and approximate ranking, and it can do these on the fly. It can also compute approximate histograms and quantiles, and this is really good on streaming data as well. Druid is also able to summarize data at ingestion, which I think is pretty cool: it can pre-aggregate your data. Let's say you're getting data in every second, but you know that when you run your queries, you only ever query by the minute. I know it's a simplistic example, but you can set your ingestion spec so that on ingest it aggregates and saves one row per minute. And what a lot of people will actually do is ingest in real time at the most granular setting they can, then after, say, a month, go from that granularity to a minute, and after a quarter, go to a day. This changing of granularity and aggregation over time is one of Druid's strengths.
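Here's a rough sketch of that per-second to per-minute idea as a Kafka ingestion (supervisor) spec, submitted from Python. The datasource, topic, and column names are hypothetical; the granularitySpec with rollup and queryGranularity, and the Overlord's supervisor endpoint, are standard Druid.

```python
# Hypothetical Kafka supervisor spec: events arrive per second, but rollup
# truncates timestamps to the minute and pre-aggregates matching rows.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",  # hypothetical datasource name
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["page", "country"]},
            # Rows sharing the same truncated timestamp and dimension values
            # are combined into one stored row using these aggregators.
            "metricsSpec": [
                {"type": "count", "name": "events"},
                {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",  # one time chunk per hour
                "queryGranularity": "MINUTE",  # truncate timestamps to the minute
                "rollup": True,
            },
        },
        "ioConfig": {
            "topic": "clicks",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit to the Overlord's supervisor API (the quickstart router proxies it).
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor", json=supervisor_spec
)
resp.raise_for_status()
print(resp.json())
```

Later on, compaction tasks can re-process older segments at a coarser queryGranularity, which is how the month-to-minute, quarter-to-day pattern I just described is typically set up.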
You don't lose anything as you compact over time; we like to call it compacting and rolling up.

Okay, so who's using Druid? (I'm doing a time check; I'm keeping going.) This is where Druid is in production, and this isn't counting Imply customers. Some of these are actually really pretty cool. The one I think is really neat is Walmart, which is able to take inventory, pricing, and sales information from all of their stores in North America and run reports in far less time than they could before, so they can do more dynamic pricing and more intelligent inventorying. That's a big problem of scale. And Zscaler is another one of those security use cases.

Let's see. That original Druid cluster has grown over time; it has more than 500 terabytes of segments, and that was back when these figures were gathered. And Netflix: there was a blog post that came out earlier this year on how Netflix uses Druid to do their performance monitoring, and this information is from that. It's a great example of how you can use Druid. They're talking about ingesting a hundred billion plus rows a day; they have roll-ups, they do their retention, they have hundreds of servers, and they get sub-second to a few-second responses, and these are on dashboards. As you're probably aware, dashboards are not one query; they're multiple queries. And they use a combination of streaming and batch ingestion; again, it's that use case of streaming to see what's happening today and batch to have something for comparison.

As far as Druid speed goes, Druid is frequently compared to Presto and Hive, because they're also open source solutions and Apache projects, and as you can see, Druid is much faster than Hive, and Druid is also much faster than Presto. This comparison was done by a third party, not someone who had anything to do with Druid; it was done at a university in Europe, so you've got to think it's realistic, responsibly done, and believable.

Now, a large part of my job at Imply is to performance test Druid, and I've been running tests using the Star Schema Benchmark, which is built on TPC-H. These numbers are hot off the press; in fact, they may not even be published yet. The blue bars are Druid 0.18 on a 600-million-row data set, and these are the standard SSB queries; you can google Star Schema Benchmark and download it. We've made improvements to our query planning and to vectorization, and as you can see, wherever a blue bar was tall, it has become short, which is pretty awesome. Performance is something we take very seriously.
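Since I brought up the vectorization work: you can experiment with it yourself, because the vectorized engine can be toggled per query through the query context. Here's a minimal sketch; lineorder is SSB's fact table (assuming you've loaded it into Druid with the order date as the primary timestamp), and the "vectorize" context key is a real Druid option, where "force" makes the query fail instead of silently falling back if it can't be vectorized.

```python
# Run a Druid SQL query with the vectorized engine forced on, which is a
# handy way to check whether a benchmark query is actually vectorizing.
import requests

payload = {
    "query": """
        SELECT COUNT(*) AS row_count
        FROM lineorder
        WHERE __time >= TIMESTAMP '1993-01-01 00:00:00'
    """,
    "context": {"vectorize": "force"},  # error out if vectorization is impossible
}

resp = requests.post("http://localhost:8888/druid/v2/sql", json=payload)
resp.raise_for_status()
print(resp.json())
```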
Okay, now one of the questions that comes up is: is Druid right for me? As we discussed, it's for when you need a fast query response over a large data set. Many times it's a streaming data use case; in fact, Druid adoption is largely driven by streaming adoption, with low-latency data ingestion that then makes data available for interactive queries, and the ability to query streaming, real-time data together with historical data. The data should be infrequently updated, hardly ever; in fact, it should be append-only. You've got to have a timestamp, and the data should be denormalized; you can do joins, but you can't really do big joins.

So where does Druid fit? I've kind of talked through this already, so I'll go through it quickly. Basically, the way we see people use Druid is taking the raw data, staging it, and then providing it to Druid. A lot of what I've looked at has been Kafka, Kafka and Spark, KSQL, things like that, feeding into Druid as an analytics database, and then building the application on top or exposing the data to a BI tool through Druid.

Now, in my job I do tech support too. I answer questions that come in over the Google Groups and over the ASF #druid Slack channel, plus things that come in on Stack Overflow, and I've been thinking about this, so I'd like to present a plan. The idea is called "think like a Druid," and it's based on the questions people actually ask. One of the secrets to success, or failure, with Druid is understanding what you're going to do from the very beginning. The idea is: let's not yet talk about servers or processes; just think about ingestion, data management, and query. The data comes in, you do something with the data over time, and you query it to take intelligence out. These three have to be aligned in your mind before you start work. Actually, they should probably be aligned in your mind before you start any data and analytics project, but with Druid particularly, because it just won't work well if they're not aligned. By that I mean, for example, aligning the granularity of your ingestion spec with the granularity of your queries. These are all the questions that come in for each group; I'm not going to give a test later, but these are the ideas. Understand your ingestion, then the way you store and optimize data, and then the queries, the way you get data out.

Now let's think not just about the functions but about the servers and processes. There's really no such thing as a "Druid server"; it's really processes that are grouped together, and they're usually grouped together for management reasons. The leader is called the Overlord (one of my colleagues thinks that name is hysterical). The Overlord takes tasks, splits the work up into discrete units, and hands it off to the MiddleManagers, or the Indexers; these are over on the right. They take the data, analyze it, build all those aggregations I talked about, build the indices, partition the data, encode dictionary files, and write out a whole bunch of things that we in the Druid world call segments. Then we take those and put them into deep storage, where they can get read back in by the Historicals to be available for queries later.

Everything is coordinated via ZooKeeper, so that the cluster, and each process in the cluster, knows where every other process is; there's also a fallback where you can do this over HTTP in case you lose ZooKeeper. And there is another external dependency, which is metadata storage. That has to go into a relational database, like MySQL or PostgreSQL, and that is where the maps of where the segments are and what's in each segment go, along with all the rules for querying and for arranging the data into segments.
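By the way, one nice way to see that segment and time-partitioning story for yourself is through Druid's SQL system tables. This is a minimal sketch; sys.segments is a real metadata table, and "wikipedia" here is just the datasource from the Wikipedia quickstart tutorial.

```python
# List the time-chunk segments Druid has created for one datasource,
# straight from the sys.segments metadata table.
import requests

query = """
SELECT "start", "end", num_rows, size, is_available
FROM sys.segments
WHERE datasource = 'wikipedia'
ORDER BY "start"
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": query})
resp.raise_for_status()
for segment in resp.json():
    print(segment)
```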
Okay. Yes, before we jump into this next segment, we did have a question, so sure, let's do a question. Oh, and before we do, because I know we're about to run out of time: for everyone, the slides I didn't get to, you can download them; there's a PDF in Whova.

So, Curtis Bennett asked how many servers would be required to manage, say, a petabyte of data. Geez, it's going to vary. What happens is the petabyte of data will get pushed around to all the different servers and built into segments. Thinking about what I run, I'm just guessing, but somewhere in the 20 to 30 server range, depending on how they're sized. I know there are companies that are ingesting multiple terabytes a day, and they're only running around 10 servers. It depends on how you build out your servers, because Druid is optimized more around CPU and RAM than around storage. That's why I'm having trouble answering; the question would really be more like, you have a petabyte of data and you need to query it in X number of milliseconds.

"Okay, sounds great. Just so you know, we have about three minutes remaining."

Okay, maybe I'll just fly through these really quickly, and I apologize; of course, this is one of those things where, when I timed myself practicing, it all went smoothly and fit into my allotted time. So here's the Druid console. Actually, the most important thing to jump to here would be to wrap up: go to druid.apache.org. You can email me at matt.sarrel@imply.io; if you join the ASF #druid Slack channel, I'm @matt; and you can follow us at @druidio on Twitter. But really, everything is going to be on that Apache Druid page under Community. Download the quickstart and play with it; run through that Wikipedia tutorial that I showed you, start playing with it, and get an idea of the power of Druid. Thank you.

"Thank you, sir. Thanks again for that great presentation; I certainly learned a lot today. Thanks for making the time to join us here at Data Con LA, and like you said, if anyone has any questions, they can reach you on Twitter, the website, Slack, LinkedIn I'm pretty sure. And download the presentation."

Yes, so you can see the slides that I would have just rushed through anyway. Thanks a lot for having me. Are there any more questions?

"That was it; that was the one question we had today."

Okay. Cheers, thanks everyone. All right, thank you.
Info
Channel: Data Con LA
Views: 7,704
Keywords: Data Con LA, DCLA, Data Con LA 2020
Id: wICuJkfcdmM
Length: 38min 31sec (2311 seconds)
Published: Wed Nov 11 2020