"Druid: Powering Interactive Data Applications at Scale" by Fangjin Yang

Video Statistics and Information

Captions
Good afternoon everyone, and thank you all for coming to my talk. Today I'm going to be talking about an open source project called Druid, and actually, just out of curiosity, how many people have heard of this project before? Okay, wow, quite a few, that's great. If you haven't heard of it before, Druid is an Apache 2 licensed, column-oriented, distributed database, and my name is Fangjin; I am one of the original committers behind Druid.

What we're going to talk about today is some of the history and motivation behind this project, like why create yet another big data open-source project. I want to showcase a demo of an application that was built on top of this project, just to demonstrate the value of the project and the use cases it was really designed for. I want to talk about the problems that Druid solves and alternative architectures you can consider for solving some of these same problems. And then finally, about half this talk will be a pretty deep dive into the architecture of Druid: how data is stored and how the various components work together.

So Druid was actually started in 2011, and the project was initially started for use cases in online advertising. Specifically, what we were trying to do when we first started the project was to power an interactive analytics product for the advertising space, and the purpose of this product was to be able to look at impressions data and ad bids data and be able to slice and dice and drill into the various events that were occurring. The main requirements of the project at that time, and some of these requirements definitely hold true today, were that we needed a very scalable backend. If you're familiar with online advertising at all, that space can generate a tremendous volume of data; we're talking about hundreds of billions of events, or even trillions of events, per day, so petabytes of data per day. And when you're trying to build some sort of user-facing application, you also have to consider multi-tenancy, and when I say multi-tenancy here, what I'm referring to is potentially thousands of users using the same application, hitting the same backend at the same time. So a lot of these earlier requirements were around scale. We also wanted the backend powering this first product to be very interactive, and what I mean by interactive is really just low-latency queries, so that as you're using the application you get the feeling that it is updating or reacting to the actions you're performing. And finally there was this requirement of real time, which is a buzzword that's basically meaningless nowadays, but in the context of this talk what I mean by real time is very low-latency data ingestion: the ability to surface insights on events immediately after they occur.

So this project, having started in 2011, was initially open sourced in late 2012. The first license that was chosen was GPL, and if you ever do an open source project, GPL is not a great choice of license; everyone has problems with it. The project grew slowly, and it was sort of part-time development at this advertising company until about early 2014. Around that time there was a sufficient community and there was this idea of wanting to put more resources behind the project, so the license was changed to Apache v2 in early 2015, and everybody loves Apache v2 because you can do anything you want with the code. Since then the community has grown very, very rapidly.
Nowadays I think we're easily at over 150 contributors from over a hundred different organizations. So Druid has grown a lot since its very early days in online advertising. It's still used heavily in that space, but the system is really designed for any type of event data, so it has also found production use cases for network traffic data, security data, financial data, gaming, operations, really any data that is generated by users or generated by systems.

The main use case of Druid, how people tend to use it in the wild, is to power some sort of user-facing analytic application, and people use Druid to unify both historical and real-time events: what is happening right now, but also what happened potentially years in the past. The types of queries that people tend to make against the system are business intelligence or OLAP queries, and in these types of queries what you're mostly doing is slicing and dicing, drilling in, looking at some filtered view of data, and performing a lot of aggregations. Examples of this are commonly found in behavioral analysis: measuring distinct counts, doing retention analysis, building funnels, A/B testing, and so on and so forth. A lot of people also use the system to do exploratory analytics, trying to understand the root cause of a particular spike in their data and just drilling further and deeper into it. We have a powered-by page that people can openly update; here are some of the companies that have willingly disclosed how they're using Druid, and it's a variety of companies, both large and small, that use the system in production today.

Okay, so before I jump into too many technical details, I thought it'd be useful to showcase a demo of some of the use cases of this project and what people typically do with the data once it's loaded into Druid. This is a live demo, so it's definitely maybe going to work. We have a public demo site, and within the site there's a bunch of different data sources that are just public API data we've collected. The data source I'm going to demonstrate right now is edits on Wikipedia, which is a very common data set you see in the big data space. With edits on Wikipedia, how the events look is that there's a set of attributes related to each edit, and these attributes might be the page being edited, the user doing the edit, and whether or not it's a robot. There are also various metrics associated with the edit, like the number of edits that occurred, the number of characters added, or the number of characters deleted.

This is a pretty simple drag-and-drop UI, and if we drag time into the center here, we see how edits are trending across time. Maybe let's just pick a larger time range: over the last 7 days there were about 3.4 million edits on Wikipedia. Okay, so let's filter a little bit into this data. There are a lot of things you can actually edit on Wikipedia; let's just filter on articles, because that's the thing most people tend to care about, and here we see about 2.8 million edits that have occurred, with about 1.1 billion characters added and about 26 million characters deleted. So people tend to add a lot more characters, a lot more information, to Wikipedia than they delete. And maybe let's just pick some arbitrary range of data in here, say between September 11th and 14th.
It doesn't really matter what this range is, and in this range we have about 1.2 million edits. So let's look at some other properties of this data. Maybe we want to look at this page dimension, this page attribute, so we drag that in, and what we see are the top pages on Wikipedia being edited for this particular time range. Up here we have Great Britain at the 2016 Summer Paralympics, a typhoon season page, something I'm guessing is probably a movie called Backlash, and so on and so forth.

But maybe we want to do even deeper exploration. Maybe we want to know, for these top pages, who are the top users editing each of them. So we can do basically another operation; there are two kinds of operation bars up here, a filter and a split. A filter is really just a WHERE clause in SQL, so it's narrowing the field of view of your data, and a split is equivalent to a GROUP BY. So here we're doing a multi-dimensional group by on page and user, and the visualization we render shows, for each of the top pages being edited, the top users doing those edits. Maybe we want to filter on Great Britain at the 2016 Paralympics, and let's just get rid of user here, and maybe we want to transition to another visualization, like time, and here we see how that particular page is being edited. Maybe there's a particular spike that we want to look at, so we can zoom in on that spike and do further and deeper exploration. There are also other attributes you can view, and different visualizations get generated. For example, if you want to look at the length of comments of people doing edits on Wikipedia, this generates a bar chart, and this bar chart is fully explorable as well, in that you can filter on comment lengths between whatever particular ranges you want.

So the purpose I'm trying to showcase with this demo, the purpose of Druid that's powering this UI, is the ability to arbitrarily break down and explore a data set. We're doing a lot of complex computations, we're slicing and dicing, we're looking at different views of our data, and all of this is being done very, very interactively. This UI is actually completely open source as well; it's something called Pivot, and the idea is that there are different visualizations, and as you slice and dice your data you can view it through those different visualizations.

Yeah, so this use case of doing aggregations, of filtering and grouping your data, is what Druid was really designed for. The type of data we're dealing with is event data, and if you look at event data, it tends to be pretty similar regardless of whatever vertical you're in. There is usually a timestamp of when the event occurred; there's a set of dimensions or attributes, and these are the things that you want to filter on or group on; and then there's a set of measures or metrics, and these are the things that you generally want to aggregate or do some sort of computation over. The types of queries that Druid is designed to solve, the types of queries I just showcased in that demo, are broadly categorized as business intelligence or OLAP queries. Examples of these queries might include: how much revenue did my product generate last quarter in some particular city? Or, if you're running a web server, how many of the users who visited my site last week also returned this week, i.e. retention analysis?
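To make the filter and split terminology concrete, here is a minimal sketch in plain Python (toy events and field names of my own invention, not Druid's actual API) of the two operations the UI is performing: a filter narrows the rows like a WHERE clause, and a split groups the remaining rows and aggregates a metric per group like a GROUP BY.

```python
from collections import defaultdict

# Hypothetical Wikipedia-edit events: a timestamp, some dimensions, and metrics.
events = [
    {"time": "2016-09-12T01:00", "page": "Great Britain at the 2016 Summer Paralympics",
     "user": "alice", "is_robot": False, "added": 120, "deleted": 3},
    {"time": "2016-09-12T01:05", "page": "Great Britain at the 2016 Summer Paralympics",
     "user": "bob", "is_robot": False, "added": 55, "deleted": 10},
    {"time": "2016-09-12T02:00", "page": "Pacific typhoon season",
     "user": "alice", "is_robot": True, "added": 900, "deleted": 0},
]

# "Filter" = a WHERE clause: keep only human edits.
filtered = [e for e in events if not e["is_robot"]]

# "Split" = a GROUP BY: group by page and aggregate a metric per group.
added_by_page = defaultdict(int)
for e in filtered:
    added_by_page[e["page"]] += e["added"]

print(dict(added_by_page))
# {'Great Britain at the 2016 Summer Paralympics': 175}
```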
In all these OLAP queries, what you're not doing is dumping the entire data set, and you're not really looking at individual events, because you might have billions of events and a single event is not that interesting. What you do instead is filter the data, or group it and aggregate it, and when you start doing these types of operations on event data you get a much smaller result set that is easily consumable by the user. In our demo here, as I was slicing and dicing the data, the amount of data that we were actually visualizing was not a lot, but the total data set could be gigantic. So in the types of queries that we do, your ultimate result set is typically much smaller than your input set. Druid is not a data processing system where your result set might be the same size as your total data set; Druid is very much a query layer, a serving layer, something that's designed to condense the amount of data that you ultimately display or get insights from.

As you no doubt know, the whole data space is very complex and complicated, and there are many different types of solutions out there. Even for these OLAP queries, for these business intelligence queries, there are many different ways you can solve the same problem and many different open-source systems you can use to solve them. For example, you might consider using a relational database such as MySQL or Postgres, you might consider using a key-value store such as HBase or Cassandra, and now there are also general compute engines like Hadoop and Spark. I just want to quickly go over the various ways these different technologies approach this OLAP problem.

So first let's look at the relational database. Relational databases have been around since forever, and everyone has a pretty good idea of how they work. There's a pretty standard way that people typically use a relational database as a data warehouse, and what people often do is model their event data as a star schema. The idea behind a star schema is that you have this fact table in the middle, which contains the primary information about your event data; it also includes the measures and metrics of those events, the numbers that you eventually want to add up. Then you have dimension tables to the side. So you might have a store ID in your central fact table, and you have a store dimension table which has additional attributes about your store, or you have a region and there are additional attributes about your region, and a lot of your queries involve joins between a dimension table and the central fact table. I think this is really becoming outdated nowadays; I don't see many companies using this model anymore. It was very popular maybe about ten years ago, and the reason it's becoming less and less popular is the performance implications as data scales: this way of modeling data with a relational database tends to be very, very slow. Before we actually started working on Druid, we first tried using a relational database, Postgres at that time, to do some of these aggregations and slice-and-dice queries, and we abandoned it when the performance just wasn't good enough.
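As a rough illustration of the star-schema pattern described above (illustrative tables and names only, not any specific product's schema): a central fact table holds the measures plus foreign keys, and a typical query joins it against small dimension tables before aggregating.

```python
from collections import defaultdict

# Fact table: one row per sale event, with a foreign key and a measure.
fact_sales = [
    {"store_id": 1, "product": "widget", "revenue": 10.0},
    {"store_id": 1, "product": "gadget", "revenue": 25.0},
    {"store_id": 2, "product": "widget", "revenue": 7.5},
]

# Dimension table: extra attributes about each store.
dim_store = {
    1: {"city": "St. Louis", "region": "Midwest"},
    2: {"city": "Portland", "region": "West"},
}

# A typical OLAP query, "revenue by region", requires the join.
revenue_by_region = defaultdict(float)
for row in fact_sales:
    region = dim_store[row["store_id"]]["region"]
    revenue_by_region[region] += row["revenue"]

print(dict(revenue_by_region))  # {'Midwest': 35.0, 'West': 7.5}
```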
The second set of solutions we looked at before we started Druid were key-value stores, and key-value stores even to this day remain very popular options for some of these types of queries: HBase, Cassandra, and various time-series databases like OpenTSDB and InfluxDB that are kind of based on this model. What's nice about key-value stores is that you get very fast write performance and very fast lookups: because you have a key-value storage model, you can get results back from your queries very, very quickly. However, when you start using key-value stores for some of the analytics we've been talking about, for some of these OLAP queries, there are limitations.

There are roughly two ways that people generally use a key-value store for this. The first way, which was very popular a few years ago, is through pre-computation, and the idea is to try to precompute every single query you think your users can make, stick the results in your key-value store as a gigantic cache, and get answers back really, really quickly. So in this example here, this is your raw data set: it has a timestamp, it has gender and age as your dimensions, and it has this revenue, which is the number you want to add up, the thing you ultimately care about. When you precompute results, how the results end up getting stored is that your primary key might be this timestamp, let's say timestamp 1, and your revenue is the sum of all three rows. The idea is that if you query for timestamp equals 1, you get your result back immediately, and if you want to make a query for timestamp 1 and gender equals female, then your primary key is timestamp 1 plus gender female, and the revenue is the sum of those two rows. So if you want to support any ad-hoc query, every arbitrary combination of queries that can be made, you basically do this for every permutation of filters, every permutation of queries out there. This works if your data set is very small and you know that the total precomputed query set is reasonable. Where it starts falling apart is as you get to more and more complex raw data sets: as you add more and more columns to your data, your total query space grows exponentially in size, and it's very easy to start hitting scaling and storage problems with this model. What is nice about this model is that once you precompute all the results, you're basically just hitting a cache, doing an O(1) lookup to get values back, and that's very, very fast. We actually tried doing this initially, before we started building Druid, and as our raw data set got more complex it became intractable to try to store every single precomputed value.

Another way that key-value stores are used to do some of these aggregations, some of these OLAP queries, is basically to do range scans. The idea here is that your primary key, instead of being a pre-computation, just becomes a hash of your timestamp and your dimensions, and your value is the number you want to aggregate. This model works, and I see a lot of people still doing this with Cassandra and HBase and other key-value stores, but it also has performance limitations. When you do range scans, you basically have a separation between your computation and your storage, and whenever you issue a query you have to shuffle all your data from where it's stored into some intermediate computation buffer, where the numbers are crunched and the values are then returned to the user. This shuffling of data, this scan, can be very slow, and the reason is that it's often difficult to create intelligent indexes for this primary key in a key-value store.
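Here is a toy sketch of the pre-computation approach (a plain Python dict standing in for a key-value store, with made-up data): every combination of dimension filters is pre-aggregated and stored under its own key, which is why the key space grows exponentially with the number of dimensions.

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"ts": 1, "gender": "male",   "age": "20-30", "revenue": 10},
    {"ts": 1, "gender": "female", "age": "20-30", "revenue": 20},
    {"ts": 1, "gender": "female", "age": "30-40", "revenue": 30},
]
dimensions = ["gender", "age"]

# Precompute an aggregate for every subset of dimensions (2^N subsets),
# keyed by timestamp plus the chosen dimension values.
cache = defaultdict(int)
for row in rows:
    for n in range(len(dimensions) + 1):
        for dims in combinations(dimensions, n):
            key = (row["ts"],) + tuple((d, row[d]) for d in dims)
            cache[key] += row["revenue"]

# Lookups are now O(1):
print(cache[(1,)])                       # 60 -> all revenue at ts=1
print(cache[(1, ("gender", "female"))])  # 50 -> female revenue at ts=1
# But with D dimensions, each raw row fans out into 2^D cache entries.
```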
Okay, so let's look at another solution, which is a general compute engine: using a system like Spark or Hadoop and combining that with a SQL-on-Hadoop solution. There are many different SQL-on-Hadoop solutions out there: there's Impala, there's Hive, there's Spark SQL, Apache Drill, Presto, and many more. What's really nice about the SQL-on-Hadoop solutions is that they're very flexible; they typically support full SQL. Their limitation, once again, is performance. How a lot of SQL-on-Hadoop engines work, for example how Presto works, is that you have some intermediate compute engine again, in this case the Presto nodes, and you pull data from wherever it's stored into this intermediate compute buffer, you crunch your numbers, and then you return your results. This separation, once again, of your compute and your storage can have performance overhead.

I keep talking about performance, and the reason performance is very important in my mind is that when you start doing interactive analysis of data, there's a big difference between a few hundred milliseconds and a few seconds, or even a few minutes, because when your queries start taking 30 seconds or a minute, your application no longer feels interactive, and it's very easy to lose a user's attention. This is especially important as you're drilling in and exploring data, because that's a very iterative process where you ask some questions, get answers, and use those answers to ask more questions, gradually narrowing down until you reach your final result. If your queries take a few minutes to return, you've lost the user by that point; it's hard to keep your train of thought when everything takes minutes to return.

The last class of systems I want to talk about is column stores. I think over the last year or two, column stores have been accepted as the proper way of doing aggregations and doing some of these OLAP analytics on data, and the reason for that is how they store data: as opposed to a row store, which is what most relational databases are, you store data individually as columns, and only the columns that actually pertain to your query get used as part of that query. So if you issue a query filtered on product and price, and you want to calculate price for a certain date, you don't need to read the store column and you don't read the customer column, and that's just a lot more efficient for real-world queries and real-world data. Another nice thing about column stores is that when you store columns individually, you can use different compression algorithms for different types of columns, different encodings for different types of columns, and create different indexes for different types of columns, and these are all things that Druid does.
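A minimal sketch of the column-oriented idea (illustrative data only): each column is stored as its own array, so a query that filters on product and sums price never has to read the other columns, and each column can be compressed and indexed on its own.

```python
# Row store: every query has to touch whole rows.
rows = [
    {"date": "2016-09-12", "store": "A", "customer": "c1", "product": "widget", "price": 10.0},
    {"date": "2016-09-12", "store": "B", "customer": "c2", "product": "gadget", "price": 25.0},
    {"date": "2016-09-13", "store": "A", "customer": "c3", "product": "widget", "price": 12.0},
]

# Column store: the same data as parallel arrays, one per column.
columns = {
    "date":     ["2016-09-12", "2016-09-12", "2016-09-13"],
    "store":    ["A", "B", "A"],
    "customer": ["c1", "c2", "c3"],
    "product":  ["widget", "gadget", "widget"],
    "price":    [10.0, 25.0, 12.0],
}

# "Total widget revenue" only needs the product and price columns;
# the date, store, and customer columns are never read.
total = sum(
    price
    for product, price in zip(columns["product"], columns["price"])
    if product == "widget"
)
print(total)  # 22.0
```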
So getting into Druid: Druid was really created out of necessity back in the day, because, even today, there are very few good open source column stores out there. For us, when we were trying all these different systems, we struggled a lot with them and decided to build an open-source column store. Druid's key properties are that it has a custom column format, it's very optimized for event data and these OLAP/BI queries, it's very good at supporting a lot of concurrent reads, it supports low-latency streaming data ingestion, and it supports extremely fast filters. When you take these properties together, I think you have a system that's very good for powering user-facing analytic applications.

Okay, so let's talk about how it actually works. Let's say we have some raw data here, the kind of raw data you might find in online advertising: there's a time when someone looked at an ad, the domain where the ad was hosted, some properties about the person looking at the ad, and then whether they clicked on it or not. What Druid does with raw data as it ingests it is a process known as summarization or pre-aggregation. The idea here is that if we can lose some information about time, we can gain a lot in terms of storage space. In this example we see that there are events occurring at some precise millisecond in time, but if we have billions or trillions of these events, we don't really care about a single event that occurred at a precise millisecond. So what if we just lost some information about time? In this example, if we truncate time to an hourly granularity as opposed to a millisecond granularity, then for all the rows where the set of attributes is exactly the same, we can aggregate their metrics together. There are a bunch of rows of raw data here, color-coded to show that if we just truncated our events to an hourly granularity, this is what the rolled-up data would look like. In practice, when you start doing this with real-world data, you can easily reduce the data you have to store by 40 to 100 times compared to your raw data set, and this is a significant saving in terms of storage. You haven't lost any fidelity in the data (these numbers here are exactly accurate); what you've lost is the ability to explore events at a precise millisecond in time, and this is a trade-off that you can optionally make in Druid.
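A small sketch of the roll-up step just described (not Druid's internal code; the helper and field names are mine): timestamps are truncated to the chosen granularity, rows whose dimension values then match exactly are collapsed into one row, and the metrics are summed.

```python
from collections import defaultdict
from datetime import datetime

raw_events = [
    {"ts": "2016-09-12T01:02:33.412", "domain": "ads.example.com",  "gender": "female", "clicked": 1},
    {"ts": "2016-09-12T01:41:09.007", "domain": "ads.example.com",  "gender": "female", "clicked": 0},
    {"ts": "2016-09-12T01:55:00.123", "domain": "news.example.com", "gender": "male",   "clicked": 1},
]

def rollup(events, dimensions, metrics):
    """Truncate time to the hour and sum metrics for identical dimension tuples."""
    buckets = defaultdict(lambda: defaultdict(int))
    for e in events:
        hour = datetime.strptime(e["ts"], "%Y-%m-%dT%H:%M:%S.%f").strftime("%Y-%m-%dT%H:00")
        key = (hour,) + tuple(e[d] for d in dimensions)
        for m in metrics:
            buckets[key][m] += e[m]
    return [dict(zip(("hour",) + tuple(dimensions), k), **v) for k, v in buckets.items()]

for row in rollup(raw_events, ["domain", "gender"], ["clicked"]):
    print(row)
# {'hour': '2016-09-12T01:00', 'domain': 'ads.example.com', 'gender': 'female', 'clicked': 1}
# {'hour': '2016-09-12T01:00', 'domain': 'news.example.com', 'gender': 'male', 'clicked': 1}
```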
The way that Druid partitions data, the way it shards data, is actually on a time basis; time is a very special dimension in Druid. Here we have rolled-up, summarized events across three separate hours, and Druid has created three shards. Each of these shards is called a segment in Druid, and each segment is designed to encapsulate data for a particular time interval; in this case each segment holds about an hour's worth of data. In our example we only have five events, but in the real world each of these segments might hold millions of data points. These time-partitioned segments are immutable, and the reason is that immutability confers really nice advantages: for one, we don't have to think about any contention between reads and writes, so these segments are very much designed to be read-optimized data structures. The parallelization model in Druid is that one thread, one JVM thread, scans one segment at a time, so if you want to scan more data in parallel, if you want to scan more segments in parallel, you basically just add more CPUs to your cluster. It's a nice way of being able to linearly scale the cluster.

Internally, these segments store data in a column orientation. The example events here are just edits on Wikipedia: we might have a page being edited, the language of the page, and other attributes. One of the things Druid does with string dimensions, these string columns, is use a method called dictionary encoding, and the idea is that for every string value, as opposed to storing the string directly, we store some integer representation of the string. So Justin Bieber here we can map to an ID of 0, Kesha we map to an ID of 1, and then when we actually store the page dimension internally in Druid, we store this array of zeros and ones. So as opposed to storing variable-length strings, what we end up storing is fixed-length integers, and fixed-length integers are much easier to compress. Similarly for language here: we just map the value to an ID of 0 and store an array of zeros.

The way that Druid gets very fast filters is that it actually steals some ideas from search technology. If you look at our string dimension page again, we have two string values, Justin Bieber and Kesha, and let's imagine we have six rows of data. We can create a very fast lookup which tells us which rows Justin Bieber occurs in and which rows Kesha occurs in: Justin Bieber occurs in rows 0, 1, and 2, and Kesha occurs in rows 3, 4, and 5, and we can create this binary lookup where a 1 indicates the value exists in that row and a 0 indicates it doesn't. So Justin Bieber has a 1 marked for the first three rows and a 0 for the last three rows. What's really nice about this binary index is that when you filter on page equals Justin Bieber, Druid can do a lookup and immediately know which rows pertain to Justin Bieber, so if we have to aggregate any of these metrics together, Druid knows to only aggregate the first three values and skip everything else. It's also really easy to do boolean operations with these binary arrays: if you ever want to issue a query for Justin Bieber or Kesha, you simply do a boolean OR of the binary arrays. These bitmap indexes, as they're called, or inverted indexes if you're familiar with the search world, these arrays of zeros and ones, are very easy to compress, and they're not very expensive to store either.
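Here is a compact sketch of the two ideas just described, dictionary encoding plus bitmap (inverted) indexes, in plain Python with toy data rather than Druid's actual column format.

```python
pages = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Kesha", "Kesha", "Kesha"]

# Dictionary encoding: map each string to a fixed-width integer ID.
dictionary = {}
encoded = []
for value in pages:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])
print(dictionary)  # {'Justin Bieber': 0, 'Kesha': 1}
print(encoded)     # [0, 0, 0, 1, 1, 1]

# Bitmap (inverted) index: for each value, one bit per row saying whether it appears there.
bitmaps = {value: [1 if v == value else 0 for v in pages] for value in dictionary}
print(bitmaps["Justin Bieber"])  # [1, 1, 1, 0, 0, 0]

# A filter "page = Justin Bieber OR page = Kesha" is just a boolean OR of bitmaps.
either = [a | b for a, b in zip(bitmaps["Justin Bieber"], bitmaps["Kesha"])]
print(either)  # [1, 1, 1, 1, 1, 1]

# Aggregating a metric only over matching rows: skip rows whose bit is 0.
edits = [3, 1, 2, 5, 1, 1]
total = sum(m for bit, m in zip(bitmaps["Justin Bieber"], edits) if bit)
print(total)  # 6
```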
Cool. Druid also supports a plug-in architecture, and most people use it to create their own types of computations or calculations. There are various approximate algorithms, like HyperLogLog or approximate histograms, and you can even do approximate funnel analysis in Druid. I think approximate algorithms are very powerful for very fast queries, and I think they're underutilized in the big data space, so I just want to talk a little bit about how approximate algorithms work in Druid and why they're awesome. When Druid ingests data, it does this process called summarization, which can really reduce the volume of data you have to store, but summarization relies on having the same dimension values across different events. If you have a very high cardinality dimension with a lot of unique values, such as a user ID, it basically prevents roll-up or summarization from happening, because all these values are unique and you can't roll up your data anymore. The idea behind an approximate algorithm, like HyperLogLog or many of these approximate count-distinct algorithms, is that instead of storing every single unique user ID, you store information about how many unique users you've seen over some span of time in a probabilistic data structure, usually called a sketch. The way it works in Druid is that instead of storing every single unique ID, you store users as a metric instead, and that enables you to roll up the data once again. This is really nice in that roll-up confers all the storage advantages I talked about before, and when you store less data your queries often go much faster as well. So for a 98% accurate result, what you're getting is potentially 100x less storage space and 100x better performance. A lot of people still say accuracy is very important in the big data world, but I always feel that 98% or 99% accuracy is probably good enough given all the advantages that you get from it.

Okay, going a little bit more into the architecture of Druid and how everything works. Druid has two ways of loading data: a batch ingestion method and a streaming ingestion method. The way the batch ingestion method works is that you have static data in HDFS or some other file system, and you can run a MapReduce job with Hadoop, or use Spark, to generate these Druid segments. The segments then get loaded across a set of processes or nodes called historical nodes. These guys are very simple: they know how to load segments and respond to queries over those segments, and that's all they do. In front of the historicals you have a set of broker nodes, which do query scatter/gather: queries get sent to a broker, the broker knows which segments live on which historicals and forwards the queries down, the historicals crunch answers in parallel, and the broker merges results before returning them to the caller.

Druid also supports real-time, or streaming, ingestion, and the high-level idea behind how streaming ingestion works in Druid is that you have two different data structures: a write-optimized data structure and a read-optimized data structure. If you recall what I said before, segments are immutable, but Druid has to support writes, because you have to get data in somehow, right? The way the write-optimized data structure works is that it's basically just a big in-memory buffer, and events are written to memory as fast as possible. Because this buffer is not very read-optimized, you periodically convert the data in it into the Druid segment format. So there's this forever-ongoing process of events being written to memory and converted into the Druid segment format, and while all this is happening, Druid can answer queries over both of these pieces. When you put it all together, how Druid works as a system is that you stream data into these real-time nodes, these real-time workers, which periodically create segments and hand them off to the historical nodes. The idea is that historicals are read-only, brokers are read-only, and real-time nodes are read/write, and because read/write is a much more complex operation, the real-time nodes only ever hold a small window of data; they might only hold about an hour's worth of data, while the historicals can hold years' worth. The broker combines both the real-time view and the historical view, so you have a unified view of your system. When you put it all together and do both batch processing and stream processing, you have your classic lambda architecture.
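A highly simplified sketch of the write path just described (an illustration of the two-structure idea under my own toy class names, not Druid's real implementation): writes land in an in-memory, write-optimized buffer; periodically that buffer is frozen into an immutable, read-optimized "segment"; and queries scan both the in-memory rows and the persisted segments.

```python
class ToyRealtimeNode:
    """Illustrative only: a write buffer plus a list of immutable segments."""

    def __init__(self, max_buffer_rows=2):
        self.buffer = []          # write-optimized: just append
        self.segments = []        # read-optimized: immutable, sorted batches
        self.max_buffer_rows = max_buffer_rows

    def ingest(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_buffer_rows:
            self._persist()

    def _persist(self):
        # Freeze the buffer into an immutable "segment" (here, a sorted tuple).
        segment = tuple(sorted(self.buffer, key=lambda e: e["ts"]))
        self.segments.append(segment)   # in Druid this would be handed off to historicals
        self.buffer = []

    def query_sum(self, metric):
        # Queries see both the persisted segments and the in-memory buffer.
        persisted = sum(e[metric] for seg in self.segments for e in seg)
        recent = sum(e[metric] for e in self.buffer)
        return persisted + recent


node = ToyRealtimeNode()
for i, clicked in enumerate([1, 0, 1]):
    node.ingest({"ts": i, "clicked": clicked})
print(node.query_sum("clicked"))  # 2, covering one frozen segment plus the buffer
```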
Getting data out of Druid: the native query interface is JSON over HTTP, but there are also community-contributed libraries for SQL, R, Python, Ruby, Perl, pretty much every language you can imagine. There are multiple open-source UIs: there's Pivot, which I just demoed, there's Grafana if you're from the operations world, and Airbnb actually created a dashboard called Caravel that you can use as well.

Okay, so the last two sections. The first is just some production numbers. These are from a very large cluster, and some of these numbers are public, so I just pulled them. The largest cluster I'm aware of is doing about 3 million events per second sustained, or about 200 billion events per day, which comes out to roughly 10,000 to 100,000 events per second per core. The largest known cluster is serving, I think, close to 100 petabytes of raw data at this point, probably closer to 100 trillion raw events, and when you take those raw events and summarize and compress them, it turns out to be about 500 terabytes or so of segments. At this massive scale I think Druid is a very cost-effective solution. In terms of query latency, in this cluster it's about 500 milliseconds on average: 90% of queries return in less than a second, 95% in less than two seconds, and 99% in less than 10 seconds, and the cluster gets a few hundred queries per second. This slide is actually a little bit out of date, because I'm now aware of a couple of production clusters that are at a few thousand queries per second.

Okay, the very last section is how Druid fits into this whole wacky data space as an open source solution. This is an architecture I like a lot, one I think is flexible enough to handle a lot of different types of use cases with event data. In this architecture you have events that get fed into a message bus; nowadays I think Apache Kafka is by far the most popular open source message bus. Then the data is fed to data processing systems, systems to clean or transform the data. In the stream processing world there are many different solutions: Storm, Spark Streaming, Flink, Kafka Streams, and so on and so forth. In the batch processing world there's Spark and there's Hadoop, and I think those are the two primary ones. After your data is transformed or cleaned, it can go into a system like Druid, which basically supports lambda architectures out of the box. Over time I think the batch processing piece will probably go away and you can have a really nice end-to-end streaming analytics stack. But yeah, Druid is very much designed to be complementary to all the popular open source big data solutions out there. It works out of the box with Kafka (you can actually get exactly-once consumption from Kafka), and it works with RabbitMQ, Spark, Hadoop, Flink, Spark Streaming, Storm, and so on and so forth. We're also spending a lot of time right now making Druid work really well with the SQL-on-Hadoop solutions, so we imagine a world where people can query Druid from Hive or Impala or Spark SQL or Drill or Presto, really anything else they have in-house.

Cool, so with that I'd like to conclude my talk with, hopefully, some takeaways. I think Druid is pretty good for analytic applications. I think it's pretty good for OLAP queries, so if you want to filter, slice and dice, and group data, it's a pretty good choice. I think it's pretty good at streaming ingestion, so if you have a stream of event data you want to load into a system, it has really nice properties around streams. And it also works very well with a lot of existing data infrastructure, so whatever you have in-house, Druid should work with it. If you ever want to try Druid out, we have a link up there to our quickstart, and you can play with the UI I demoed and play with the system as well. Cool, thank you everyone. We have a few minutes left, so I guess I can take a few questions.
Yes, your hand went up first. So I heard about half of that: what is the data storage engine for Druid, and is it something like a B-tree or a hash table? Okay, so the data storage engine of Druid is completely custom; it's a custom column format. There's really no B-tree in there; all data is stored as individual columns, and the indexes that get created for those columns are the inverted indexes, those bitmap indexes I showed earlier. So Druid, I guess, is a little bit different from a traditional column store in that it has a lot of ideas from search baked in as well.

Right, so the question was around how user IDs get rolled up, and whether or not that was being done for the Wikipedia example. For this example here, there's a column called unique users; let me just pick a larger time range. This unique-users metric is being computed approximately, and yes, this is rolled-up data: all the data being demonstrated here has roll-up baked in, and I believe it's being rolled up to a minutely basis. And this user dimension here, there's no roll-up on that; we're actually storing the user ID dimension, and we've also approximated it as a metric just for faster calculations, so in this example it's effectively being stored twice.

Right, so the question was around integrating Hive with Druid and how different query operations would function across the two systems; is that a good summarization? Right, so there is actually quite a bit of work being done right now around Hive/Druid integration, and the idea of that work is that Druid primarily gets used as an OLAP index for Hive. A lot of the aggregation queries, the filtering, slicing and dicing, OLAP queries, get pushed down into Druid by Hive, but for queries that Druid can't support right now, some of the row-level queries that Hive supports, the idea is that it would use a Druid select query and just pull data from Druid into Hive, where the calculation is done.

Right, so the question is around whether the event schema needs to be fixed beforehand and how Druid deals with schema evolution. The short answer is that it does not need to be fixed; there's a lot of flexibility in Druid. It's not fully unstructured data: all you have to provide Druid is a set of metrics, the things you want to aggregate, and which column is the timestamp, and then Druid can figure out what all the other dimensions of an event are. The reason metrics need to be provided is so that Druid can understand how to roll up data, if it has to roll up data. The idea with evolving schemas is that every segment, every partition in Druid, is very self-contained, so every partition has its own schema, and even for the same table, partitions don't need to have the same schema. How Druid operates across two separate partitions with different schemas is basically: let's say one partition has a column and the other partition doesn't, and you issue a query for that column across both partitions; Druid just treats all the values in the second partition as null. So a very common use case is to have a schema that evolves over time.
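A tiny sketch of that schema-evolution behavior (toy data structures and names, not Druid internals): two segments with different column sets are queried together, and the column missing from the older segment simply comes back as null.

```python
# Two "segments" of the same datasource; the newer one gained a country column.
old_segment = [
    {"ts": "2016-08-01T00:00", "page": "Kesha", "edits": 4},
]
new_segment = [
    {"ts": "2016-09-01T00:00", "page": "Kesha", "edits": 2, "country": "US"},
]

def scan(segments, columns):
    """Query a set of columns across segments; missing columns are treated as null."""
    for segment in segments:
        for row in segment:
            yield {c: row.get(c) for c in columns}  # .get() -> None when absent

for row in scan([old_segment, new_segment], ["ts", "page", "edits", "country"]):
    print(row)
# {'ts': '2016-08-01T00:00', 'page': 'Kesha', 'edits': 4, 'country': None}
# {'ts': '2016-09-01T00:00', 'page': 'Kesha', 'edits': 2, 'country': 'US'}
```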
Yes, so the question was whether the same idea applies for dictionaries: are dictionaries built within a segment or within a table? Dictionaries are built on a per-segment basis, so every segment has its own local dictionary, and the dictionary in one segment is completely distinct from the dictionary in another; Druid has logic to merge the dictionaries together. Any other questions? Yes. It's actually very, very simple: the dictionary in Druid basically takes every unique string value it finds and maps it to a dictionary ID. We haven't tried that yet; right now there's no kind of global dictionary in Druid, it's all done per partition, so each dictionary is on a per-partition basis.

Okay, right, so the question was around joins in Druid. Right now Druid supports star-schema-style joins, so if you have a large table and a small table, Druid is able to do that join pretty efficiently. If you're doing a join of a billion-row table with another billion-row table, Druid currently doesn't support that, and that's part of the work we're doing around integrating Druid with the SQL-on-Hadoop engines, because we think that over time we're just going to have Presto or Spark SQL on top of Druid, and it'll be Presto or Spark SQL that does that join. The star schema join, the large-table/small-table join, is easy to do; the large-table/large-table join is a lot more complex and a lot more resource-intensive, and that's part of the reason we haven't done it yet, but within the next quarter or so we anticipate it will probably get done. Oh, sorry, I'm getting kicked off the stage, but I'm happy to take more questions after this. Thank you.
Info
Channel: Strange Loop Conference
Views: 31,036
Keywords: Druid, analytics
Id: vbH8E0nH2Nw
Length: 45min 1sec (2701 seconds)
Published: Sat Sep 17 2016