Data modeling tradeoffs with Druid by Maxime Beauchemin

Captions
Right, thank you. Cool. So I got a really good intro, and thank you, Gian, because I was wondering how familiar people in this room would be with Druid coming into this talk. This talk is about data modeling for Druid, so it really speaks to data engineers, or people who either are loading data into Druid or are thinking about loading data into Druid, and there's a bit of a need for a foundation of understanding what is great about Druid and why you would even load data into it in the first place. I feel like Gian did an awesome job of setting the stage for this talk.

So, data modeling for Druid: this is really targeted at people whose job is, or is becoming, to load data into Druid. These are the things you should think about and consider as you do that, or at least a subset of them.

Before I get started, a little bit of context on this talk. The first bit is about me, but I've already been introduced, so there's less to say. I work at Lyft now, so I work in between these walls. I used to work at Airbnb, Facebook, Yahoo, and a place called Ubisoft, which happens to be just a block away from here; I got to San Francisco in 2004 and this neighborhood looked very different at the time.

I started Superset, and maybe an interesting fact about Superset is that it started as a hackathon project to build an open-source front end for Druid specifically. The name of the project at the time was Panoramix; Panoramix is a cartoon character who is a druid, so the name was a nod to Druid. I was coming out of Facebook at the time, and when I looked at Druid I thought, wow, this database is a lot like the Scuba backend, and we need a front end for this thing the way we had Scuba. So I did a three-day hackathon, and the first commit in the repo is that early, Druid-only version of Superset called Panoramix.

At the time I was also working on Airflow. Airflow is a workflow engine; it's a way for people to author and monitor data pipelines. I've been working on that over time; more recently I've been more focused on Superset, and now I'm getting back into loading a lot of data into Druid, which brings me to this talk.

Over time I became one of the de facto maintainers of a small library called PyDruid, and managed to get Apache committership on Druid as a byproduct of that; I feel like I'm probably the least legit Apache committer on Druid itself. PyDruid is a small library that is a client, or an interface, for talking to Druid so that you don't have to write JSON. If you're in the Python world you can use PyDruid, the DB-API driver, the SQLAlchemy dialect, and all sorts of client-type objects to interact with Druid.
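To make the PyDruid reference concrete, here is a minimal sketch of querying a Druid broker with PyDruid without hand-writing JSON. The broker URL, datasource, and column names are placeholders, and exact call signatures may vary across PyDruid versions.

```python
# Minimal PyDruid sketch -- broker URL, datasource, and column names are
# placeholders; adjust to your cluster and PyDruid version.
from pydruid.client import PyDruid
from pydruid.utils.aggregators import longsum

client = PyDruid("http://broker.example.com:8082", "druid/v2")

# Top 10 cities by number of rides over one day.
top_cities = client.topn(
    datasource="rides",                      # hypothetical datasource
    granularity="all",
    intervals="2018-08-01/2018-08-02",
    dimension="city",
    metric="num_rides",
    threshold=10,
    aggregations={"num_rides": longsum("count")},
)
df = top_cities.export_pandas()              # PyDruid can hand back a DataFrame
print(df)
```

The SQLAlchemy dialect works similarly: you point an engine at the broker's SQL endpoint and issue SQL instead of native queries, which is how tools like Superset talk to Druid.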
So I want to talk about the motivation for this talk. The problem I'm addressing is that there isn't a lot of consolidated information: if you're trying to use Druid and load data into it, the information is scattered. The Druid documentation is great, it's very solid, but it's scattered and it covers all sorts of things, and I wanted one place with all the resources for a data engineer, or any engineer, who's interested in loading data into Druid.

Here at Lyft we're investing in Druid. We think it's awesome; it supports our real-time and our geospatial use cases, so we're super excited about it, and we want to ramp up a lot of data engineers and engineers at Lyft on how to load data into Druid effectively. So I thought, why not write some documentation on this and give a talk, and instead of scoping it just to Lyft, since these are all open-source products, make it an open blog post and an open talk.

I was going to talk about Druid segments, but Gian already did. If you're going to do data modeling for a database, it's important to understand how the data is stored and how the execution engine works, and Gian covered that really well. Druid segments are fairly compressed and indexed, and I won't get into the details since we covered that already, but one important thing to note is that they're immutable. You need to know that when you're deciding whether you can use Druid for a given data set: it is an append-only store where you load data once and ideally don't touch it. We'll talk a little later about the lambda architecture, how you would go about corrections, and how to identify the cases where you might need to reload data, but you definitely don't want to think of Druid as a highly mutable database like a traditional OLTP database.

Cool, so now I'm getting into a section of tips and tricks. This talk is targeted at what you need to know if you're going to load data into Druid, but a lot of it is about how to mitigate space usage and how to summarize your data as you bring it into a database like Druid, and that isn't necessarily limited to Druid per se.

The first thing I want to talk about is segment sizes. The documentation on the Druid website says that ideally a segment should be somewhere between 300 and 700 MB. In practice there are cases where you won't hit that: especially when loading batch data into Druid, it might make sense for convenience to just operate on a daily or hourly basis, and you may end up with smaller segments. That just means proportionally more overhead on the Druid cluster; the cluster works a little harder shuffling segments around, but it's a fixed amount of overhead per segment whether your data set is very large or small, and in reality it might not be super significant. If you're driving Druid like a Ferrari and actually racing the thing, you'll want to make sure you hit that sweet spot, but you have to balance that against just getting work done and getting your data in.
It's definitely possible to tweak segment sizes. The obvious thing is to alter your segment granularity. Say your segments are really small, so you decide to use weekly granularity instead of daily; you might then want to reload that last, partial segment over and over so that the data isn't stale but the segments stay close to the optimal size. It's also possible to partition beyond time in Druid, and you might want to consider that. I won't get into the details of how partitioning beyond time works, but the documentation is fairly good; the idea is generally to identify which dimension you want to partition on and how to configure it, and there are a handful of options.

This slide is about blobs. In this day and age we have a lot of blobs: with the rise of NoSQL databases, we have all these protobuf and Thrift objects that somehow get serialized into database cells. Your source data might have columns that look like this, and the general guideline with Druid is that you probably should not load these blobs directly. Druid is not Elasticsearch; it doesn't do full-text indexing and it doesn't do full-text search. If your source data has something that looks like this, you might want to columnize some of it: look at your blob, think about what's in there, and decide whether you want to explode it into multiple rows or cherry-pick the columns or attributes that are most relevant to your data set.

There's also a fairly new feature here: Druid segments used to systematically store a bitmap index for every single dimension column, and now you can switch that off. So if you're going to load some text data, you can at least turn off the bitmap index for those columns. Another recommendation: if you're considering loading a blob or a large chunk of text into Druid and you're concerned about performance, load a segment with and without the blob to get a real sense of how much that blob actually costs, how much it bloats your segments. In general with Druid, when you're thinking about aggregation, summarization, or the cost of adding or removing a column, the easiest way to find out is often to load the data with and without it and compare the segment sizes. There are these "smoosh" files, which I won't get into, and other ways to get more metadata about what's inside your segments, but the easy approach is to just try it with and without and compare the results.
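As a rough illustration of the knobs mentioned above, here is what the relevant fragments of a batch ingestion spec might look like, expressed as a Python dict before being serialized to the JSON the indexer expects. The field names follow the Druid ingestion docs as I understand them; the values are placeholders.

```python
# Sketch of the spec fragments discussed above, as a Python dict that would be
# serialized to JSON for an indexing task. All values are placeholders.
spec_fragment = {
    "granularitySpec": {
        "type": "uniform",
        # Weekly segments instead of daily, if daily segments come out too small.
        "segmentGranularity": "WEEK",
        # Truncate timestamps to the hour at ingest time.
        "queryGranularity": "HOUR",
        "intervals": ["2018-08-01/2018-08-08"],
    },
    "dimensionsSpec": {
        "dimensions": [
            "city",
            "platform",
            # For a large text-ish column, skip the bitmap index.
            {"name": "free_text_notes", "type": "string", "createBitmapIndex": False},
        ]
    },
}
```

Partitioning beyond time is configured separately, through the `partitionsSpec` of the tuning config, which the Druid docs cover in detail.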
Here I want to talk a little bit about the batch ingestion framework we wrote at Lyft to make it easy to load data into Druid. The thing is, if I handed someone a laptop and said, here's a CSV file, or here's a file in Hadoop, go load it into Druid, they would waste a day or two trying to do it and probably hate their life; it's very painful to do with just the basic APIs. So I think any company that is serious about investing in Druid will have some sort of abstraction on top to commoditize it, to make it easy for anyone in the company who wants to load a data set to do so.

We wrote something called druid-ingest, which I'll describe very quickly and which we may open source later. I know this slide is hard to read from where you are, but what it shows is just a Python static object that holds configuration. It's really configuration as code, where we assume that you're going to load data from, in our case, Hive into Druid. For us, Hive is really just Parquet files in S3, plus the Hive metastore, which tells us where all of the partitions are located in S3 and what the schema looks like. So instead of writing the large JSON files that the Druid API asks for, druid-ingest defines configuration as code: it has nice defaults, it does some validation, it has some safeguards, and it's all centralized in one place. Instead of having your JSON files and your HQL files scattered around a repo somewhere, it's all in one place. druid-ingest also offers a small CLI that lets you execute your specifications and actually load your data quickly and easily, and it has a bunch of utility methods that can be used outside the context of druid-ingest to orchestrate your workflows.

On top of druid-ingest we have an Airflow DAG which reads all of the Druid ingestion specifications and, out of that, weaves a DAG of execution. Airflow is just an orchestrator that takes your Druid ingestion specification and generates a set of jobs to run every day, every hour, or however you configure it. Out of Airflow we get a lot of things that you also need: if you're going to load data into Druid every day, you will need failure alerts, retries, SLA monitoring, and interactions with upstream jobs, because you might want to prepare your data before you load it into Druid. Airflow offers all of these things, including concurrency limits, so that we don't hammer the Druid coordinator too hard, for instance. That's all stuff we get for free out of Airflow.
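druid-ingest isn't open source at this point, so the following is purely a hypothetical sketch of what configuration-as-code for Druid ingestion tends to look like; the class and field names are invented for illustration and are not the actual druid-ingest API.

```python
# Hypothetical configuration-as-code sketch -- class and field names are
# invented for illustration; this is not the actual druid-ingest API.
from dataclasses import dataclass
from typing import Dict, List
import json


@dataclass
class DruidIngestionSpec:
    datasource: str
    hive_table: str                  # used by the orchestration layer to find source data
    dimensions: List[str]
    metrics: List[Dict[str, str]]    # Druid metricsSpec entries
    segment_granularity: str = "DAY"
    query_granularity: str = "HOUR"

    def to_index_task(self, interval: str) -> str:
        """Render the JSON blob the Druid indexing API actually expects."""
        task = {
            "type": "index_hadoop",
            "spec": {
                "dataSchema": {
                    "dataSource": self.datasource,
                    "parser": {
                        "parseSpec": {
                            "timestampSpec": {"column": "ds", "format": "auto"},
                            "dimensionsSpec": {"dimensions": self.dimensions},
                        }
                    },
                    "metricsSpec": self.metrics,
                    "granularitySpec": {
                        "segmentGranularity": self.segment_granularity,
                        "queryGranularity": self.query_granularity,
                        "intervals": [interval],
                    },
                },
            },
        }
        return json.dumps(task, indent=2)


# One centralized, validated spec instead of JSON and HQL files scattered around a repo.
rides = DruidIngestionSpec(
    datasource="rides",
    hive_table="default.fact_rides",
    dimensions=["city", "platform", "ride_type"],
    metrics=[
        {"type": "count", "name": "count"},
        {"type": "longSum", "name": "sum_rides", "fieldName": "rides"},
    ],
)
print(rides.to_index_task("2018-08-01/2018-08-02"))
```

An orchestrator like Airflow can then iterate over all such spec objects and generate one scheduled task per datasource, which is essentially the pattern described above.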
So, joins. I said I was going to talk about joins a little bit. There used to be a time when joins were not supported in Druid, and that was kind of nice in a way. When you think about data modeling, and I don't know how many people here know much about data modeling, but I spent ten years as a data warehouse architect in a previous life, data modeling is all about thinking through how to organize your tables and how to prepare your data so that it joins nicely. You have to think a lot about normalization and denormalization, third normal form, star schemas, snowflaking your dimensions; that stuff is an art, and it's extremely complex. When you're given a database that does not support joins, you think, wow, this is easy: I just need to denormalize everything and cram it into one dirty flat table. That's easy as a data modeler and easy for the people writing queries, who only have to ask which data source to use and which columns they need; there's no "how should I join this to that, and under which circumstances."

But denormalizing also has some issues, and I tried to find some examples of where. Here's a case: say you're interested in doing analytics on which artists are popular, who's playing which artists on what platform. Here we can see that an attribute of the artist dimension, the name, keeps changing; Snoop Dogg, essentially, changes his name all the time. If you denormalize the name of Snoop Dogg into your fact table and then try to group by artist name, you'll get a set of results where Snoop doesn't top the chart and doesn't show up as the most popular artist, even though if you aggregated it the right way he would be. This issue with denormalization is real. It's a real problem with databases like Druid, or flat tables in general: if an attribute of a dimension changes over time, it stays frozen in the fact table, and the only way to clean that up is to rewrite all of your segments, while the whole point of an immutable store is to never rewrite segments. The problem is not most prevalent with Snoop Dogg specifically, but it's very common: it applies to every case where you're interested in a dimensional attribute and you want its latest value, as opposed to its value at the time of the event. People call this a type-one slowly changing dimension, where all you care about is the latest value of the dimensional attribute.

Druid does support query-time joins now. There's a whole API for this; I say new, but I looked at the Druid docs and was happy to find it has evolved and we can now do all these things. There's a way to publish lookup tables to the Druid cluster, and as a data modeler that's a tool you can use: if you have use cases where it's important to get the latest dimensional attribute, this is pretty much your only option short of reprocessing all of your segments every day. There are two types of lookups: injective and non-injective. The injective kind requires a one-to-one mapping, meaning the values need to be unique on each side, and those lookups are executed on the broker, which you can think of as the reduce phase of a Druid query. The broker can perform the lookup itself, which is much cheaper, because instead of broadcasting the join to every node participating in the query, only the broker has to do it. The non-injective lookups are the more expensive ones: they are broadcast to all of the nodes in the execution plan, and they can be one-to-many.
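As a concrete example of this query-time join, here is roughly what using a registered lookup as a dimension extraction looks like, written as the Python dict you would drop into a groupBy or topN query. The lookup name, dimension, and output names are placeholders; check the lookups documentation for your Druid version for the exact shape.

```python
# Group by the *current* artist name via a registered lookup, instead of the
# name that was denormalized into the segments at ingest time.
# The lookup name, dimension, and output name are placeholders.
group_by_latest_artist_name = {
    "type": "extraction",
    "dimension": "artist_id",
    "outputName": "artist_name",
    "extractionFn": {
        "type": "registeredLookup",
        "lookup": "artist_id_to_name",
        # Declare a one-to-one mapping as injective so the lookup can be
        # applied on the broker instead of being broadcast to every node.
        "injective": True,
    },
}
```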
All right, lambda and compaction. I believe I saw something about compaction in what's coming up next for Druid, so there are people working on it, but let me describe the problem a little bit. In the Druid architecture, and this slide is unreadable from where you're sitting, there are the real-time nodes. You can imagine they're constantly receiving a stream of data into what I would call a write-optimized store: these nodes ingest a stream of data and answer queries at the same time, and every once in a while they need to flush, so they take a segment, publish it to deep storage, and tell the coordinator about it. That's an atomic operation and it works well. The problem is that the real-time nodes don't do a super good job of creating segments, because there are many of them, and the segments they write will typically be smaller and more fragmented than they should be. So real-time ingestion leads to suboptimal segments.

There are also some core issues around the fact that we have this time index in Druid: late-arriving facts can be troublesome. At some point you need to write the segment; if you wait too long, you have too much data sitting on the real-time node side of things. So in some cases you might have to drop some late-arriving facts and hopefully fix them later with the lambda architecture. And some columns might not be available in real time, or the join might be too expensive or too complex to bother with at ingest time. An example of that at Lyft: the invoicing information for a ride comes much later, after the ride is closed. The invoice might not happen right away, and when it does we would love to have it in that data set, but we can't afford to wait for the invoice to close.

Those are all things you can fix with the lambda architecture. I think I forgot to say what the lambda architecture is; I was assuming everyone knew, but it's the idea that you stream data in on one side and then correct the data in the background using batch. These historical corrections can tackle the more complex business rules, they can fix these issues, and they can defrag your segments while doing so. I don't think there's any automatic mechanism for that, though I'm not sure where Gian is at on this; are there any mechanisms to compact segments? Okay, so this is changing fast. Gian is saying that this is rapidly evolving and that the problem of small, suboptimal segments is already being solved, partly through the Kafka indexing service. There are still the other reasons why you might want to do lambda anyway, like the invoice-information problem we were talking about, or some very special rides that get cancelled, say, twelve hours later, or other late-arriving facts you want to correct. So it's good to know that you can do this lambda-style correction and that it can be part of your pipeline.
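For the defragmentation side, recent Druid versions have a compaction task type. A minimal task spec might look like the sketch below, written as a Python dict with placeholder names; the exact fields available depend on your Druid version, so treat this as an assumption to verify against the docs.

```python
# Minimal compaction task sketch -- datasource and interval are placeholders,
# and available fields vary by Druid version; check the docs for yours.
compaction_task = {
    "type": "compact",
    "dataSource": "rides",
    "interval": "2018-08-01/2018-08-08",   # rewrite this week's small segments
}
```

A task like this is submitted to the overlord's task endpoint like any other indexing task, and it rewrites the segments in that interval into fewer, better-sized ones.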
All right, now I'm getting into summarization, and I'm going to try to accelerate because there's a lot to cover. Why summarization? Because holding a large amount of data in memory is very expensive, and at some point it defies the limits of physics: if you have a gigantic data set, there's no way you can store it all in memory. So at some point you have to think about how to mitigate the space problem, and the two main ways to deal with it are aggregation and sampling.

Talking about aggregation: it's all about containing the Cartesian explosion. A Cartesian product is when you take two dimensions and make a matrix out of them; you look at all your products times all your customers, and with a hundred customers and a hundred products you have ten thousand combinations, and you quickly end up with too many rows. What's interesting is how quickly this explodes: if you have twenty action types, two hundred car models, the number of hours in a year, and three thousand counties in the US, you end up with hundreds of billions of rows. The story is actually a little more complicated than that, because data is not always dense; that matrix is not full, since some car models don't exist in certain counties or aren't active during a given hour, and it's really hard to predict how that plays out. What I know for a fact, and I have a slide on this, is that there's an asymptotic curve: as you add more dimensions, you very quickly lose out on your aggregation ratio. With just one dimension the data aggregates very nicely; with two dimensions it's a lot less aggregated; and you keep getting closer and closer to your original number of rows. If your original data set has a billion rows, it won't take that many dimensions with that much cardinality to get close to a billion rows in your aggregated set. So in this section of the presentation I'm talking about ways to mitigate that kind of Cartesian explosion, and that applies way beyond the scope of Druid; it's relevant from a data engineering perspective regardless of whether you load the data into Druid.

I also wanted to say that Druid is a little confusing at first because there are two levels of aggregation. One is on ingest: Druid is one of the only databases that assumes you want to roll up on ingest, and that can be a little tricky, because when you run queries on aggregated data you definitely don't want to take the min of a max or the sum of an average. So you have to be a little careful with that. One thing to note is that you can now switch off roll-up as you load data into Druid, so if for one reason or another you want atomic events that don't aggregate, you can say "do not aggregate on ingest." In our aggregation framework we label the aggregate functions in the column names: if we do a sum on ingest, we label the column something like sum_rides or min_rides, to make it clear how the data was loaded.

Chopping off the long tail, I'll go quickly on this: if you have 250 countries, a small portion of those countries contributes a large portion of your data, so you can keep the detail for those and chop off the long tail. You can do that dynamically or in a fixed way; just be careful about the flickering that can happen with dynamic long-tail chopping, where countries close to, say, the top 50 move in and out of the long tail, which can confuse users.

Bracketing is pretty simple. The thing I wanted to say about bracketing is that in some cases, depending on the density of your data set, going from storing age to storing age brackets can save you as much as a factor of five or ten on your data, which is a big saving.
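Pulling together the roll-up and column-naming points from this section, here is a sketch of the relevant ingestion spec fragments as a Python dict. The column and metric names are placeholders; the `rollup` flag and aggregator types follow the Druid docs as I understand them.

```python
# Roll-up at ingest, with the aggregate function encoded in the column name so
# nobody accidentally takes a sum of an average later. Names are placeholders.
rollup_fragment = {
    "granularitySpec": {
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": True,          # set to False to keep atomic, un-aggregated events
    },
    "metricsSpec": [
        {"type": "count", "name": "count"},
        {"type": "longSum", "name": "sum_rides", "fieldName": "rides"},
        {"type": "longMin", "name": "min_rides", "fieldName": "rides"},
    ],
}
```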
I think Gian mentioned sketches a little bit, and I'm running out of time so I won't get very deep into them, but sketches are extremely cool. There's a Yahoo library that Druid uses called DataSketches. Sketches are lossy data structures that help with compressing data, saving on storage and computation, for questions like distinct counts, distributions, sampling, and frequent items. They do have some complexity implications that I'm going to skip for now; if people are interested in talking about sketches, I'll be around after this talk.

Grouping sets: I just saw that they're adding grouping sets to the SQL layer. One thing you can do, and it's not something we typically recommend for a database like Druid, where we want to keep all our dimensions and let people slice and dice left and right, is to prepare different levels of aggregation up front. That's where you'd say, I'm going to load this data grouped by these three dimensions, and also by these four dimensions, and provide these different cubes for people to query.

All right, getting into sampling. Sampling is nice because if you decide to store only, say, 1% of your data, you don't have to remove any of the operational detail and you don't have the Cartesian explosion problem. You don't have the problem of "if I add this new dimension, I can't afford this data set anymore because it's going to grow 10x." So sampling lets you stop worrying about everything I just said about Cartesian explosion.

If you do sample in Druid, it's really nice to add a column to your data set called sample_rate or sample_multiplier. This allows you to change the sample rate over time, and it also allows you to sample some of your data more generously. An example: we dogfood our product here at Lyft, and for employee data, which is very little data, we might not want to sample at all, because we want to look at what all employees are doing and not just a subset of them. So you can have some rows that are sampled more heavily than others. The process of up-sampling is pretty simple: you create metrics, or post-aggregators, that multiply the sample multiplier by the indicator you're interested in, and eventually tools like Superset could find these columns and do it on your behalf.

Using normalized metrics: I'm not sure I'm using that term correctly, but a normalized metric is a metric that relates to the same scale regardless of whether you're sampling 1%, 10%, or 20%. Things like the p99 page load duration, crashes per user, or the percentage of cancelled rides. All right, I need to accelerate, I think that was my timer, but essentially the idea is that when you're sampling, these normalized, bounded metrics let you make sense of your data regardless of whether it's sampled.

Hash-based sampling I'm mostly going to skip, but typically when you sample, you can either sample, say, 1% of the rows, or you can do something like taking 1% of the users: you hash the user ID and keep 1% of your users. That allows you to do things like count distinct users, which you couldn't do if you sampled rows randomly. The downside is that you can only pick one column, one subject, to do that with per data set.
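A minimal sketch of the hash-based sampling idea in Python; the hash function and the 1% threshold are arbitrary choices for illustration.

```python
import zlib

SAMPLE_PCT = 1  # keep roughly 1% of users

def keep_user(user_id: str) -> bool:
    """Deterministically keep a user if their hashed id falls in the sample bucket.

    Sampling whole users (rather than random rows) keeps count-distinct-user
    style metrics meaningful on the sampled data set.
    """
    return zlib.crc32(user_id.encode("utf-8")) % 100 < SAMPLE_PCT

# Every row for a kept user is retained; every row for a dropped user is dropped.
rows = [{"user_id": "u123", "ride_id": "r1"}, {"user_id": "u456", "ride_id": "r2"}]
sampled = [r for r in rows if keep_user(r["user_id"])]
```

A sample_multiplier column (here it would be 100) can then be stored alongside each kept row so queries can scale additive metrics back up.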
All right, I think I'm getting to my conclusion a few minutes late here. Talking about what's on the horizon: I've been working on a blog post that I was hoping to push out before this talk, but it will come after. It's going to be titled "Data modeling for Druid," and it will cover everything in this talk and probably much more. One thing I did not talk about is the "smoosh" files, which are internal Druid files that hold a lot of segment metadata. They seem like a treasure chest of information that we'd love to make more visible, so hopefully that becomes more accessible, or we can build some tooling to introspect these files and open up that treasure so people can really understand what their segments are made of. And the Druid docs are super great; most of this information comes from the Druid docs and from the Google group. The community is super active and there's tons of information out there. So thank you everybody, that's all I have for today.

[Applause]

I'll get to questions. Anybody loading data into Druid these days? There's one over there. Yes, so, as you aggregate data, there are different options in terms of loading it into Druid. You could do whatever aggregation you want outside of Druid first; for us, we'll often roll stuff up in Hive, or we might apply some aggregate functions there. The array of aggregators is very well documented in the Druid documentation, and there are the sketches, like the hyperUnique / HyperLogLog aggregators, for the more complex cases. They have pretty much all the aggregators you would expect to find, except for average, because of that two-level aggregation thing: you never want to take an average until the very last moment. In your query you want to get the sum of the numerator and the sum of the denominator, and only at the very last moment compute the ratio. But check out the Druid docs on aggregators; they have the full list of everything that's supported, and now with SQL you have everything you would expect to find. All right.

All right, give it up for Max Beauchemin.

[Applause]
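To make the averages point from the Q&A concrete, the usual pattern is to store the summed numerator and denominator at ingest and compute the ratio only at query time, as a post-aggregation. A sketch of that, as a Python dict with placeholder column and metric names:

```python
# Store sum and count at ingest; divide only at query time, via a post-aggregator.
# Column and metric names are placeholders.
avg_fare_query_fragment = {
    "aggregations": [
        {"type": "doubleSum", "name": "sum_fare", "fieldName": "fare"},
        {"type": "longSum", "name": "num_rides", "fieldName": "count"},
    ],
    "postAggregations": [
        {
            "type": "arithmetic",
            "name": "avg_fare",
            "fn": "/",
            "fields": [
                {"type": "fieldAccess", "fieldName": "sum_fare"},
                {"type": "fieldAccess", "fieldName": "num_rides"},
            ],
        }
    ],
}
```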
Info
Channel: SF Big Analytics
Views: 1,306
Rating: 5 out of 5
Keywords: data modeling, druid, apache druid, data engineering, operational data stores
Id: xtYI67fy0wc
Length: 34min 24sec (2064 seconds)
Published: Thu Aug 23 2018