How Superset and Druid Power Real-Time Analytics at Airbnb | DataEngConf SF '17

Captions
All right, so I'm going to get started pretty quickly because I know we're running a little bit late this morning. Today I'm going to talk about how Superset and Druid power real-time analytics at Airbnb, but before I get started on that I'm going to introduce myself briefly. I noticed that sometimes the microphone goes out, like when Sid was presenting and would turn his head to the side, so I'll try to be a little bit more robotic and not sway.

My name is Maxime Beauchemin; I go by Max to make it easy on everyone. I work at Airbnb. I started this other project which you might be familiar with, called Apache Airflow. Fun fact about it: Sid reached out about the project, we spoke together, and he kind of pushed me to go and join the Apache Software Foundation, so Airflow joined the ASF, and Sid has been super great. You guys let me know if I need to use a proper microphone. There has to be at least one technical difficulty here; that's just the rule. Hopefully, if this was the technical difficulty, my demo is going to go well.

So, back to Airflow. More recently I started another project called Superset, which I'm going to be talking about today, and I'm going to try to set Superset in the context of how we use it at Airbnb, and specifically how we use it in the case of real-time data with Druid. An interesting fact is that Superset is also joining the Apache Software Foundation, so soon it's going to be called Apache Superset. I used to work at Facebook, and before that at Yahoo and at Ubisoft. I'm at a point where I don't really say how long I've been working with data, because it's just been too long now; I've definitely been working with data since before it was cool.

I just squeezed in this new slide this morning because people were talking about data engineering and I
realized that it was DataEngConf today. Recently I wrote this blog post called "The Rise of the Data Engineer" that's on Medium, and it got super popular. Who in the room has read that post? Okay, that's not bad, and it leaves room for more people to go and check it out. In this blog post I was trying to really define what data engineering is all about, how it relates to, say, previous related jobs like business intelligence engineer or data warehouse architect, and where it fits in with data analysts and data scientists in the modern world. So if you're interested, you can check that out. I put up some analytics — not to brag about the readership, but it's pretty cool that Medium has this data pipeline and this data product where, as the author of a post, you can see how many reads you got. I think it's relevant in this context because you can imagine the whole pipeline and everything that takes place behind the scenes for this to happen.

All right, so this is the quick agenda for this talk. First I'm going to talk briefly about Druid and say a few words about what it's about, what kind of database it is, and what guarantees and properties it has. Then I'm going to quickly introduce Superset from a high-level perspective. Then we're going to do a somewhat deep dive into Airbnb's data infrastructure, with a focus on real time. And then this presentation is going to be largely a Superset demo — I really want everyone in this room to get a good feel for what Superset does and does not do. One of the takeaways I want people to walk away with is an answer to the question: is Superset something I want to use in my environment? Hopefully by listening to this talk you'll get a sense for that. Then we're going to take a little bit of a look at what's under the hood of Superset and how it works
and why we built it, and how we built it. Hopefully we'll have time for questions; if not, for every speaker today there's an office hour in a room that's just as we come out, on the right at the end of the hallway, so you can come and ask questions there and I'll be happy to answer questions about my talk, or about Airflow if you're interested in that too.

Cool. So I'm going to start with Druid, just to set the foundation, because it's a really important part of our real-time infrastructure at Airbnb. Druid is a blazing-fast, real-time, distributed column store. It has more buzzwords than that, but this is as many as I wanted to put in here. It has an open source community that is really thriving, and it's been battle-tested at large companies. I believe some of the people from Metamarkets are going to be speaking at this conference, and I'm sure they're going to mention Druid — Metamarkets was the mother company for Druid. Sorry, I'm a little congested this morning. It's also battle-tested at Yahoo: Yahoo has a very large install, they're very bullish on Druid, and they also use Superset. And there are presumably hundreds of companies running Druid.

So Druid does support real time, and you can also load data in batch form; we're going to talk a little bit later about how we use some of these features. It's a column store, it is heavily indexed, and it is horizontally scalable. The way a Druid cluster works, there are a lot of specialized nodes that take on different tasks, which means that if you want to really optimize and control that infrastructure and ramp up certain classes of nodes on your Druid cluster, you can do so. It assumes an OLAP-type workload, contrary to other NoSQL databases like, say, HBase or Cassandra, where in some cases the database assumes that you know ahead of time the kinds of queries you're going to run, and that you're going to
leverage the primary index it has and structure your data a certain way. Druid, by contrast, really supports arbitrary queries, and it will perform well whatever you filter on or group by, that sort of thing. So you can really load your data into it, ask all sorts of questions, and Druid will perform well.

Druid also has deep support for sketches. A sketch is a structure that usually represents a probabilistic count-distinct set, and it scales really well. Often when you do web analytics, or other kinds of analytics, you need to do count distinct on user IDs, count distinct on user pairs, things like that. That's really useful, but it's usually pretty compute-intensive — a hard problem for this type of database — and Druid has really good support for sketches. It's just really good, and it's a feature that we leverage quite a bit.

So, Superset. Just a quick intro on what Superset is before we get into the demo. Superset is a modern, enterprise-ready business intelligence web application. I hate the term "enterprise-ready," and I also don't like the term "business intelligence" — those are kind of antiquated and very corporate — but they still describe what Superset is pretty well. "Enterprise-ready" really means it is secure, and you can create roles and give different people access to different databases, schemas, or tables. "Business intelligence" because it is a data consumption application: the idea behind Superset is really to explore your data, create interactive dashboards, and share discoveries. So in the open source space we're kind of taking on, head-on, vendors like Mode, Periscope, Looker, and even Tableau.

All right, so now I'm going to get into Airbnb's data infrastructure. This slide is the high-level data infrastructure, and it's not specific to real time; I'll have another slide after this where we explode the real-time portion of it, and I'm going to talk quite a bit about how
real-time works. So at Airbnb we use Airflow, of course, at the very top: this is how we schedule pretty much all of our batch jobs, and it's also how we spin up, in some cases, some real-time services as well — we can abuse Airflow into running long-running services, because somehow we can, and we do. Then the main data sources that we have at Airbnb are MySQL scrapes, which we Sqoop and load into our gold Hive cluster, as well as event logs that we stream into Kafka and also land into HDFS and our main Hive cluster.

We have two Hadoop clusters, for high-availability reasons. The gold cluster is the one that has all of our raw data and all of our high-SLA, super-important pipelines — the pipelines that the data engineering team creates to build the core components, or core schemas, of the warehouse. Silver is a replica that we can use for failover or disaster recovery, and it's also the sandbox and playground of our data scientists, data analysts, and information workers who want to run queries, derive data, and do all that crazy stuff that data scientists do with data. We use HDFS, and increasingly we use S3 as a distributed file system too, because it's cheaper and it's low-maintenance. And we use Spark quite a bit: a lot of our ETL and data processing is done using Hive and HQL, and increasingly we use quite a bit of Spark for all sorts of batch processing.

Now, the bottom layer is more toward consumption. Of course there's Superset — I kind of made it look bigger because that's what we're talking about today — but you can see that Superset sources data mostly from Druid. You can see the pipeline here, where the event logs go through Kafka and make it into Druid, and then from Superset we can analyze that data in real time. We also take data from Hive, or from Hadoop, in some contexts. The very hot data, or the data that is
usually our core metrics and core dimensions — large datasets that are too slow to consume in Presto because they're big and important — those we load into Druid as batch too, and we consume some of that data through Superset. And of course Superset works well with Presto directly.

Then we have these two other data consumption and visualization tools. Airpal, which we're deprecating, is an open source project that is a SQL IDE on Presto; we're replacing it in favor of Superset, because Superset is a superset of Airpal — we support all the features that Airpal does — so we're kind of going end-of-life with Airpal. And Tableau, which we use decreasingly, but we still have a lot of good use cases for it internally; if we have time, or maybe in questions or in the office hours, I can talk a little more about when we use Superset versus Tableau.

A little bit more about the event logs here before I move on to the real-time portion. For event logging at Airbnb — I wish we used Avro, which Sid talked about earlier — we use Thrift to enforce schemas. For everything that is logged at Airbnb there is a Thrift file which defines the schema of what we log, and that allows us to enforce that the messages we log are consistent and predictable, which allows for better integration down the line.

Cool. So now this is looking more into real time, or streaming, at Airbnb. Today we're mostly interested in the flow that goes from event logging to Kafka to Spark Streaming to Druid — I'm going to have another slide on this, and it's what we're focusing on today — but I first want to talk about the other ways that we do real time at Airbnb. The first one is how we stream the MySQL binlog. The MySQL binlog is the event log from our production MySQL databases. We push that into
Kafka with something called SpinalTap, then into Spark Streaming, and then into HBase. So we replay the MySQL binlog into HBase, which sounds kind of crazy, but it allows us to have in HBase a current snapshot of what's going on in production, with very low latency. And there's this cool thing with Presto: Presto is a distributed query engine that works very well with the Hive metastore and Hive tables and structures, but Presto also knows how to talk to other databases. There is a Presto connector for HBase, which means we can write a Presto query that queries, essentially, the equivalent of our MySQL production users table in HBase, and that's useful because we can join that table with a Hive table, or with anything else that Presto knows how to query.

But the main advantage of streaming the binlog into HBase is that we can take instant snapshots of our MySQL tables into HDFS. The reason that matters is the way you would otherwise do it: if you want to Sqoop your MySQL database into Hadoop, the way you might naively do it is to take a backup of your MySQL database at a point in time — say midnight — restore it into some temporary MySQL database so that you have a static, restored snapshot, then Sqoop the data out and load it in. At Airbnb we got some mileage out of that technique, but at some point our data SLAs said we had to load core data — the core portion of our warehouse — by 9:00 a.m.
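To make the binlog-replay idea concrete, here is a minimal sketch of why replaying a change stream into a key-value store always leaves you with a current snapshot. This is illustrative only — the real pipeline is SpinalTap feeding Spark Streaming into HBase, and the event format below is made up:

```python
# Hypothetical sketch: replaying binlog-style change events into a
# key-value store keyed by primary key. The real system does this with
# SpinalTap -> Kafka -> Spark Streaming -> HBase; the event shape is invented.

def apply_binlog_event(snapshot: dict, event: dict) -> None:
    """Mutate `snapshot` so it reflects the table state after `event`."""
    op, pk = event["op"], event["pk"]
    if op in ("insert", "update"):
        snapshot[pk] = event["row"]        # latest row image wins
    elif op == "delete":
        snapshot.pop(pk, None)

# Replaying the full event log yields the current table state, so "taking
# a snapshot" is just copying the store -- no backup/restore cycle needed.
users = {}
events = [
    {"op": "insert", "pk": 1, "row": {"name": "alice"}},
    {"op": "insert", "pk": 2, "row": {"name": "bob"}},
    {"op": "update", "pk": 1, "row": {"name": "alice-renamed"}},
    {"op": "delete", "pk": 2},
]
for e in events:
    apply_binlog_event(users, e)
```

Because the store always holds the latest row image per primary key, snapshotting reduces to a cheap copy, which is what makes the midnight backup-restore-Sqoop cycle described above unnecessary.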
And it was just becoming impossible to do that, because the whole snapshot-and-restore process for MySQL was taking so much time: just taking the backup, restoring it, Sqooping the data out, and loading it in took ten hours or something like that, so we had no time left to do our actual compute and ETL. Now that we have this, it's really easy to take a snapshot from HBase into HDFS, and we can do it instantly: at 12:01 a.m. we've got our snapshot and we're ready to run our ETL.

So that was quite a section on some of the streaming, real-time stuff that we do. You can see that we also use this thing called Datadog — who here is familiar with Datadog? Datadog is a vendor, and they provide real-time analytics for your servers. You can log your own events with it, though we don't tend to do a lot of that; for logging our own events we tend to go the Druid route. But Datadog has an agent that runs on all of our machines and gathers information about CPU load and all sorts of very operational metrics, and we stream that data to them — Datadog being a vendor — and then we're able to analyze it and do ops-type alerting on it. So that's another component of real time at Airbnb that is somewhat complementary to the rest of what we have.

And then we have the Druid use case with Superset — you can imagine Superset right behind Druid here — and I'm going to zoom in on how we load data into Druid. Here I wanted to give you a pointer to another talk, by colleagues of mine who have been setting up a lot of this infrastructure at Airbnb. Jingwei and Liyin gave a talk at Spark Summit sometime last year, or fairly recently anyway, and the video is on YouTube, so if you want to learn more about the real-time data infrastructure at Airbnb, I would recommend you go and
check out this talk if you're interested.

Now I want to talk about the little abstraction we have for loading data into Druid. It's really not that easy to load data into Druid, and it's a common pattern at Airbnb: we need to get data in there, and we need people to be able to specify how they want to get it in there. If we were to just say, "write your own Spark Streaming job and figure out what the port and hostname are for the Druid server," that wouldn't scale very well. So we have this simple abstraction called AirOLAP, and I'm going to briefly describe how it works before we get into the Superset demo.

There are two use cases: we load data into Druid from Kafka in real time, and — for the hot data in Hive that we want to make really fast and really interactive — as batch, which is the second use case here. Hopefully you can read this from as far back as you are, but the way it works is that we ask people to set up a config file. The format here is called HOCON, which is a superset of JSON. It's a very simple config file that describes the dimensions and metrics you want to load into Druid, which Kafka topic and Thrift schema to reference, what the data types are, and how the data should be loaded. You can imagine setting up, say, a sampling parameter — "I want to sample 1% based on user ID" — and things like that. So this is how you configure the data that's going to flow into Druid, and it's fairly easy for any engineer or data scientist to drop a new HOCON file in this folder; that's all you need to do for it to get picked up by the Spark Streaming job. The Spark Streaming job re-reads the folder, discovers the file, and based on what it finds it starts loading data into Druid.

On the right side is a very similar config file for loading a batch process into Druid. Here the main parameter is a SQL statement — an HQL statement — because this is assuming that we're going to run a Hive query, and the output of that Hive query is going to get loaded into Druid. This is just the batch ingestion portion, so you can imagine that it would also have the first section shown on the left. You also define a number of shards and your timestamp column, because Druid has a primary index on a time or date type column — it assumes time-series data. Then you define your sources, and once you've defined this, the Airflow job discovers the config file, creates a new mini-DAG or new tasks in your workflow, and starts loading data into Druid automatically.

All right, so now we're entering the Superset portion of this presentation. I want to talk a little bit about the original vision, how it started, and why a company like Airbnb would build a business intelligence web application instead of just buying one. Part of the original vision was that we wanted to do real time, we wanted to bring in Druid, and there were no tools out there that worked with Druid at the time. I believe there are a few options now to query and visualize data out of Druid, but back then, if you wanted to query Druid, you had to write your own JSON and your own little application. So for a hackathon one day — where other people had been working on test-driving Druid and doing a POC internally — I decided to build a set of tools to visualize the data that was inside Druid. I'd been playing with all sorts of d3 examples, which are kind of represented here on the right side, and it's always a pain to take your own data, take a d3 example, munge your data into the right JSON format, and load it in — and then you've got a static HTML file on your desktop, on your local computer, with a data visualization you don't know how to share. So I was like, there's got to be a way for someone to build a tool where you can
easily query any database out there, including Druid, do a microscopic amount of work, and get to any visualization. That was the original scope of the project: what we call the Explore view, which I'll demo in just a little bit. Eventually it grew in popularity at Airbnb, on top of Druid and — once I made it work with Presto — on top of Presto too, and a lot of users started using it instead of Tableau, just because it was easier, lower-friction, and very easy to put something together with.

One thing I'd like to say — a trend that I've seen, having worked in BI for a long time — is that the lifecycle of a dashboard or dataset seems to shrink over time. That means that when you build a dashboard today, it might be the new hot thing in your company for a few weeks, and people will consume it and go to it every day, until there's a new hot thing: maybe people are talking about a new analytics problem, a new dashboard gets built, and people's focus shifts very fast. It used to be that you'd build *the* company dashboard and it would be in production for five years; now it's more like you build your own little dashboard for your team, you use it, it drives some decisions, and at some point you move on and start looking at a different dashboard that represents the flavor of the week, if you will. So I think it's important for tools like Superset and Tableau to make it really low-friction to build dashboards, because dashboards have short lifecycles.

So this brings us to the live demo portion — we're going to hope that everything goes well. For context, this demo runs off of my laptop, with a MySQL database running locally, so the performance is fine; it's just not running off of Druid. I wish I could just VPN into Airbnb and show you all the cool dashboards that we have there, and show you app crashes of the Airbnb app in
real time and things like that, but I can't really do that — I would have to have everyone sign some sort of NDA, and I would rather not. This is my screen; sorry, I need to move this over here. That was my desktop, and I think that means I'm going to have to turn to my left while I do this demo. Yeah, I'd rather not figure that out right now.

Okay, cool. So in this demo I'm going to show you the core components of Superset, and hopefully you'll get a sense for how it works and what it can and cannot do. In this case we're connected to this local database — we're not connected to Druid, but you can assume you would connect to Druid the exact same way as what I'm demoing here today. One interesting thing is that you can install Superset on your laptop today, and it's very easy: you can pip install superset, follow a few instructions in our documentation, and quickly you'll get to this screen with the exact same examples and datasets that you can start playing with. You can also quickly connect to your own databases, whether a database is local on your laptop or on one of your servers, provided you have the connection string, and then you can start making reports on your own data.

I'm going to start by showing a few dashboards, because that's the top-down approach: we'll look at dashboards, then at individual charts, and then at something called SQL Lab. So this is a dashboard. You can see that we support your classic types of data visualization, along with some that are a little less common. This is the context where you would move things around, arrange things, and save your dashboard, and we have this thing where you can change the CSS. So this is where you design and consume your dashboard. That's the dashboard view, and I'll show you a few
other dashboards — I guess we have only three to look at. This one here is health data by country over the past 50 or 60 years: this is the world population, and this is the percentage of rural population per country. You can see a bit more of the different data visualizations we can use: box plots, treemaps, bubble charts, Sankey and sunburst diagrams, and just good old tables. We support a bit of basic interactivity on the dashboard, meaning you can apply some filters if you want, and they will be reflected in the different components. I believe that shows more or less the dashboard view.

Now, one thing that's great about Superset is that from any dashboard it's really easy, and very natural, to go from the dashboard view to what we call the Explore view. This Explore view is where the project really started: in the first iteration at the hackathon we had essentially just this one-page app. The idea is that you have all sorts of controls on the left that let you define a chart shown on the right side. This is also where you save what we call slices; you can override them, save them as new slices, add them to existing dashboards, or create dashboards from here — a dashboard being a collection of slices, as in "slice and dice."

What I wanted to show briefly on the left side is that you have these controls: which datasource you're pointing to, which visualization type you want — which way you want to visualize things — and some time filtering. The time filtering is either hard dates, or relative time, so you can write things like "now," "3 years ago," or "15 days ago." You can also change your time grain. That means, say, if you were in Druid looking at app crashes of the Airbnb app, you could look at the past 24 hours, per minute, per region, and eventually you could drill down further — "I want to look per machine; is there a specific machine that's doing poorly?" — or something like that. Here is where you pick your metrics and your main group-bys: instead of grouping by country name or country code, I might want to query by region, and that's all I need to do. So everyone who doesn't speak SQL but knows how to consume a dashboard can go into this Explore view, explore a little further, assemble their own dashboard easily, and share it with people. It's very low-friction and very easy to use.

Then you can see that we have some more basic parameters, like whether to expose the legend, or whether to use the rich tooltip — the kind with a vertical line — or a simpler one. So tons and tons of options, including some things that are a bit more data munging. For example, here we're doing a period ratio, so we're looking at the growth rate over the past ten years; or you can look at a moving average, and things like that. These are functions we added to the product over time.

I wanted to spend a little time on the Explore view because it's kind of the heart of the application and gives you a good sense for how it works. You can imagine that if you change the type of visualization — if I go to a word cloud here and look at, say, a word cloud by country code and hit my query button — then the list of controls you have access to is very different, because each data visualization doesn't take the same inputs. So instead of taking the Tableau approach, we take this approach of adding very custom controls for each visualization. I have five minutes left, so I'm going to have to accelerate. You have more options here too: you can easily download the CSV.
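As a rough illustration of the transforms mentioned above — period ratio and moving average — here is a standalone sketch of the arithmetic. This is not Superset's actual implementation (those transforms run in Superset's Python backend); it just shows what the two options compute:

```python
def moving_average(series, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)                                  # window not full yet
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

def period_ratio(series, lag=1):
    """Each value divided by the value `lag` periods earlier (growth rate)."""
    return [None if i < lag else series[i] / series[i - lag]
            for i in range(len(series))]

# Hypothetical weekly metric, smoothed and expressed as week-over-week growth.
weekly_signups = [100, 110, 121, 133, 121]
print(moving_average(weekly_signups, window=3))
print(period_ratio(weekly_signups, lag=1))
```

With a yearly grain and `lag=1` you get year-over-year growth, which is the "growth rate over the past ten years" use case mentioned in the demo.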
You can also see the query that Superset is running behind the scenes, all that good stuff. And now we have this SQL editor, which is an important component: a classic SQL IDE that is multi-tab, where you can browse your database objects. Here, say, I could look at the dashboard tables and see what's a foreign key, which fields are indexed, that kind of stuff, and run my own queries. One cool thing is that you can actually visualize your result set at the click of a button here, so it ties things together: if you do speak SQL, you can connect to all your internal databases, visualize them, and create dashboards off of your own queries.

Cool, so that ends the demo portion, since I have five minutes left — there's a lot more to it. I have a slide about security, but you can imagine that when I said "enterprise-ready" at the beginning of the presentation, I meant that you can define precisely who has access to which features of the application, and who has access to which databases, schemas, and tables when they use Superset. Cool, so back to the presentation.

All right, this is the fallback if the live demo doesn't work — I can fall back on these slides. Now you're going to see me accelerate quite a bit, because I have a few minutes left. The stack is a Python backend: we use pandas, SQLAlchemy, and Flask App Builder — that is, we use Flask, and this thing called Flask App Builder is a layer on top that does authentication, permission management, and CRUD. Then we use all sorts of state-of-the-art modern JavaScript, like React and Redux. I'm pretty proud of it; I would not have been proud of the state of our JavaScript code base maybe a year ago, but now we're in a very good place.

Security, briefly: we ship with different roles, and if you install Superset at your company you can define your own authentication scheme and your own roles; there are a lot of controls there for you to use. We also have this idea, which I didn't show in the application, of a thin semantic layer. A semantic layer is a bit of extra metadata around the physical tables you want to expose in Superset: it's where you say which fields should be groupable, and where you define calculated columns, calculated user-defined metrics, and things like that. So there's this layer where you can control all of that.

We do a lot of caching. If you set up a caching backend, dashboards will load very quickly, because behind each chart there is a JSON blob, and we cache that blob, so if people run the same queries we serve them from Redis, Memcached, or the filesystem, depending on how you set it up. The UI is really upfront about serving cached data: it shows you very clearly, and it lets you force a refresh if you wish. There's also a cool Airflow operator for warming up your cache: say you load a table used by a certain number of dashboards — you can say, go and warm up the cache for all of the dashboards that use this table. People familiar with Tableau know a Tableau dashboard can take a minute to load, and that's a really bad experience; in our case, provided the cache is warm, all of our dashboards load really fast.

Briefly: Superset is an open source project, and it's really taking off as we speak — 13,000 stars, lots of forks, lots of watchers, and lots of contributions. So we're doing pretty well, and we're hoping this is just the beginning. We're joining the Apache Software Foundation, which means it's going to be called Apache Superset, and it makes it easier for bigger companies to get involved. So today's presentation is in part a call to action: please try Superset, use it, and join the community if you're
interested in doing so. We're announcing a partnership with Hortonworks: they're putting some engineers full-time on the project, because they're committing to Druid and to Superset, so you're going to hear these names more and more over the next few years. And I believe this is almost done — I know I'm running out of time.

So, what's next? We're growing the community, and that's part of the reason why I'm here today. We're joining the Apache Software Foundation. We've got work to do on polishing the UI — we're a team of four engineers at Airbnb, with a PM and a designer, and we want to make Superset super slick, so that there's essentially no friction: you point it at a table and start getting your data, visualizing it, and making charts and dashboards instantly. We want to ship our JavaScript components so that people can use them in their own apps; there's a use case for integrating slice-and-dice capabilities into other data applications, at Airbnb and everywhere else, so we want to make it really easy for people to embed Superset components that talk to the Superset backend — picture JavaScript components that are easy to set up, highly interactive, and live inside your application. And then there's DSL versus semantic layer, which I won't get into the details of.

So check out airbnb/superset — there's documentation with an installation guide. I wonder if anyone did install Superset during this presentation; it's totally possible to install it in a few minutes. And all the action is on GitHub. At Airbnb — or at least for my projects, Airflow and Superset — we take pride in doing all of our work in the open: we don't work behind the curtain and push some code over the fence every once in a while; we do all of our work in the open. So please check it out, and that's it. Do we have time for questions? I
think, [inaudible] sharing the room, I think we have time for questions. Awesome, cool, good thing I sped up. All right, yes, the microphone.

[Audience question, partly inaudible: how does Superset compare to Imply's UI?]

So, the first thing about Imply: originally there was a project called Pivot, out of Imply, or out of Metamarkets, and I believe it was open source originally, but then they lost their rights to it and Yahoo forked it. So I'm not sure what's happening with Pivot; I think Yahoo owns it now. And Imply is selling a UI for Druid, and that is pretty new; as of six months ago they did not have an offering in that space, and I have not seen their tooling. The main thing with Superset is that we can connect to any SQL-speaking database as well as Druid, so the scope of what we can do is broader in some ways. But yeah, if you're thinking about Druid, you should check out Imply's visualization solution. Imply is the company behind Druid; people that used to be at Metamarkets left to create this company called Imply, and they sell services around it. I'm excited to be competing with these guys, and we're also partners with them; I have lunch with them pretty often. So it's cool to see Druid take off and to have open-source offerings in that space.

[Audience question:] Hey, so, really awesome. I guess my question is: there are a lot of well-funded startups out there doing similar products, you know, GoodData, Mode, Domo, and in some cases they're building entire companies around what you've presented here, which is essentially an open-source product that you've created inside of Airbnb. So maybe you can walk us through how this came to be, because it seems like it would take a lot more people than having a
small team. Yeah, I'm really happy to answer this question. The main thing is that none of these solutions could talk to Druid at the time, and I believe that is still the case for Periscope, Mode, and Tableau, and a lot of them did not play well with Presto either; a lot of them work well with Amazon Redshift and maybe MySQL and Postgres. I do have a slide here that I kind of held back for the question period, because I knew I might have to use it. It tells the whole build-versus-buy story, for us at Airbnb and maybe for others too.

So here are some of the elements that made it rational for us to build this. One is that Airbnb is going to be a big company, and we already kind of are, and data is very strategic, so we don't want to depend on vendors. I remember when I was at Facebook: we could not rely on vendors of any form at the scale at which we were operating. Sometimes we've tried to push things with Tableau, and it's just not on their priority list or their roadmap, and their iteration cycles are long; even though we'd like to help them with these things, we cannot. So sometimes we'd rather own some of these things ourselves.

Tableau did not really support any form of Presto or Druid, which are our databases of choice, and that is true of other vendors here. I say Tableau because that was, and is, our main alternative at Airbnb. Tableau extracts don't play well beyond a few hundred million rows; they just break down. And the reality is that at Airbnb we often have tables with multiple billions of records that we need to interact with, and there doesn't seem to be any solution on the Tableau side for us, whereas Druid can really handle that easily. And we don't like the buy-in, the fact that you're locked in and that vendors are going to play games in some cases. We're committed to open source; our whole stack is open source, and we want the visualization layer to be as well. We do all sorts
of deep integration, and the stuff I was talking about, like embedding Superset components into other data analytics applications at Airbnb, is really important to us, and we can't really do that with vendor tools. So that means we need to build all sorts of stuff anyway. And then, personally, I'm an engineer, I like to build stuff, and I got hired at Airbnb; I would much rather build things and be excited about them than support vendor packages that we don't really control and can't really collaborate with. So those are some of the reasons; hopefully that addresses your question. Thank you, Max. All right, thank you very much.
Info
Channel: Data Council
Views: 43,738
Rating: 4.9108634 out of 5
Keywords: analytics, data visualization, data engineering, data pipelines, DataEngConf, druid, real time analytics, apache superset, thrift, kafka, spark streaming, tableau, airbnb superset, data analytics, distributed column store, open source, OLAP
Id: W_Sp4jo1ACg
Length: 43min 35sec (2615 seconds)
Published: Sat Jun 03 2017