Apache Superset - A data visualization platform

Video Statistics and Information

Captions
[Music] Cool, so let me get started here. We're a small group of people, so I'm going to try to make it a little bit upbeat today. I've got some coffee, and I've got a coffee machine behind me too, so I'm going to try to keep a good rhythm here. Can someone tell me how much time is allocated? I'm assuming half an hour, but I'm not sure exactly how much time I should leave for questions. I'll keep an eye on that. Maybe Justin, if you can? 40 minutes, perfect. So I'll try to keep the talk to 30 to 35 minutes, and then we'll have time for questions. It looks like there are ways for people to ask questions in the polls section, so whatever questions you have, at least think about them and write them down. I'm on two tabs now, so I'll try to keep an eye on that, and I'll take questions at the end. I think that's going to be easier.

Cool. So, Apache Superset is a data visualization platform. My name is Max. I'm the original creator of Apache Superset. It was originally called Panoramix, then Caravel, and ultimately it was renamed Apache Superset. Today we have a pretty simple agenda. I'm going to do a demo, which is probably the most important or most interesting part for people who are new to Superset. I will talk about the past, the present, and the future of Superset, and try to touch on everything that's relevant to the crowd here. I'm assuming we have people from other Apache Software Foundation communities too, so I'll talk a little bit about some of the things we're doing around community that can be reproduced and imitated by other communities at Apache.

This first slide is a bit of a self intro, so a little bit about me. My name is Max, and I'm passionate about building data tools. That's what I've been doing in the open, mostly under the umbrella of the Apache Software Foundation, over the past five or six years. I
originally started Apache Airflow at Airbnb in 2014, and a year later I started Apache Superset at Airbnb too, originally as a three-day hackathon project. It took a year or so before it got traction and we staffed it and really sponsored it at Airbnb. Since then I've been focused on Superset full time and have dedicated my professional career to Superset over the past five years.

More recently, a little more than a year and a half ago, I founded a company called Preset, which is a Superset-centric company. We build in and around Apache Superset. We offer Superset as a service, and we also offer service, support, training, all that good stuff, on everything Superset or related in one way or another to the project.

I worked at a bunch of data-forward companies that you see on this slide: Lyft, Airbnb, Facebook, Yahoo, Ubisoft. I want to say I'm super grateful to these companies for sponsoring open source projects and letting people like me grow communities and build open source software. I think it's amazing that we're doing that, and it's amazing that we have the Apache Software Foundation to enable a lot of that, and to enable the transition from company to company while continuing to work on the same project. When I went from Airbnb to Lyft, that was a really smooth transition, and effectively I was still working for Airbnb in many ways through open source. So I'm very grateful for that.

On the left panel I took my profile picture from my GitHub profile page. I'm mistercrunch on GitHub. That's been my home base for a little while, between Slack and GitHub, so I've spent a lot of time collaborating with people directly on GitHub.

Let me just check the comments here to make sure that no one is writing something like "we can't hear you." So it
looks like everything is going fine. If people don't complain in the chat, I'll assume that everything is going well. I see we're about 30 people now. Cool, back to the slides.

So, Apache Superset is a data visualization and data exploration platform. It's very much a full-on web application that you can install in your organization and connect to your databases. It's very much an alternative to proprietary tools like Tableau, Looker, or Chartio. It's a place where you can connect to your databases, perform all sorts of analysis, visualize data, share data, assemble dashboards, and write SQL. Here you see a little bit of the gallery. If you go to our website at superset.apache.org you'll see the same gallery, and it's a little more interactive: you can actually zoom in on all these pictures.

Superset is a dashboard editor, a place where you can create and share dashboards. It's a place where you can explore data and slice and dice, with a web-form, drag-and-drop style interface where you can connect to different datasets, pick different visualizations, apply filters, and pivot. So this whole code-free slice-and-dice, analytics at the speed of thought.

It's also a SQL IDE where you can prepare datasets and write some SQL, to do things that are a little harder to do in the slice-and-dice type of interface: you can join data, create datasets, union datasets, and all this good stuff. And it's also possible from here to pivot into the exploration mode, where you can visualize things and assemble dashboards from SQL.

It's also geospatial. I spent a year at Lyft with an amazing team there working on real-time and geospatial, so we got a good integration with a super cool library out of Uber called deck.gl, and we built all sorts of awesome geospatial features inside Superset. I think I'll be able to show that a little
bit during the demo.

Superset is also many visualizations. We have this long tail of visualizations. It's a bit of a patchwork of D3 and different visualization libraries that are made available under the same umbrella. Recently, and I'll talk some more about that, we formalized an interface for visualization plugins, so that's a way you can extend Superset with plugins that suit your needs. I'll get back to plugins in a moment.

Superset is also a full-fledged web application, so that means you can pick your authentication scheme, whether you're interested in things like LDAP, OpenID, or username and password, and set up your own role-based access control. We have a very atomic, maybe too atomic, permissions scheme where you can really define what different people can do in the application and what data and datasets they have access to through Superset.

It is also cloud native. That means you can run it on the database of your choice for the metadata database, whether it's Postgres or Postgres or Postgres, or maybe MySQL. But yeah, you can pick the different pieces of infrastructure you want to run Superset on: which web server you want to run, which message queue, and which caching mechanism you want to use for the datasets.

It is extensible through plugins. What I was going to say here is that with these plugins you can add new visualizations, but you can also build little applications as plugins, which is really interesting, so you can do things that are more sophisticated. Now that we've formalized the interface, we're really curious to see what the community is going to come up with. I think we should be seeing more specialized visualizations, maybe genomics, or things that are very industry specific. So we're excited to see the ecosystem of plugins growing.

It is also increasingly, and that one is a
little bit more aspirational, a platform to build data products. That means Superset exposes its building blocks to do other things with, including if you want to tap into the Superset backend, which knows how to talk to analytics databases and knows how to handle things like caching, auditing, authentication, and permissions. So if you're building data products internally, you might want to tap into some of the layers and building blocks that Superset offers, including a set of React and visualization components.

Superset is also a thriving community. This is the insights page from our GitHub, and it shows that we've merged just about 200 pull requests over the past month, which is crazy. The pace at which we're accepting contributions is incredible. We have all sorts of people from different organizations, most notably, I would say, Airbnb, Dropbox, and Lyft. At Preset we're an increasingly large contributor to the project too. And we have a lot of stars on GitHub, and everything is through the roof in terms of how fast-paced this community is and the pace at which it's growing.

Here I wanted to talk about full-stack open source analytics. I have mostly ASF logos here, but I have Presto too, which is part of two software foundations now, which is really confusing. There's a bit of a fork in the Presto universe, but I won't get into that. I just wanted to bring up the idea that there's now more on offer in the data space with open source, and Apache Software Foundation sponsored or hosted projects are becoming an emerging stack that covers most of people's needs. Historically you needed a lot more duct tape and chicken wire to get all that stuff working well together, but more
and more these things are playing well with each other, and they also play well with the cloud vendors and the things that are more up-and-coming from the SaaS world, things like, in our case, BigQuery and Snowflake, that are growing fast. So these pieces of the puzzle are all coming together in a really interesting way.

One thing I want to talk about a little bit today is open source up the stack. I have a slide here that doesn't say much, but I wanted to touch on that topic, which I think is super interesting, relevant, and important to us, because we're a slightly different community at Apache: we're building a web application, tools that are heavily UX and UI driven, as opposed to a lot of the projects around us that are more infrastructure. We've seen open source really take over and win the infrastructure layers: things like databases, compute engines, orchestration systems. Everyone wants open source down the stack, and I've been wondering why we haven't seen as much open source up the stack. I've got a few points I want to touch upon today before I move on to the demo.

To me, the value proposition of open source is, first, that it's a better way to write software and to collaborate on software. I think that carries up the stack nicely. The other idea is that it's a better way to distribute software too. As someone who consumes software, you can just git clone, as opposed to having to deal with a vendor, attend a webinar, and get spammed to even see a screenshot of the software. So I think these carry very nicely up the stack.

Some challenges in the past that prevented open source from succeeding up the stack: one is that front-end engineering was not necessarily real engineering. So the
toolkits and the jQuery era were rough, and front ends were patched together and hard to collaborate on. Front-end engineering has come a super long way over the past five or six years with the rise of npm, ES6, React, and TypeScript. These things are making front-end engineering real, awesome engineering that people can truly collaborate on. So that's one big shift.

The other one is integration. I think open source has been really good at taking engineers and making them collaborate, but we've been bad at integrating product managers and designers. Designers are really the key: the function of design is super important to building UI and UX, and open source, GitHub, and our communities have been kind of foreign to designers. At Preset we're really interested in trying to solve that, and within our community too we're trying to find ways to integrate design as a core function, and not only offer design as a service in the community but also try to onboard more designers from other organizations and get them to participate.

Cool. I'm looking again at the chat to see if there's anything going wrong or if my internet's still up, and it seems like things are good, so I'm going to keep going.

I wanted to talk a little bit about our published roadmap. Recently we published SIP-53, Superset Improvement Proposal 53, which is our proposal to create a public roadmap for Superset. We looked around at different communities to see how others are doing that and found that very few communities publish roadmaps, so we thought it was a good opportunity to innovate a little. You can find out more about the exact mechanics we're proposing. This is an ongoing discussion today, so we're going to be voting on it over the next few days, but the general idea
is that if any contributor, committer, or PMC member wants to put forward roadmap items to bring into the global roadmap, they're welcome to do so. The minimum requirement is having a title, a defined scope, a rough timeline, and an owner for the roadmap item. Everyone within the Superset community is welcome to contribute to this roadmap, and we find that it's an opportunity to create more collaboration in the community by knowing where the project is going. For the same reason that you need an internal roadmap for the products you develop, I think roadmaps are very helpful at the community level too. So I invite people who are interested to check out our public roadmap. You see the GitHub location here on the upper left, under apache-superset/superset-roadmap. That's still in a proposal phase, and we welcome people to give feedback on the process and on the roadmap itself.

Community. I'm going to go quickly through this. We're already 20 minutes in, and I want to save a good 10 minutes for the demo. I would characterize what we're doing as most of the things that are expected of a large, growing community. GitHub is very much the central hub. We have a thriving Slack that's always on. We hold committers' meetings; we try to accommodate people in different time zones, and we welcome everyone to these meetings, which are usually on Thursdays or Fridays. If you're interested, you can tap into our Slack and get hooked up with these meetings.

We started a champions program. We found it was difficult for people to become committers. The path to committerhood at Apache can be difficult depending on the community, and we felt that in our community we needed an accelerated path, and we wanted to offer more direct support for people who say, "I want to contribute, but I don't know where to start." And
I think really often at Apache this, I would call it fostering committers, usually happens within sponsoring organizations, and it's hard for new organizations that want to get involved; people really need to push. There's this thing in open source where a lot of people feel impostor syndrome. They feel like this is a big stage that's hard to get onto and that people are going to judge them. So we really wanted to create an on-ramp for people who want to contribute more. You can learn more about the champions program on our Slack, I would say; I believe there's a channel dedicated to it.

We send a monthly newsletter, we've redesigned our website, and we're doing regular meetups. So we're doing all the things that great, large open source communities are doing nowadays. To find links to these resources, I would say go to superset.apache.org. There is a community page that should have links to all of these resources and more.

Cool, so, demo. Let's see how this goes. Last time I think I was running locally. This time around I'm not running locally, and I believe I'm going to have to come here and share another screen. So, Chrome tab. I'm sharing a different tab, and it should be right here. So now I'm going to ask the question of the hour, I was going to say, of the past six months: do you see my screen? Yes? Great. All right, now I can keep going.

So now I've landed on the welcome page of Superset. This is a list of dashboards. I'll just pop open a few dashboards, starting with our community metrics. This one I built this weekend on top of BigQuery open data, just for fun. I think I'll start with just these two dashboards. Oh, and I popped them open in different tabs, so that's not going to work. I'll just not open them in different tabs and navigate instead. All right, so this is our Superset
community metrics dashboard. I know that Hopin takes a lot of resources here, so I don't know how well that's going to go, and I don't think I have a fallback. Do I? All right, it looks like we're loading slowly. Usually these dashboards load pretty quickly. Let me just look at top in my console to see how my CPU is doing. Google Chrome is at, you know, 200 percent. All right, so we're good. I think that should go pretty well. Okay, maybe. Of course I tested this and it worked very well before hopping on.

So let me do a little bit of a tour here. You can see that we have these dynamic filters that we can apply, so we have these interactive dashboards. Digging into the community metrics, you can see the different organizations' contributions, the number of PRs to date, stargazers, and the recent leaderboard of people with how many PRs, comments, and reactions they have. We can see who has had the most reactions or comments recently. Let's see here, it looks like I use reactions a lot personally. You can see the history of contributions, different people, top issues. So it's just a dashboard, and what you would expect. In this particular dashboard we have a bit more interaction, where we can look at specific people. Here, let me just use myself, and you can see activity from different people. So that gives you an idea of the kind of dashboard you can build here.

Let me show another dashboard or two. The other one maybe I'll get into here, I'll pop this one open quickly, so that's one using BigQuery. Oh, by the way, going back to the community dashboard: I published a blog post on the Preset blog on how to get to this exact dashboard. So if you're looking for a short project, you can pick up the notebook that I published, run the same things, and load the
information from your own communities into a Superset dashboard and make it work. I would normally navigate to the blog here to show you. Maybe I'll point to a few blog posts while I'm here, and then I'll get back to this afterwards. Okay, so let me get the dog out of here. All right, Luna, get out of here. This is what happens when you run a conference from your home.

So here I wanted to show that we have a blog post that shares this open source dashboard, so you can do the same by calling the GitHub API and loading this dashboard template into your own local Superset instance. Cool.

Now, before going back to the Superset demo, I wanted to point out too that if you're interested in visualization plugins, we have this awesome, very detailed post on how to put together plugins. That's the one here. So maybe you're looking for a weekend project or a full-on little project: you can create your own visualization, get to hello world very quickly, and then pick your own npm libraries and build your own visualization on top of Superset.

Cool, back to the demo. I should be able to just go here. Nice. I'll flash one more dashboard, a geospatial one. Let's see here. We're looking at random data over San Francisco, but you can see that we have this nice 3D engine to run queries. Now going to a chart that's maybe more typical of the kind of stuff you would do here. Let's see, what am I going to show you? Maybe I'll go into the World Bank data here. So we've been in the dashboard so far. I didn't show you the editor, but you can imagine I can click the edit button and move things around. Now I'm going into this, and it pops open a new window, which can be interesting. Let me just share my whole browser: application window sharing, and I believe
this is Chrome here. All right, look at that, we have the tunnel going on. Do people see my screen as I switch tabs? [Music] I would love to get a nice "yep." All right, cool.

So now I am in more of a slice-and-dice view. This is the place where you point to a certain dataset. We're pointing to a dataset here; we can create some calculated columns, we can see the columns in that table, and we can switch to a different visualization type, which I'll probably do in a moment. Here I wanted to show how you can easily point to different metrics and apply some filters. Say in this case I'm interested in adding a filter on region: let's only look at North America, and we should only see North American countries. Or I can remove this filter and say I would like to group by region instead, and visualize this. And maybe I want an area chart instead, so I rerun the query.

Maybe I'm interested in looking at what SQL was generated here. Noticing that the SQL is a little complicated, I'm like, why is there a subquery here? It's because of this little series limit, where you can say "I only want the top N time series." We do that with a fancy little subquery. There are also all sorts of options here where you can label your axes or use a different color scheme. Here, I don't know, I could write "time" and remove the legend, for instance, and publish this either to an existing dashboard or create a new one from here, or create a new chart from here. So that shows you a little bit of Explore. A common thing is that you'd probably switch to a bar chart or something like that. So all these interactions are pretty easy and at the tip of your fingers.

The third part I'm going to show you is SQL Lab. I showed you how you can view the query from here, but you can
go deeper. We've gone from the dashboard to the slice-and-dice view, where presumably you can answer some of your own questions and create your own visualizations, and from here you can go deeper into what we call SQL Lab, which is our SQL IDE. From here you can navigate schemas. We were looking at a table, so let's just look at this table because it's at my fingertips. You can get some sample data for the table, and here you have a nice little editor with autocomplete, where you're able to just write the SQL you need. There are some more advanced features. I think it's disabled in this environment, but you can schedule, share, and save. You can also parameterize your SQL: if you're working on, say, an Airflow script that uses parameters like time bounds, you can work on parameterized SQL right in here.

So this is the very quick tour. Here you can manage your connections to different databases, manage your datasets, upload CSVs, create charts and dashboards of course, and write SQL. And here there's a little more around data access, templates, and other higher-level concepts that you can use and reuse within the application. So that's the super short crash course on what Superset is about.

I'm going to head back to my slides. I have a few more slides before we jump into questions, and I believe I'm at what, 35 minutes? So I have a few minutes.

We kicked off graduation recently. I think we're ready, and we've been ready, for graduation, so we started the process. This is imminent; it's a matter of doing the paperwork, I believe, and we're hoping we're going to graduate very soon.

We were looking for a new visualization library, to have more consistency and a well-maintained charting library that we use consistently for the core visualizations.
We found ECharts, which is another awesome Apache Software Foundation hosted project that has a lot of traction now too. We connected with that team, and it's really great for us to find another community that has a similar governance scheme, one we're very familiar with, that we know we can collaborate with, and whose way of operating we understand. We love the guarantees. Sometimes it's hard to offer the guarantees we need to offer to be part of the Apache Software Foundation, but it's really great to find projects that offer these guarantees too.

We're working on Superset 1.0, and it looks like the timeline is the end of the year or so. There's a lot to do around quality, polish, and usability. Visualization plugins have shipped, but the interface really needs to settle, and 1.0 is where we're going to support the interface moving forward and make sure that we maintain backward compatibility. Security is a big theme, as are import and export, alerts, and scheduled deliveries of reports, charts, and dashboards. And there's just so much more. There's so much product surface here that you really have to take a look at the roadmap I mentioned earlier to see how much Superset 1.0 covers.

With that, I believe that is it. You should be able to find all the resources I mentioned through superset.apache.org. We're really excited to welcome more people into our community. We want this community to be the most welcoming around, so please reach out. We're always happy to onboard more people and find a path forward for people to get more involved in our community.

Cool, so that's what I have. Does that leave us any time for questions? Yes, five minutes. Where do I go? Do I go to the polls? No, that's different. So for questions, I suggest people maybe copy-paste questions in
the chat. All right, awesome.

"Do you think Superset is better suited to Druid or Pinot for the best slicing and dicing?" I've been talking about analytics at the speed of thought. I think it's really awesome when you can be engaged with a dataset, interact with it, ask questions, and get answers as you go. For that you either need a small dataset or a database that fits the size of your dataset. If you have a really large dataset and you want really fast answers, you'll need a fast database, or you'll need a bunch of summary tables that have been prepared for you. For a large dataset, as you hit the hundreds of millions or billions of rows, having the kind of in-memory reverse bitmap indexes that Druid and Pinot offer is great. But the datasets we expose, or the ones I've been playing with, say the COVID dataset or a lot of the toy datasets out there, are pretty small. They fit in memory, and they're pretty interactive even with a small Postgres database. So it really depends on the size of your data and what you need. But it's never a great experience to work with a tool like Superset when the database is just spinning and taking 30 seconds to a minute. The flow of analysis breaks down quickly.

Let's see, I'm going to try to touch on Druid. Our founding story, the creation story of Superset, was bound up with Druid. Originally Superset was designed as one of the first open source Druid UIs out there. That was before they released something called Pivot, which they've pulled since then. Originally we were really coupled with Druid. Very early in the history of the project we moved away from that, or not necessarily away: we still work very well with Druid, but we work
very well with just about any SQL-speaking database. We work very well with all the popular choices people are making these days: things like BigQuery, Snowflake, Presto; even Redshift is still fairly popular; Postgres, MySQL. We have a list of supported databases, and it's just growing, but pretty much any analytics database you can think of, we can connect to. Cool.

Let's see: "Will there be first-class DB support?" That's a recurring question that comes up all the time. People say, "We have these big document stores or key-value stores." Even in the Apache ecosystem you have things like Cassandra and HBase, which are more key-value stores with key-range scan capabilities. Some of these offer a SQL engine on top. For Superset, the primary way we connect to things is through Python abstractions. One is called DB-API, which is the standard for writing a database driver in Python; that includes things like ODBC and JDBC. Then we need a SQLAlchemy dialect, which basically adds a little metadata about the SQL dialect, like how this specific engine does certain things such as quoting and time functions. So SQL is what we support very, very well. And then there's a custom native Druid connector, which means you can write pretty much any custom native connector, but it's pretty involved to do so. There are all sorts of things that are possible here, but you really need an engine that can do grouping and filtering and the kinds of things that SQL does very well.

Cool. I'm looking for more questions here, and there are also answers already. Let's see, we still have five minutes or so, I believe.

"ECharts also does maps. Do you see a split or
overlap in geospatial?" We have not really evaluated the ECharts map support, and we've put a lot of work into deck.gl, so I think we're fairly committed to deck.gl at this point. But the beauty of the plugin ecosystem is that people can build their own plugin packs. Right now there's a bit of a question of what the core visualizations are that the core Superset community wants to manage, maintain, and make first-class citizens. Beyond that, we're assuming there's going to be a lot of innovation around plugins, and that people are going to build all sorts of plugins, with different levels of support and quality, some maintained better than others. We're very open to, say, creating a central page that federates all the plugins, links to the different plugins, and lets people decide which ones they want to use in their environment.

All right, it looks like the next session will be starting soon, so thanks, everyone. We're available through all sorts of channels, so find us. We're pretty easy to find, and we'd love to engage with everyone who will listen about Superset, databases, and Apache. We're excited. All right, thank you everyone. I'm going to click the leave button now, so bye.
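
Two technical points from the talk can be sketched together: the "series limit" that keeps only the top N time series by way of a subquery, and the remark that Superset reaches databases through Python's DB-API. The sketch below is only an illustration under stated assumptions. It uses the standard library's sqlite3 module as the DB-API driver, and the table name (facts), its columns, the sample rows, and the helper top_n_series are all hypothetical. It is not the SQL that Superset actually generates.

```python
import sqlite3


def top_n_series(conn, n):
    """Return (region, year, total) rows, keeping only the n largest regions.

    Hypothetical helper: the inner query ranks regions by their overall
    total, and the outer query groups the detail rows of those regions.
    """
    sql = """
        SELECT t.region, t.year, SUM(t.population) AS total
        FROM facts AS t
        JOIN (
            SELECT region
            FROM facts
            GROUP BY region
            ORDER BY SUM(population) DESC
            LIMIT ?
        ) AS top ON top.region = t.region
        GROUP BY t.region, t.year
        ORDER BY t.region, t.year
    """
    cur = conn.cursor()        # standard DB-API cursor
    cur.execute(sql, (n,))     # n is bound as a query parameter
    return cur.fetchall()


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (region TEXT, year INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("North America", 2000, 300), ("North America", 2010, 340),
        ("Europe", 2000, 500), ("Europe", 2010, 510),
        ("Oceania", 2000, 30), ("Oceania", 2010, 35),
    ],
)

rows = top_n_series(conn, 2)
for row in rows:
    print(row)
```

With n set to 2, the smallest region (Oceania in this made-up data) drops out of the result, which is the effect described in the demo: the chart shows only the largest series. The same grouping and filtering is what the Q&A says any engine needs to support to work well behind a tool like this.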
Info
Channel: TheApacheFoundation
Views: 10,943
Keywords: apache, asf, apache software foundation, open source, software, floss, free software, apachecon, acah2020, The Apache Software Foundation
Id: VEuBZqdSoHk
Length: 40min 2sec (2402 seconds)
Published: Fri Oct 16 2020