Democratizing Data at Airbnb — Chris Williams and John Bodley, Airbnb

Captions
So my name is John, I'm a software engineer at Airbnb, and firstly I need to apologize for the American spelling in the title. Together with Chris, I'd like to talk about the Data Portal. It's an internal data tool that we're developing to help with data discovery and decision-making, and specifically we're going to talk about how we modeled and engineered the solution around Neo4j.

Firstly, what is Airbnb? Airbnb is an online marketplace that connects people to unique travel experiences. Both Chris and myself work on an internal data tools team, where our job is to help ensure that Airbnb makes data-informed business decisions.

So what's the problem here? The problem described today, which the Data Portal project attempts to address, is the proliferation of tribal knowledge. Relying on tribal knowledge often stifles productivity, and as Airbnb grows, the challenge grows around the volume, the complexity, and the obscurity of data. In a large and complex organization, faced with a sea of data resources, users often struggle to find the right data. We run an employee survey, and we consistently score really poorly on the question "the information I need to do my job is easy to find." Data is often siloed, it's inaccessible, and it lacks context. I'm a recovering data scientist who wants to democratize data and provide context wherever possible.

At Airbnb we have over 200,000 tables in a Hive data warehouse that's spread across multiple clusters. When I joined Airbnb last year, it wasn't evident how you could actually find the right table. So we built a prototype giving users the ability to search over the metadata. We quickly realized we were somewhat myopic in our thinking, and that we should explore well beyond just data tables. At Airbnb we have over 10,000 Superset charts and dashboards (Superset is our open-source data analytics platform), and we have in excess of 6,000 experiments and metrics.
We have over 6,000 Tableau workbooks and charts, and over 1,500 knowledge posts (the Knowledge Repo is an open-source knowledge-sharing platform that data scientists use to share their results), as well as a litany of other data types. But most importantly, there are over 3,500 employees at Airbnb, and I really can't stress enough how valuable people are as a data resource: the person who is the point of contact for a resource is just as pertinent as the resource itself. To further complicate matters, we're dispersed geographically, with over 20 offices worldwide.

So what's the mandate of the Data Portal? Quite simply, it's to democratize data, and to empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust.

Let's work through the concept. At a very high level, we just want to search for something. The question is how to frame our data in a meaningful way for searching, and we have to be very cognizant of ranking and relevance as well. It should be fairly evident that what we feed into our search indices is all these data resources and their associated metadata. But the question for us is: are we missing something here? And we feel yes: our ecosystem is a graph. The data resources are nodes, and the connectivity is the edges, or relationships. The relationships provide the necessary linkages between our siloed data components, and the ability to understand the entire data ecosystem, all the way from logging to consumption. Relationships are extremely pertinent for us: knowing who created or consumed a resource, in this case a Tableau chart, is just as valuable as the resource itself. We gather information from a plethora of disjoint tools, so it would be really great if we could provide additional context.

Let's walk through a high-level example.
Leveraging event logs, we discover that a user consumes a Tableau chart, which by itself lacks context. Piecing things together, we discover that the chart is from a Tableau workbook. The direction of this edge is somewhat ambiguous, but we prefer the many-to-one direction from both a flow and a relevancy perspective. Digging a little further, both these resources were created by another user. Now we have an indirect relationship between these users, and the creator is a really good point of contact who can provide additional context around the data. We then discover that the workbook was derived from some aggregated table that lives in Hive, exposing the underlying data to the user. Then we parse the Hive audit logs, and we determine that this table is itself derived from another table, which provides us with the raw underlying data. And finally, both these tables are associated with the same Hive schema, which may provide additional context with regard to the nature of the data.

So how do we go about constructing this graph? We leverage all these data sources, and we build a graph comprising the nodes and relationships, and this resides in Hive. We pull from a number of different places. Hive is effectively our persistent data store, with a table schema that mimics Neo4j: we have the notion of labels and properties. At Airbnb we pull from over six databases, which come through scrapes that land in Hive; we hit a number of APIs, be that Google or Slack; and we have some logging frameworks. That all goes into an Airflow DAG (Airflow is an open-source workflow tool that was also developed at Airbnb), and this workflow is run every day. The graph is then left to soak, to prevent what we call graph flickering.
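As a rough illustration of what the nightly build produces, each node row carries a set of labels, a locally scoped ID, and a property map, and each relationship row links two such keys. This is a minimal sketch; the field names, labels, and example values here are my assumptions, not Airbnb's actual Hive schema:

```python
# Sketch of the nightly graph build: scraped rows become node and
# relationship records whose shape mimics Neo4j labels/properties.
# All names below are hypothetical, not Airbnb's real schema.

def build_node(labels, local_id, properties):
    """A node is globally identified by its set of labels plus a scoped ID."""
    return {"labels": sorted(labels), "id": local_id, "properties": properties}

def build_relationship(src, dst, rel_type, state):
    """state is 'persistent' (from DB scrapes) or 'transient' (from event logs)."""
    assert state in ("persistent", "transient")
    return {"src": src, "dst": dst, "type": rel_type, "state": state}

# Example rows, as they might land in Hive after a scrape:
user = build_node(["Entity", "User"], "jdoe", {"team": "Data Tools"})
chart = build_node(["Entity", "Tableau", "Chart"], "42", {"name": "Bookings"})
created = build_relationship(("User", "jdoe"), ("Chart", "42"),
                             "CREATED", state="persistent")
```

The `state` field matters for the flickering problem discussed next.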
So how do we address this flickering issue? Our raw graph is somewhat time-agnostic: it merely represents the most recent snapshot of the ecosystem. The issue is that certain types of relationships are sporadic in nature, and that causes the graph to flicker. We resolve this by introducing the notion of relationship state, and we have two such states: persistent and transient. Persistent relationships represent a snapshot in time of a system, such as the result of a database scrape; in this example, the CREATED relationship will persist forever. Transient relationships, on the other hand, represent events that are sporadic in nature; in this example, the CONSUMED relationship would only exist on certain days, which would cause the graph to flicker. The remedy is that we simply expand the time period from one day to a trailing 28-day window, which acts as a smoothing function. This ensures the graph doesn't flicker, but also that we capture only recent, and thus relevant, consumption information in our graph.

Let's talk about how data ends up in Neo4j and downstream systems. This is a very simplified view of our data path, which is itself a graph, and given that relationships have parity with nodes, it's pertinent that we also discuss the conduits which connect these systems. Every day the data starts off in Hive, and we use Airflow to push it to Python. In Python the graph is represented as a NetworkX object, and from this we compute a weighted PageRank on the graph, which helps a lot to improve search ranking. The data is then pushed to Neo4j via the Neo4j driver. We all know what Neo4j is, but we have to be cognizant of how we do a merge here: the graph database is live, and every day we're pushing updates from Hive into it, so that's something we have to be quite cautious about. From here it forks in two directions: the nodes get pushed into Elasticsearch via a GraphAware plugin, which is based on transaction hooks.
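The trailing 28-day window and the weighted PageRank can be sketched without any dependencies. The talk's pipeline uses NetworkX for PageRank; the simplified power iteration below is my own stand-in, and the event and edge data are invented for illustration:

```python
from datetime import date, timedelta

def recent(events, today, days=28):
    """Keep only transient events inside a trailing window (smoothing)."""
    cutoff = today - timedelta(days=days)
    return [e for e in events if e["day"] > cutoff]

def weighted_pagerank(edges, damping=0.85, iters=50):
    """edges: {(src, dst): weight}. Returns node -> score (sums to ~1)."""
    nodes = {n for edge in edges for n in edge}
    out_weight = {n: 0.0 for n in nodes}
    for (s, _), w in edges.items():
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for (s, d), w in edges.items():
            nxt[d] += damping * rank[s] * w / out_weight[s]
        # Dangling nodes (no out-edges) redistribute their rank uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        for n in nodes:
            nxt[n] += damping * dangling / len(nodes)
        rank = nxt
    return rank

events = [{"day": date(2017, 5, 1)}, {"day": date(2017, 5, 29)}]
windowed = recent(events, today=date(2017, 5, 30))  # only the May 29 event survives
edges = {("u1", "chart"): 3.0, ("u2", "chart"): 1.0, ("chart", "workbook"): 1.0}
ranks = weighted_pagerank(edges)  # heavily consumed nodes score higher
```

Edge weights here stand in for consumption counts inside the window, which is what makes the ranking favor recently popular resources.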
From there, Elasticsearch serves as our search engine, and this is a fairly common technology at Airbnb. Finally we use Flask, a lightweight Python web framework, which again is used in our other data tools. Results of Elasticsearch queries are fetched by the web server, and additionally results of Neo4j queries pertaining to connectivity are fetched by the web server via the Neo4j driver.

So why did we choose Neo4j as our graph database? There are four main reasons. One, it felt logical: our data represents a graph, so it felt logical to use a graph database to store it. Two, it's nimble: we wanted a really fast, performant system. Three, it's popular: it's rated the world's number one graph database, and the Community Edition is free, which was super helpful for exploring and prototyping. And four, it integrates really well with Python and Elasticsearch, the existing technologies we wanted to leverage.

There's a lovely symbiotic relationship between Elasticsearch and Neo4j, courtesy of some GraphAware plugins. There are two such plugins. The first one lives in Neo4j, and what it does is asynchronously replicate data from Neo4j to Elasticsearch. That means we don't need to actively manage our Elasticsearch cluster: all our data is contained and persisted in Neo4j, which we use as the source of truth. The second one is a plugin that lives in Elasticsearch and allows Elasticsearch to consult the Neo4j database during a search, which lets us enrich search rankings by leveraging the graph topology. As an example, we could sort by when something was created, which is a property on the relationship, or by most consumed, for which we actually have to explore the topology of the graph.
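To make the enrichment idea concrete, here is a toy re-ranker that blends a text-relevance score from the search engine with a topology-derived score such as PageRank. This is my own illustration of the principle, not the GraphAware plugin's actual scoring function, and the weighting scheme is an assumption:

```python
def enrich_ranking(text_scores, graph_scores, alpha=0.7):
    """Blend search-engine relevance with a graph signal (e.g. PageRank).

    text_scores:  {doc_id: relevance score from the full-text index}
    graph_scores: {doc_id: topology score, e.g. consumption-weighted PageRank}
    Returns doc_ids ordered best-first; alpha weights text vs. graph.
    """
    max_g = max(graph_scores.values()) or 1.0  # normalize the graph signal
    def score(doc):
        return alpha * text_scores[doc] + (1 - alpha) * graph_scores.get(doc, 0.0) / max_g
    return sorted(text_scores, key=score, reverse=True)

# Two dashboards tie on text relevance; the more-consumed one wins.
order = enrich_ranking({"a": 1.0, "b": 1.0}, {"a": 0.1, "b": 0.9})
```

The design point is that the full-text index alone can't break ties between similarly named resources; the graph topology can.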
Now let's look at how we represent our data model. We defined a node label hierarchy, which enables us to organize data in both Neo4j and Hive. The top-level Entity label represents a base abstract node type, and I'll explain the relevancy of that in a moment. Let's walk through a few examples. Our schema is created in such a way that nodes are globally unique in the database by combining the set of labels with a locally scoped ID property. In the first example we have a user, who is keyed by their LDAP username; we have a table that's keyed by the table name; and finally we have a Tableau chart that's keyed by the corresponding ID inside the Tableau database.

The graph queries are heavily leveraged in the UI, and they need to be incredibly fast. We can efficiently match queries by defining label indices on the ID property, which we leverage for fast access; here we just explicitly force the use of the index, because we're using multiple labels. Ideally, though, we'd love a more abstract representation of the graph, moving from local to global uniqueness. How do we go about doing that? We leverage another GraphAware plugin, the UUID plugin, which assigns a global UUID to each newly created entity that cannot be mutated in any way. This gives us global uniqueness, and now we can refer to any entity in the graph by this one unique uuid property in addition to the Entity label. This also lets us use parameterized queries, which lead to faster query execution times, and that's especially relevant for bulk loads: every day we're doing a bulk load of data, and we need that to be really performant. So in the same example as before, now simplified, we can match an entity purely using this uuid property, and it's global.
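The query patterns described above might look roughly like the following Cypher, shown here as Python strings. The label names, property names, and batch shape are illustrative assumptions rather than Airbnb's actual schema:

```python
# Illustrative Cypher for the querying patterns above; label and
# property names are assumptions, not Airbnb's actual schema.

# Label index on the locally scoped ID property, for fast lookups:
INDEX_DDL = "CREATE INDEX ON :User(id)"

# Local uniqueness: labels + scoped ID, explicitly forcing the index
# because the node carries multiple labels:
BY_LABELS_AND_ID = (
    "MATCH (n:Entity:User {id: $id}) USING INDEX n:User(id) RETURN n"
)

# Global uniqueness via the UUID plugin -- one parameterized query
# shape works for every entity type:
BY_UUID = "MATCH (n:Entity {uuid: $uuid}) RETURN n"

# Parameterized bulk load: one statement, a whole batch of rows as
# a parameter, merged idempotently into the live database.
BULK_MERGE = (
    "UNWIND $rows AS row "
    "MERGE (n:Entity {uuid: row.uuid}) "
    "SET n += row.properties"
)

rows = [{"uuid": "ab-1", "properties": {"name": "users_agg"}},
        {"uuid": "cd-2", "properties": {"name": "bookings"}}]
# with driver.session() as session:   # using the official neo4j driver
#     session.run(BULK_MERGE, rows=rows)
```

MERGE rather than CREATE is what makes the nightly push safe against a live database: existing nodes are updated in place instead of duplicated.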
We also have a RESTful API, and its endpoints take the following forms. With the first, you can match a node based on its labels and ID, which is useful if you have a slug-type URL. With the second, you can match a node purely by its UUID. And with the third, you can match, say, a CREATED relationship by leveraging the two UUIDs of its endpoints.

This is a good segue to hand over to Chris, who's going to talk about the front end, which leverages this API.

[Chris] Can you hear me? Cool. All right, as John said, I'm Chris, I'm a data visualization engineer at Airbnb, and I work on data-rich user interfaces; I'll be talking about the front end. Now that John has introduced the fascinating data resource graph that we have on the backend, which I think is interesting in and of itself, I'll describe how we enable Airbnb employees to harness its power through the web application.

First, I want to start off by saying that the backends of data tools are often so complex that the design of the front end is an afterthought. This should never be the case; in fact, the complexity and data density of these tools make intentional design even more critical. I'm sure everyone here appreciates the friendly Neo4j UI. One of our project goals is to help build trust in data: as users encounter painful or buggy interactions, these chip away at their trust in your tool, whereas a delightful data product can build trust and confidence. So for the Data Portal we decided to embrace a product mindset from the start and ensure a thoughtful user interface and experience. To do this, we interviewed users across the company to assess needs and pain points around data resources and tribal knowledge, and from these interviews three overall user personas emerged; I'll point out that they span data literacy levels and many different use cases. The first of these we'll call Daphne Data. She's a technical data power user, what you might think of as a tribal knowledge holder: she's in the trenches tracing data lineage, but she also spends a lot of time pointing others to these resources.
Next we have our manager persona. She's perhaps less data-literate, but she still needs to keep tabs on her team's resources, share them with others, and stay up to date with the other teams she interacts with. Finally we have Nathan New: maybe he's new to Airbnb, maybe he's working with a new team, or maybe he's new to data. In any case, he has no clue what's going on and needs to get ramped up really quickly.

With these personas in mind, we built out the front end of the Data Portal to support data exploration, discovery, and trust through a variety of product features, which I'll describe in more detail in the next slides. At a high level, these broadly include search, more in-depth resource detail and metadata exploration, and then user-centric, team-centric, and company-centric data. I also want to point out that we're not really allowing freeform exploration of our graph the way the Neo4j UI does. This is a highly curated view of the graph, which attempts to provide utility while maintaining guardrails where necessary, since less data-literate employees will jump in here.

The Data Portal is primarily a data resource search engine, so clearly it has to have pretty killer search functionality. From the screen capture you can tell that we tried to embrace a clean and minimalistic design; this aesthetic allows us to maintain clarity despite all the data content, which adds a lot of complexity on its own. We also tried to make the app feel really fast and snappy, because slow interactions generally disincentivize exploration. I'll point out a couple of other aspects of our search experience. At the top you can see that we have search filters somewhat analogous to Google's: rather than images, news, and videos, we have things like data resources, charts, groups or teams, and people. The search cards have a hierarchy of information, and the overall goal is to provide enough context that users can quickly gauge the relevancy of a result.
So we show things like the name, the type, highlighted search terms, the owner, when it was last updated, the number of views, and so on. We also try to show the top consumers of any given result, which is just another way to surface relationships and provide more context.

Continuing with this flow, after a search result users typically want to explore a resource in greater detail, and for this we have content pages. This is an example of a Hive table content page. At the top we have a description, a link to the external resource, and social features such as favoriting and pinning (users can pin a resource to their team pages, which I'll describe more in a second). Below that we have metadata about the data resource: who created it, when it was last updated, who consumes it, and so on. As John said, the relationships between nodes provide context, and this isn't available in any of our other siloed data tools; it's something that makes the Data Portal really unique, tying this entire ecosystem together. Another way to surface graph relationships is through related content, the direct connections to this resource; for a data table, this could be the charts or dashboards which directly pull from it. You'll also notice we have a lot of links. The idea is that we want to promote exploration: you can see who made this resource, then find out what other resources they work on, see if they're maybe flirting with you by favoriting your resources, things like that.

I'll also highlight some of the features we built specifically for exploring data tables. One is that you can explore column details and value distributions for any table. Additionally, tracing data lineage is important, so we allow users to explore both the parent tables and the child tables of any given table.
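Lineage amounts to walking derivation edges in both directions. A small in-memory sketch, where the direction convention and the sample table names are my assumptions rather than Airbnb's schema:

```python
# Derivation edges, as they might be extracted from Hive audit logs:
# derived_from maps each table to the tables it was derived from.
derived_from = {
    "users_agg": ["users_raw"],
    "bookings_agg": ["bookings_raw", "users_raw"],
}

def parents(table):
    """Tables this table directly derives from (upstream lineage)."""
    return derived_from.get(table, [])

def children(table):
    """Tables directly derived from this table (downstream lineage)."""
    return [child for child, srcs in derived_from.items() if table in srcs]

# A content page for a table can then show both directions:
up = parents("bookings_agg")   # upstream: the raw inputs
down = children("users_raw")   # downstream: everything built on it
```

In the real graph the same walk is a one-hop Cypher match in each direction; the point is that a single edge type yields both views.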
We're also really excited about being able to enrich and edit metadata on the fly, such as adding table descriptions and column comments, and these are pushed directly to our Hive metastore. This abstracts away a complex process that even data scientists are pretty reluctant to go through right now. I'll also highlight that we have a content page for every type of resource, and they're all a bit different. This one shows a knowledge post, which again is where data scientists share analyses, dashboards, and visualizations. Something to note here is that we're typically iframing these data tools, so viewing one generates a log that our graph picks up, and that trickles back into the graph, affecting PageRank and the number of views.

All right, on to users. Users are the ultimate holders of tribal knowledge, so we created a dedicated user page that reflects that. On the left you can see basic contact information, if you need that; on the right you can view resources that the user uses frequently, that they created, that they've favorited, and the groups to which they belong. To help build trust in data, we wanted to be transparent about data: you can look at the resources any person views, what your manager views, and so on. Along the lines of data transparency, we also made a conscious choice to keep former employees in the graph. Take George here, the handsome intern that all the ladies talk about: he created a lot of data resources and favorited things, and if I wanted to find a cool dashboard he made last summer whose name I've forgotten, this can be really relevant.

Another holder of tribal knowledge in an organization is teams. Teams have tables that they query regularly, dashboards that they look at, go-to metric definitions, and so on, and we found that team members spend a lot of time telling people about the same resources. They wanted a way to organize and curate, basically to quickly link people to these items.
So we created group pages. There's a group overview where you can see who's on a particular team, and to enable curated content we decided to borrow some ideas from Pinterest: you can pin any sort of content to a page, and there's basic organizational functionality. In the case that a team doesn't have any curated content, we have a popular tab (which you can't really see here): rather than displaying an empty page, we leverage our graph to inspect what resources the people on that team use on a regular basis, and provide context that way. I also want to highlight that we try to leverage thumbnails for maximum context: we gather something like 15,000 thumbnails from Tableau, the Knowledge Repo, and Superset, our internal data tool, through a combination of APIs and headless-browser screenshots. Here is what the pinning and editing flows look like: on the left, similar to Pinterest, you can pin an item to a team page; on the right, there's a lot of flexibility in how you customize and rearrange the resources on a team page.

Finally, we have company metric data. We found that while people typically keep a tight pulse on information relevant to their own team, as the company grows larger they feel more and more disconnected from high-level, company-wide metrics. So for that we created a high-level Airbnb dashboard where they can explore up-to-date, company-level data.

Quickly, I wanted to give an overview of the front-end technology stack, which is similar to that of many teams at Airbnb. We leverage modern JavaScript (ES6); we use NPM to manage package dependencies and build the application; we use an open-source package from Facebook called React, which is really common now, for generating the DOM in the UI; we use Redux, an application state tool; and we use a pretty cool open-source package from Khan Academy called Aphrodite, which essentially allows you to write CSS in JavaScript.
We use ESLint to enforce the JavaScript style guide, also open-sourced from Airbnb, and Enzyme, Mocha, and Chai for testing.

Now I'll jump into a few of the challenges we have faced in this project. The first is that, again, we're an umbrella data tool: we're trying to bring together all of our siloed data tools and generate a picture of the overall ecosystem. The problem is that any umbrella data tool is vulnerable to changes in its upstream dependencies. This could include things on the backend like schema changes, which could break our graph generation, or URL changes, which would break the front end. Additionally, data-dense design is hard: creating a UI that's simple and functional for people across a large range of data literacy levels is pretty challenging, and to complicate things, most internal design patterns aren't built for data-rich applications, so we had to do a lot of improvising and create our own components. John alluded to this earlier: we have a non-trivial merge into the graph that happens when we scrape everything from Hive and push it to the production Neo4j instance. And the data ecosystem is quite complex, which can be confusing for less data-literate people, so we've used the idea of proxy nodes in some cases to abstract away that complexity. As an example, John mentioned that we have lots of data tables, which are often replicated across different clusters. Non-technical users could be confused by this, so we model it accurately on the backend and then just expose a simplified proxy node on the front end.

I'll wrap up with a couple of interesting future directions that we're thinking about. The first is more proper network analysis: determining obsolete nodes, which in our case could be things like data tables that haven't been queried for a long time and are costing us thousands of dollars each month, or critical paths between different resources.
Another idea that we're exploring is more active curation of data resources. If you search for something and you get five dashboards with the same name, it's often hard, if you lack context, to tell which one is relevant to you. We have passive mechanisms, like PageRank and surfacing metadata, that hopefully push down the crap in our graph and surface more relevant things, but we're also thinking about more active forms of certification that we could boost in search ranking. Further, we're excited about moving from active exploration to delivering more relevant updates and content suggestions through alerts and recommendations. This could be things like: your dashboard is broken; the table you created hasn't been queried in several months and is costing us X amount of dollars; this group that you follow just added a lot of new content. And finally, what feature set would be complete without gamification? We're thinking about fun ways to give content producers a sense of value, surfacing things like "you have the most-viewed dashboard this month."

I'll quickly give a shout-out to the team. John and I are two of the software engineers on the team, Michelle is a third software engineer, and we have part-time support from our designer Eli and our product manager Jeff. We're all based in San Francisco. I'll also give a shout-out that, if you're interested in this topic or want more details, we'll have a blog post rolling out tomorrow morning, California time, on our Medium, if you'd like to check that out. With that, we want to say thank you; I'm not sure if we have time for questions. [Applause]
Info
Channel: Neo4j
Views: 8,035
Keywords: neo4j, graph databases, graphs, nosql
Id: gayXC2FDSiA
Length: 29min 17sec (1757 seconds)
Published: Tue May 30 2017