Delivering High Quality Analytics at Netflix

Captions
[Music]

Welcome. My name is Michelle Ufford, and I am your humble narrator for today's talk. I lead a team at Netflix focused on data engineering innovation and centralized solutions, and I want to share with you some of the really cool things we're doing around analytics, and also how we ensure that the analytics we're delivering can be trusted.

Netflix was born on a cold, stormy night back in 1997. But seriously, it's 20 years old, which most people don't realize. As of Q2 2017 we have a hundred million members worldwide. We are on track to spend six billion dollars on content this year, and you watch a lot of that content: in fact, you watch a hundred and twenty-five million hours every single day. And these are not peak numbers; on January 8th of this year you actually watched 250 million hours in a single day. It's impressive. As of Q2 2016 we are in 130 countries worldwide, and we are on, I think, around 4,000 different devices.

There's a reason I'm telling you this. We have a hundred million members watching a hundred and twenty-five million hours of content every day, in a hundred and thirty countries, on four thousand different devices: we have a lot of data. We write 700 billion events to our streaming ingestion pipeline every single day, and that's an average; we peak at well over a trillion. This data is processed and landed in our data warehouse, which is built entirely on open source big data technologies. We're currently sitting at around 60 petabytes, growing at a rate of 300 terabytes a day, and this data is actively used across the company. I'll give you some specific examples later, but on average we do about 5 petabytes of reads. What I'm trying to demonstrate in this talk is that we can apply these principles at scale, but you don't need this type of environment to get value out of them.

So now for the fun stuff: the unfortunate events. "Events" is actually a play on words here. When I say events, what I'm really talking about is all of the interactions you take across the service. This could be authenticating into an app; it could be the content that you receive, as in, do you like what we recommend for you to watch; it could be when you click on content, or pause it, or stop it, or when you click on that next thing to watch. All of those are considered events for us. This data is written into our ingestion pipeline, which is backed by Kafka, and it's landed in a raw ingestion layer inside our data warehouse. For those who are curious, this is all 100% based in the cloud; everything you see here is running on Amazon's AWS, and you can think of S3 as really just the data warehouse storage. So we have a raw layer; we use a variety of big data processing technologies, most notably Spark right now, to process that data, transform it, and land it into our data warehouse. We can also aggregate, denormalize, and summarize that data and put it into a reporting layer. A subset of this data is moved over to a variety of fast-access storage, where it's made available to our data visualization tools. Tableau is the most widely used visualization tool at Netflix, but we have a variety of other use cases, so we support other visualization tools as well, and this data is also available to be queried and interacted with using a variety of other tools.
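The talk doesn't show any of the pipeline internals, but as a rough illustration of the raw-to-warehouse hop described above, a minimal PySpark sketch might look like the following. The bucket names, event schema, and partition layout are hypothetical, not Netflix's actual ones.

```python
# Hypothetical sketch of the raw -> warehouse transform step in PySpark.
# Bucket names, schema fields, and partition layout are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("playback_events_etl").getOrCreate()

# Raw JSON events landed by the ingestion pipeline (Kafka -> S3).
raw = spark.read.json("s3://events-raw/playback/dateint=20170826/")

# Keep well-formed rows and normalize a few columns for the detailed table.
cleaned = (
    raw
    .filter(F.col("event_type").isNotNull() & F.col("account_id").isNotNull())
    .withColumn("event_time", (F.col("event_utc_ms") / 1000).cast("timestamp"))
    .withColumn("dateint", F.date_format("event_time", "yyyyMMdd").cast("int"))
    .select("account_id", "device_type", "title_id", "event_type",
            "view_duration_sec", "event_time", "dateint")
)

# Write the detailed fact table as date-partitioned Parquet in the warehouse layer;
# aggregation, denormalization, and the fast-access copy would run downstream.
(cleaned.write
    .mode("overwrite")
    .partitionBy("dateint")
    .parquet("s3://warehouse/playback_events/"))
```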
Every person in the company has access to our reports and to our data. But this is what happens when everything works well; this is what it looks like. Let's talk about when things go wrong, when you have bad data. Really, though, I don't like the term "bad," because it implies intent, and the data is not trying to ruin your Monday; it doesn't want to create problems in your reports. So I like to think of it as unfortunate data.

This is a visualization, and it's actually one of the coolest visualizations I've ever seen. It's a tool called Vizceral, and it shows all of the traffic coming into our service. Every one of those small dots is an API, and those large dots represent our various AWS regions. If you look here, you might notice one of these circles is red, and that means there's a problem with that API. It means we're receiving data, for the most part, but we might not be receiving all of it; we might not be receiving one type of event. So in this example, let's say we receive the events when you click play on new content, but when you get to the end and click to play the next episode, we don't see that. Let's just use that as a hypothetical. So now we have our bad data, our unfortunate data, coming in, and let's say, just for example purposes, that this only affects our tablet devices: maybe the latest release of the Android SDK caused some sort of compatibility issue, and now we don't see that one type of event. It could be that the event doesn't come through at all, it could be that the event comes through but is malformed, it could be an empty payload. The point is that we are somehow missing data that we're expecting. And it doesn't stop there; it makes its way through our entire system. It goes into our ingestion pipeline, it gets landed in S3, it ultimately makes its way to the data warehouse, it gets copied over to our fast storage, and if you're doing extracts, it's going to live there too. Ultimately, this data is going to be in front of our users.

The problem here is that you are the face of the problem, even though you had nothing to do with it. You created the report, and the user is interacting with the report, but it is generally not going to be your fault. Sometimes it's your fault, but most of the time what I've observed is that issues with reports are, quote unquote, upstream. What does that mean? Every single icon and every single arrow on this diagram is a point of failure, and not just one or two possible things that could go wrong: a dozen or more different things could go wrong. And this is a high-level view; if we drilled down, you would see even more points of failure. So it's realistic to expect that things will go wrong. Compounding this problem for us is that, according to Sandvine, Netflix accounts for 35% of all peak traffic in North America.

So I've described the problem: we have some bad data, some unfortunate data. That is not the issue itself, right? What does it really matter if I've got unfortunate data sitting in a table somewhere? The problem is when you are using that data to make decisions. That's the impact. It is my personal belief that there is no other company in the world that is more data-driven than Netflix. There are other companies that are as data-driven, and they're not as big, and there are bigger companies, and they're not as data-driven. If I'm mistaken, please see me afterwards; I would love to hear about it. But when you're really a data-driven company, that means you are actively using and looking at that data; you're relying upon it.
So how do we ensure people still have confidence? Well, first, look at all of the different roles we have. We start with our data engineers, and from my perspective this is just a software engineer who really specializes in data. They understand distributed systems; they are processing those 700 billion events and making them consumable for the rest of the company. We also have our analytics engineers, who will usually pick up where the data engineer left off. They might be doing some aggregations or creating some summary tables, they'll be creating some visualizations, and they might even do some ad hoc analysis, so we consider them sort of full-stack within the data space. And then we have people who specialize in just data visualization; they are really, really good at making the data make sense to people. I'm curious, though: how many of you would consider yourself something like an analytics engineer, where you have to create tables as well as the reports? Wow, that's actually more than I thought; it's a pretty good portion of the room. How many of you only do visualization? Show of hands. Okay, so more people actually have to create the tables than do just the visualizations. Those are the people I consider data producers; they're creating data objects for the rest of the company to consume.

Then we move into our data consumers. This would be our business analysts, who are probably very similar to your business analysts: they have really deep vertical expertise, and they produce analyses like the subscriber forecast for EMEA. We also have research scientists and quantitative analysts in our science and algorithms groups, and they are focused on answering really big, hard questions. We have our data scientists and machine learning scientists, who are creating models to help us predict behaviors or make better decisions. These would be our consumers. These people are affected by the bad data, but there's really no impact yet. It's not until we get to the top layer that we really start to see impact, and it starts with our executives. They are looking at that data to make decisions about the company's strategy. Many companies say they're data-driven, but what that really means is, "I've got an idea and I just need the data to prove it. Oh, that doesn't look good, go look over here instead," and so they find the data that proves their point and go off in the direction they want. Being data-driven means that you look at the data first and then you make decisions. So if we provide them with bad data and bad insights, they could make a really bad decision for the company.

Our product managers: we have these across every vertical, but one example would be our content team. They are asking the question, what should we spend that six billion dollars on? What titles should we license, and what titles should we create? They do that by relying on predictive models built by our data scientists saying, here's what we expect the audience to be for a title, and based on that we can back into a number we're willing to pay. This is actually a really good model, because it allows us to support niche audiences with small film titles, but also spend a lot of money on things like the Marvel and Disney partnerships, where we know the content will have broad appeal. We have our algorithm engineers, who are trying to decide what the right content is to show you on the site.
We have between 60 and 90 seconds for you to find content before you leave, and I know it feels like longer than 60 or 90 seconds when you're clicking next, next, next, but that's about how long we have before you go spend your free time doing something else. And then we have our software engineers, who are constantly experimenting with things, and this could be everything from the user experience that you actually see to things that you don't see, like the optimal compression for our video encoding, so that we can lower the amount of bandwidth you have to spend while also preventing you from having a really bad video experience. So the impact is really at that top level.

So how do we design for these unfortunate events? We have the data, we've got lots of data, we've got lots of people who want to look at it, and a lot of people who are depending upon it. I think you have two options here. The first option is you can say, we are going to prevent anything from going wrong, we're going to check for everything, we're going to lock it down, and when something gets deployed we're going to make sure that thing is airtight. That sounds good in principle, but the reality is that these issues usually don't occur when you deploy something; usually everything looks great, and then six months later there's a problem. So from my perspective it's a lot better to detect issues and respond to them than it is to try to prevent them.

When it comes to detecting data quality issues, I'm going to get a little more technical here; please bear with me. I think this is all relevant for you, and I think there are some really good takeaways, so there's a reason I'm showing you this. We're going to drill down into the data storage layer and look at the concept of a table. How many of you actually work with Hadoop at all? How many of you work with an enterprise data warehouse, but it's on something else, like Teradata? Okay. In Teradata, a table is both a logical and a physical construct; you cannot separate the two. In Hadoop, you can. We have the data sitting somewhere on storage, and we have this concept of a table, which is really just a pointer to it; we can choose to point to the data, or we can choose not to. One thing we've built is a tool called Metacat, and whenever we write that data, whenever we do that pointing, we create another logical object: a partition object. This partition object has statistics about the data that was just written. We can look at things like the row counts, and we can look at the number of nulls in that file, and use this information to see if there's a problem. We can also drill down a little deeper, to the field level, and use this for some really explicit checks. We can say, I'm checking the max value of this metric, and the max value is zero, and that doesn't make sense unless it's some sort of negative-only field; chances are this is either a brand new field or there's a problem. So we can check for these things. You don't have to do things the way I'm describing, but the concept, I think, is pretty transferable: having statistics about the data that you write enables more powerful things.
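Metacat is the actual Netflix service mentioned here, but its API isn't shown in the talk. As a stand-in, here is a minimal sketch of the same idea: computing partition-level statistics directly in PySpark and applying the kind of explicit field-level checks described above. The table path and column names are hypothetical.

```python
# Sketch of partition-level audit statistics, in the spirit of what Metacat
# stores for each partition write. Computed here directly with PySpark; this
# is not Metacat's API, and the table path and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition_stats").getOrCreate()

partition = spark.read.parquet("s3://warehouse/playback_events/dateint=20170826/")

stats = partition.agg(
    F.count("*").alias("row_count"),
    F.sum(F.col("title_id").isNull().cast("int")).alias("null_title_ids"),
    F.max("view_duration_sec").alias("max_view_duration_sec"),
).first()

# Explicit checks like the ones described in the talk.
if stats.row_count == 0:
    raise ValueError("Partition is empty; refusing to publish it.")
if stats.max_view_duration_sec == 0:
    raise ValueError("Max view duration is zero; likely a new or broken field.")
```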
Okay, so now that we have statistics, we can use them in isolation and say, oh, we got a zero, that's a problem. But typically the issues are a little more difficult to find than that. So what we can do is take that data and chart, for example, row counts over time. You can see that we've got peaks and valleys here, which denotes some difference in behavior based upon the day of the week. If we use a standard normal distribution, we can look for something that falls outside of, say, a 90% confidence interval, and if it does, maybe there's not a problem, but we definitely want someone to go look and see if there is one. When we compare the same day of the week, week over week, for 30 periods, we start to see that we have some outliers, some things that might be problems. We can also see that the data we wrote most recently looks really suspect, because I wrote 10 billion rows, and typically I write between 80 and a hundred billion rows. So chances are there's a problem with this particular run of the ETL.

So we can detect the issues, but that doesn't really prevent the impact. The perennial question: can I trust this report, can I trust this data? I have no idea, looking at this, whether there's a data quality issue. What's really problematic, and what is really the issue for you, is when people look at these reports, they trust the reports, and then afterwards we tell them the data was actually wrong, we're going to back it out, we're going to fix it for you. There was no indication, looking at the report, that they couldn't trust it, but the next time they look at this report, guess what: it's going to be there in the back of their mind. Is this data good? Can I trust it?

So what we've done is built a process that checks for these issues before the data becomes visible. All of the unfortunate stuff can still happen; we still have the data coming in, it's still landing in our ingestion layer, but before we write it out to our data warehouse we check for those standard deviations, and when we find exceptions we fail the ETL; we don't go any further in the process. We also check before we get to our reporting layer, same thing. What this means is that your user is not going to see their data. They're going to come to the report and it looks like this: your business user is going to see there's missing data, and now they're going to know there was a quality issue. And we don't want them to know that, right? Wrong. We want them to know there was a problem, because it's not your fault there was a problem; there are so many things that could go wrong. Simply by showing them explicitly that we have no data, they retain confidence; they know they're not making decisions based on bad data. And your business should not be making major decisions on a single day's worth of data anyway. Where it becomes really problematic is when you're doing trends and percent changes; there, even a little bad data can have a big impact. One thing you do have to do to make this work is surface the information so that users can see when the data was last loaded and when we last validated it. So there are two things: not showing them bad data, and providing visibility into the current state. This is a view of our big data portal, an internal tool that we've developed; I think there are third-party tools out there that do some of the same things. We're also planning to add visibility into the actual failures and alerts so that business users can see those.
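The exact audit logic isn't shown in the talk, but a minimal sketch of the week-over-week check described above, written in plain Python with an illustrative threshold and made-up history values, might look like this: compare today's row count against the same weekday for prior weeks and halt the ETL before publishing if it falls outside roughly a 90% interval.

```python
# Sketch of the "audit before publish" gate: compare today's row count against
# the same day of week over prior weeks and fail the ETL on an outlier.
# The threshold, history source, and numbers below are illustrative assumptions.
import statistics

def audit_row_count(todays_count, history, z_threshold=1.645):
    """Raise if today's count falls outside ~90% of the historical distribution."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = abs(todays_count - mean) / stdev
    if z > z_threshold:
        raise RuntimeError(
            f"Row count {todays_count:,.0f} is {z:.1f} standard deviations from "
            f"the historical mean of {mean:,.0f}; halting before publish."
        )

# Same-weekday history clusters around 80-100 billion rows (30 weeks in practice),
# so a 10-billion-row run fails the audit and the partition never becomes visible.
history = [92e9, 88e9, 95e9, 101e9, 84e9, 90e9, 97e9, 86e9]
audit_row_count(10e9, history)  # raises RuntimeError
```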
So now we've detected the issue and prevented any negative impact, but we still have to fix the problem; they still want the data at the end of the day. There are two components to fixing the problem quickly. The first one, as I just mentioned, is visibility, but this time visibility for the people who need to understand what the problem is and fix it. One of the things we're doing is surfacing this information. The question might be, why did my job suddenly spike in run time, why is this taking so long? You can look here and easily see, oh, it's because you received a lot more data, and then the question becomes, is that because somebody deployed something upstream and now it's duplicating everything? It gives you a starting point to understand what the problems are. We also directly display the failures and give you a link to go see the failure messages themselves, so when users are trying to troubleshoot, we're just trying to make it easier and faster for them to get there. And this is the lineage data: how do these things relate, what things are waiting on me to fix this, and who do I need to notify that there's a problem?

Now, I'm going to cover this real quick: this is about scheduling, and polling versus pushing. It's not something that you would implement yourselves, but it's something you should be having conversations with your infrastructure teams about. Traditionally we use a schedule-based system, where we say, okay, it's 6 o'clock, my job's going to run, I'm going to take those 700 billion events and create this really clean detailed table. Then at 7 o'clock I'm going to have another job run and it's going to aggregate that data. At 8 o'clock I'm going to have another process run that denormalizes it to get it ready for my report. I'm going to copy it over to my fast-access layer at 8:30, and by 9 o'clock my report should be ready. In a push-based system you might still have some scheduling component; you might say, well, I want everything to start at 6:00 a.m. The difference is that once the first job is done, it notifies the aggregate job that it's ready to run because there's new data, which notifies the denormalized job that it's ready to run, which notifies, or just executes, the extract over to your fast-access layer, and your report becomes available to everybody by maybe 7:42.
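A toy sketch of that push-based cascade, in plain Python with made-up table and job names (a real deployment would use a workflow scheduler rather than direct function calls), might look like this:

```python
# Toy sketch of push-based (event-driven) scheduling: each table announces when
# it lands, and its consumers run immediately instead of polling on a clock.
# Table names and the in-process "notification" are illustrative only.

# Which jobs consume each table, i.e. the downstream edges of the lineage graph.
DOWNSTREAM = {
    "playback_events_detail": ["playback_daily_agg"],
    "playback_daily_agg": ["playback_report_denorm"],
    "playback_report_denorm": ["fast_storage_extract"],
}

def run_job(table):
    print(f"running the ETL that produces {table}")
    # ... the actual transform for this table would run here ...
    publish(table)

def publish(table):
    """Announce that `table` has fresh data, which cascades to every consumer."""
    for consumer in DOWNSTREAM.get(table, []):
        run_job(consumer)

# Kicking off (or reflowing) the detail table automatically reruns the aggregate,
# the denormalized reporting table, and the fast-storage extract in order.
run_job("playback_events_detail")
```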
You can see the benefit here: getting the data out to your users faster. But that's not why you should care about this; that's probably why your business users might care about it. Why you should care about it is that things don't always work perfectly, in fact they usually don't, and when that happens you're going to have to reflow and fix things. This is an actual table that we have: one table populates six tables, which populate 38 tables, which populate 586. This is a pretty run-of-the-mill table for us; I have one table that by the third level of dependency has 2,000 table dependencies. So how do we fix the data when it started off in one place and has been propagated to all of these other places? In a pull-based system, you rerun your job, and my detailed data, my aggregate, my denormalized view all get updated and my report is good, but the other 582 tables are kind of left hanging. You could notify them, if you have visibility into who is consuming your data, but they still have to go take action. What's going to happen is you're going to tell them, hey, we've had this data quality issue and we reflowed, and it's really important that you rerun your job, and they're going to think, okay, yeah, but I deprecated that, and I might have forgotten to turn off my ETL, and they have no idea that somebody else has started to rely upon that data for a report. This happens all the time; people don't feel particularly incentivized to go rerun and clean things up unless they know what the impact is. In a push system, we fix the one table, it notifies the next tables that there's new data, which notifies the tables after that, and everything gets fixed downstream. This is a perfect world, it's very idealistic, a very pure push-type system. What you should be discussing with your internal infrastructure team is that you should not need to know that there is an upstream issue; you should just be able to rely upon the fact that when there is a problem, your jobs will be executed for you so you can rely upon that data. Nobody should have to do that manually; it doesn't scale.

Part four. So we've gone through this, and we've talked about some of the different ways that unfortunate data could impact us, but that's not the reality for us. The reality is that our users do have confidence in the data we produce. That doesn't mean there are no quality issues, but overall, generally speaking, they have confidence that what we produce and provide to them is good, and because of that we're able to do some really cool things. Our executives actually looked at the data for how content was being used, and the efficiency of it, and they made a decision based upon the data to start investing in originals. Over the next few years we went from a handful of hours in 2012 to about a thousand hours of original content in 2017; we've ramped this up very, very quickly, and we have set a goal that by 2020, 50% of that six billion dollars we're spending will be on new content that we create. But this was a strategy decision that was informed by the data we had.
We also have our product managers, who are looking at the data, and they're making some pretty good selections about what content we should be bringing to the service. We've had some pretty good ones, and they're going to continue to use the data to decide what the best things for us to buy are. We have our software engineers, who have built out constantly evolving interfaces and user experiences for us, and this doesn't happen all at once. This isn't a monolithic project where they just roll out big changes; instead they're making small changes, testing them incrementally, and then making the decision to roll them out. We have about a hundred different tests going on right this moment to see what the best thing for us to do is. So you can see what we looked like in 2016, and here's what we look like in 2017. [Video clip] You can imagine for a moment the amount of complexity and the number of different systems involved in making something like that happen, and before we really make that investment, we go and test it: is our theory correct? One thing I think is really interesting is that we find we can often predict what people will like, if those people are exactly like us, and usually people are not exactly like us. So instead we just throw it out there: whenever we have ideas, we test them, and then we respond to the results.

And then we have our algorithm engineers. These are the people responsible for putting the intelligence in our system and making it fast and seamless, and the most well-known case is our recommendation system. I could talk about it, but I actually think this video is a little more interesting. [Video clip] Now, we did not create the video, but we provided the data behind the video. 80% of people watch content through recommendations. 80%. We did not start off at that number; it was only through constant iteration, looking at the data and responding to the things that had positive results, that we got to this place.

Okay, so we'll start wrapping up with the key takeaways. Obviously I don't expect you to go back and do all of this in your environment, but I think there are some key principles that make sense for a lot of people outside of what we're doing at Netflix. The first one is that expecting failure is more efficient than trying to prevent it. This is true for your data teams, but it's also true for you as data visualization people: how can you expect and respond to failures and to issues with the data and with your reports, rather than trying to prevent them? So shift your mindset and say, I know something's going to happen; what are we going to do when it happens? Stale data: I have never heard someone tell me, "I would rather have had the incomplete data faster than the stale, accurate data." I am sure there are cases out there where that is true, but it's almost never the case. People don't make decisions based upon one hour or one day's worth of data. They might want to know what's happening; they might say, I just launched something and I really want to know how it's performing. That's a natural human trait, that curiosity, but it is not impactful. So ask yourself: would they rather see data faster and have it be wrong, or would they rather know that when they do finally see the data, maybe a few minutes later, maybe an hour later, it's right? I know this is really hard.
It's easy for me to tell you this; I know that you have to go back to your business users and tell them, and I realize that. But if you explain it to them this way, usually they can begin to understand. And then the last thing, and this is really validation for you: I've had a lot of people ask me, "How do you do this? We have all these problems," and the reality is that we have those problems too. This stuff is really, really hard. It's hard to get right, it's hard to do well. It's hard for us, and I think we do it pretty well, though I think there's a lot we can do better. So just know that you're not alone; it's not you, it's not your environment. Take this slide back to your boss and show them: this stuff is really, really hard to do right. And that's my talk. [Applause]
Info
Channel: Netflix Data
Views: 64,473
Rating: 4.9504642 out of 5
Id: nMyuCdqzpZc
Length: 33min 38sec (2018 seconds)
Published: Thu Oct 26 2017