How Netflix Handles Data Streams Up to 8M Events/sec

Video Statistics and Information

Captions
My name is Peter Bakas and I work at Netflix. Before we get started I want to take a gauge of the crowd: how many folks are on the data science side? Okay. And on the data engineering or infrastructure side? Okay, cool.

A little bit about me: I run real-time data infrastructure at Netflix. Prior to Netflix I spent time at Ooyala, Yahoo, and PayPal, working on everything from payments to behavioral targeting to display advertising. My background is primarily distributed systems and large-scale infrastructure. When I am not at work I advise several startups, primarily around data security and containers, so if anyone is at a startup and wants to talk through something, I'm always happy to take a call or an email and chat about what we've done and what I've seen — always happy to help any way I can.

All right — Keystone. Keystone is our common data pipeline. How many people here are familiar with data pipelines and data movement? And how many realize that's a very big problem at a lot of organizations? The job of my team is essentially to provide the paved path for your data engineers and data scientists to do what they need to do. At Netflix we have a concept of a paved path: it's the way everyone can use a service to do what they need without having to build infrastructure themselves. Having said that, Netflix is a culture where we believe in freedom and responsibility, so if someone wants to go off-road they're fully free to do what they need to do. We optimize for agility, not for efficiency. If someone comes to me and says, "I want to be able to do this," and it's not on my roadmap or I can't get to it in time, they're perfectly free to go build what they need so they can deliver their product and service in the manner they need to. That said, when you do go off-roading you sometimes get stuck, so what we try to do is build that common paved path for people.

What I'm hoping to talk through today — and I want this to be interactive; if people have questions, please ask. For every question asked I have a Netflix bottle opener, as long as it's in the form of a question. I want to cover the architecture and design principles we put in place while building Keystone, some of the technologies — by the end of this it will be buzzword bingo — and also best practices. One thing I want people to take away is that even if you're not at Netflix scale, there are certain things we've done, and ways we've done them, that could benefit you. Whether you're doing a million events a second or a hundred million, there are still principles and best practices that can help.

Okay, so how many people here know what Netflix does? You might think we're really a logging company — I believe Adrian had a quote about us being a logging company that occasionally streams movies, and this is true. So what does that mean? Here are the big numbers: we ingest over 600 billion events every day; at peak we do 11 million events a second, about 24 GB per second; hundreds of different event types; over 1.3 petabytes a day. Those are daily numbers. So what are these events?
These events are things like session traces and annotations — business events that we use to do things like build recommendations, A/B tests, streaming client logs, things of that sort. We don't handle operational metrics, so this is purely the application and service business-event side; there's another system called Atlas, which has been open sourced, that handles all the operational metrics, so the two systems are decoupled.

We also hit a trillion unique events ingested in a day during the holiday season — while everyone was home over the holidays watching Netflix instead of spending time with their families, we were at about a trillion events. Now, if you notice, we've since gone down, and I'll talk a little bit about that in a bit. It's great to stand up here and say, "Look, we do a trillion events," and check that off, but our philosophy is really that we want events to be useful. Producing an event just for the sake of producing an event doesn't do anything except put really big numbers on a slide. As data scientists and engineers, people talk about big data and fast data, but it's really useful data that we want.

At this point I want to take a step back and talk about where we were a little over a year ago and why we made some of the decisions to move to Keystone and build it. Originally our data pipeline looked like this — very simple. We had events being produced and collected (ask a question and you get a bottle opener), based on Apache Chukwa. Essentially, application services would publish events, we'd collect them, and we'd write them to S3 for all our offline processing. Very simple; things worked great and people were very happy — until they weren't. A little after that, people started wanting the ability to do things like at-least-once delivery and real time, so we introduced a branch off the mainline that tees traffic into Kafka. We routed it to Elasticsearch and to secondary Kafka clusters, where it was consumed by purpose-built consumer applications, Spark, or Mantis — Mantis is not open source yet; it's our in-house stream processing system. So that's conceptually where people consumed from, either this Kafka or that Kafka.

The problem we found is that there's a lot of duplication — if you notice, there's a lot of functionality that's very similar. It also confused a lot of people: a producer would come and say, "I want to produce an event — do I send it through here or through there? What happens if I need guarantees?" So about a year ago we decided we wanted to simplify what we had and start offering higher-level services. As we looked at this we said, okay, we need at-least-once, people want to do things in real time — we could invest and build things ourselves, or we could look at what's out there: what parts of our infrastructure already provide some of this functionality, is there a community behind it, and what can we do?

What we decided on was this. If you look at this pipeline it's much simpler; we've essentially flattened out what we had before. We removed Chukwa and put in a fronting Kafka, essentially to handle the fan-in and fan-out. The routing service is different, and I'll talk about that in a minute.
When we were looking at the routing service: as we were adding durability and at-least-once, if our routing service could not provide the same semantics and SLAs, it served no purpose. We ended up using Samza, and I'll talk about the routing service and what it looks like. The primary point of doing this was that, while being a snowflake is awesome, common core infrastructure is not that different from company to company, and what we want as a team is to innovate where we can provide additional value to the business. When we compared Kafka against what we were doing: it had at-least-once, it had durability, and it had a huge community behind it. So we decided our effort was better spent getting Kafka to where we needed it to be rather than rewriting the whole thing ourselves. Having said that, there are challenges: Kafka was designed to run in a physical data center, and running in AWS there are different challenges you have to deal with, but for us it was better to solve those problems than to be the only company with something no one else was using and have to support the entire thing ourselves.

How many folks here know Kafka? Okay, I'll skip through this part; if someone's curious about Kafka we can talk afterwards, and there's a lot of great material out there. Common terminology you'll hear: producer, consumer, topics, partitions, brokers — if you Google Kafka you'll probably see this.

One of the things we do differently is that we don't use the stock Kafka producer; we've put a wrapper around it, for a couple of reasons. For us, right now, it's best-effort delivery: we prefer drop over block. Our philosophy is that we should never block an application, because the point of our service is to stream movies — that is the important piece. If an application or service needs to do its function, blocking its ability to send an event is bad, so our philosophy is drop, not block. The other reason is integration into our ecosystem — Eureka, Atlas, things of that sort. For those who have used Netflix open source software, or know about it, we have a ton of stuff that we've built, so there's a lot of integration into our services.

As I said, the philosophy is not to block: if we have an outage, existing applications should still be able to perform their function. If that's doing playback of a movie, they should be able to do that; not being able to send an event should not make your experience any worse. At the same time, if we suddenly need to scale up or launch new instances because our service is unavailable, that should not mean your service suffers — you should be able to operate your service the way you need to. And once we restore service, it shouldn't require you to do anything: your application should be able to start sending events again just like normal, without any action on your part. So we provide a buffer on the client side — I'll talk about this a little later — but once you fill that buffer it starts dropping messages. (There you go — you really wanted that. Nice, isn't it?)

In general, like I said, our philosophy is that we should not block. Having said that, one of the questions we're talking about right now is the value of data, because not all data is created equal. There are certain events we don't want to drop, so how do you create a different class of service to ensure those events get through versus other ones?
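To make the drop-not-block behavior concrete, here is a minimal sketch of a producer wrapper with a bounded client-side buffer that drops when full. This is an illustration only — the class and method names are hypothetical, not Netflix's actual client library, which also integrates with Eureka, Atlas, and the rest of their ecosystem.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical "drop, not block" producer wrapper.
public class DropNotBlockProducer {

    private final BlockingQueue<byte[]> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public DropNotBlockProducer(int bufferSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferSize);
    }

    /** Never blocks the caller: if the client-side buffer is full, the event is dropped. */
    public boolean logEvent(byte[] event) {
        boolean accepted = buffer.offer(event);   // non-blocking enqueue
        if (!accepted) {
            dropped.incrementAndGet();            // count drops so they can be monitored and alerted on
        }
        return accepted;
    }

    /** A background sender thread drains the buffer and hands events to the Kafka producer. */
    byte[] nextEventToSend() throws InterruptedException {
        return buffer.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}
```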
When I talk about higher-level services and what we offer, that's the kind of thing we're talking about: data is not all created equal, so what does that look like? Do we need multiple copies of these events? Do we need guarantees? Do we need to provide quotas, so that in a multi-tenant system someone behaving poorly doesn't affect someone else? Those are some of the questions we're starting to ask right now.

[Question about partitioning:] Right now each event type is a topic, and then based on volume we'll partition across. Part of it is that we need to make sure clusters are balanced, because otherwise you start seeing some interesting behaviors, but that's what we do. There's some work being done on automatic rebalancing and things of that sort that will be coming out.

[Question about retries:] This is v1 of the new pipeline, but there are a lot of conversations around retries and what that looks like — what signals we provide back to the application. Right now we assume a lot of the responsibility as a service; what we want to do is provide signals back to the apps so they can decide what to do with them. You could decide on your own: "this is really important, I want to retry," or "I want to write it temporarily somewhere else and then send it," or "this isn't an important message, I'm going to drop it." We don't have that ability right now, but that is work we're looking at doing. We also have a suite of control libraries so we can see what was sent versus what we ingested.

On the buffer question: we drop when we fill the buffer. The other thing we do is use only one ack instead of two. We don't want to wait for that second ack — we don't want to block on it — so we're willing to lose a message rather than wait for two acks. Having said that, as we start looking at a higher quality of service, there may be situations where we do wait. We also do sticky partitioning, and we batch when we produce events, primarily to reduce the burden on CPU and network.

[Question about alternatives to dropping:] Part of it is that you could do things like write to disk temporarily — there are other things you can do to weather the situation. What we've designed for is the producing side, and our biggest challenge right now is around the producer and dropping events, especially running at scale: once you start getting replication lag, you have brokers that are misbehaving. We don't keep a huge buffer, because we don't want to take up a huge amount of memory in every application. There are a couple of strategies we're looking at, depending on what events you're sending — retries and things of that sort — but for deep questions I'm happy to go into that afterwards.
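As a rough illustration of the "one ack, don't block, batch" choices described above, here is what the relevant settings could look like on the standard Java Kafka producer. The property names are real producer configs (0.9-era and later; older producers used slightly different names), but the specific values are assumptions for illustration, not Netflix's production configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class FrontingKafkaProducerFactory {

    // Illustrative settings only; values are assumed, not Netflix's actual configuration.
    public static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "1");                  // wait for the leader only, not the second ack
        props.put("retries", "0");               // best-effort delivery: drop rather than retry
        props.put("batch.size", "65536");        // batch to reduce CPU and network overhead
        props.put("linger.ms", "50");            // allow a small delay so batches fill up
        props.put("buffer.memory", "33554432");  // bounded client-side buffer (~32 MB)
        props.put("max.block.ms", "0");          // never block the application thread on a full buffer
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<>(props);
    }
}
```

With max.block.ms set to 0, a send against a full buffer fails immediately instead of blocking, which is what lets a wrapper like the one above count the event as dropped and move on.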
So we talked about batching — how do you produce events? You use the platform library; it's essentially a log-event call. Netflix is primarily a Java shop, but there are other languages around and we're becoming more polyglot, so we also provide a proxy service: if you're running Python or something else, you can access the service the same way. The same goes for a sidecar. We inject metadata — a GUID, timestamp, host, and app. In general we have our own wire protocol; it's invisible to sources and sinks. We currently support JSON, with Avro on the horizon — that's a legacy of where we were.

In our migration from the old pipeline to the new, we wanted to minimize the moving pieces, and this actually caused a bit of an issue. We had the old pipeline running and were essentially shadowing traffic onto the new one as we built it out, and when we changed our wire protocol we introduced compatibility issues downstream. In hindsight, if you're doing a massive migration, try to minimize the number of moving pieces and make it as simple as possible. That change probably added weeks to our timeline, because we had to go back and validate. We did it for the right reasons, but in hindsight: get from the existing legacy system onto the new one, and then iterate afterwards. The entire process took about a year, from kickoff to the end of last year when we cut it over — so while I talked about 600 billion events, we were actually doing 1.2 trillion, because we were shadowing both pipelines.

[Question about Parquet:] We do use Parquet, but that's on the HDFS side, so we have to transform from Avro to Parquet. What we deliver to S3 for EMR are sequence files; for all the Hadoop work they use Parquet. That's a highly contested conversation at Netflix right now — for anyone interested in Avro versus anything else, like Protobuf, we can go into that afterwards.

We packaged the library as a jar, primarily because we wanted to be able to evolve it independently. It's fairly lightweight; we don't add a lot of size to each event. The other thing we have from legacy is a 10 MB maximum event size — that's a pretty big event. We're trying to move down to supporting probably one megabyte, but this is one of those things where we decided not to introduce changes that our producing apps would have to make.

[Question about those large events:] It's very batchy — it's not a constant, steady stream; the app sending it will batch up to that size and then send, which makes it a bit more complicated to balance what your bytes-in are. And part of it is philosophy: our average payload size is a couple of kilobytes, so a 10 MB event is orders of magnitude larger. Those are conversations we have to have with those teams — they have good reasons for doing it — but one thing we've discussed is: what if you write that payload to S3 and just send the location through our pipeline, and your app picks it up afterwards? We're trying to work with those teams, but ultimately we want to drive that limit down to about one meg, especially given our average payload size. I think that's fair.
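To illustrate the metadata injection and the legacy size limit described above, a wrapped event might look something like the following sketch. The envelope fields (GUID, timestamp, host, app) come from the talk; the class itself and the size check are illustrative, not the actual Keystone wire format.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

// Illustrative event envelope; not the actual Keystone wire protocol.
public class KeystoneEvent {

    // Legacy limit mentioned in the talk; the stated goal is to move toward ~1 MB.
    static final int MAX_PAYLOAD_BYTES = 10 * 1024 * 1024;

    final String guid;      // injected unique id, usable for de-duplication at the sink
    final long timestamp;   // injected producer-side timestamp
    final String host;      // injected producing host
    final String app;       // injected producing application name
    final byte[] payload;   // the application's event body (JSON today, Avro on the horizon)

    KeystoneEvent(String app, byte[] payload) {
        if (payload.length > MAX_PAYLOAD_BYTES) {
            throw new IllegalArgumentException("payload exceeds the maximum event size");
        }
        this.guid = UUID.randomUUID().toString();
        this.timestamp = System.currentTimeMillis();
        this.host = localHostName();
        this.app = app;
        this.payload = payload;
    }

    private static String localHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }
}
```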
[Question about what happens when service is restored:] What will happen is that it drains what's in the buffer. You fill up the buffer, the other messages drop, the service comes back up, what's in the buffer gets flushed, and then the app starts producing again. We've provisioned for peak plus failover. Part of the challenge is that, being a stateful service, we don't necessarily have the ability to auto-scale yet — we're working on that; scaling up is a little bit easier — so we've had to provision for peak plus failover. While it's possible to get overwhelmed, we haven't seen it, and usually we won't lose everything — it will be isolated.

Let me talk a little about the fronting Kafka and the purpose it serves. If you look at the architecture, we have a fronting Kafka whose sole purpose is essentially to create a buffer able to sustain all incoming requests — provisioned, like I said, for peak plus failover. From a failover perspective, if we fail out of a region we'll shift traffic, so we've calculated what that looks like and we provision for it, and we have the ability to add additional partitions and so on. If we do get overwhelmed, once again, the intent is to drop.

What we've done here is that it's not one cluster — it's eight clusters per region, so we have 24 on the fronting side. We've also isolated the ZooKeeper side: each cluster has its own ZooKeeper cluster. The reason we ended up doing that is that we had a situation where a ZooKeeper leader re-election caused some interesting behaviors. At the time we were running four clusters with something like 14 or 15 thousand partitions each, and one of those clusters just wasn't able to recover after that re-election. What we decided at that point, to minimize our blast radius, was to split into smaller clusters. Now we cap how many partitions each cluster has, and if we need more we just add new clusters, each with its own ZooKeeper. We try to keep it under 200 nodes per cluster and around 10,000 partitions.

[Question about that incident:] We'll get into that — it was a long night. We can talk about that issue and a lot of the network things we're seeing, especially being in the cloud. We're still in EC2-Classic — most everyone else in Amazon is probably in VPC — and we'll be migrating to VPC this year, which should give us much better network performance, so for us that's a big incentive. But yes, I can go into the details of what happened and why it was a long night.

In addition to those eight clusters, we now have two priorities: normal and high. Pretty complex stuff. As I mentioned, the fronting Kafka is a buffer, and based on priority we have a retention period of 8 hours to 24 hours. What this does is, if something is going on downstream — something with S3, back pressure from Elasticsearch, something with the routing service — it gives us that long to handle an incident without losing messages. We do two copies, at 8 hours and 24 hours. We are going to offer the ability to select one, two, or three copies depending on priority; what we want to do is push a lot of what we're doing now down to one copy. Like I said, it's about the value of data: there are certain things you want to make sure you don't lose, and others where some loss is okay, and as we go through cost attribution and the value of events, we want to split that out. To start with, we went with two copies. Think billing versus error logs, or password changes versus a sign-up versus someone doing a trace on their service. There are other events that are used for recommendations or A/B tests — you don't want to lose those, because that would skew your A/B test results and all of a sudden you're getting recommendations that make you go, "what?"
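A minimal sketch of the priority-to-retention mapping described above. The values mirror the talk (normal versus high priority, 8 versus 24 hours of buffer, two copies today with one to three planned), but the enum itself is purely illustrative.

```java
import java.util.concurrent.TimeUnit;

// Illustrative mapping of stream priority to fronting-Kafka topic settings.
public enum StreamPriority {

    NORMAL(TimeUnit.HOURS.toMillis(8), 2),   // shorter buffer window
    HIGH(TimeUnit.HOURS.toMillis(24), 2);    // longer window for events that must not be lost

    final long retentionMs;       // would map to topic-level retention.ms in the fronting cluster
    final int replicationFactor;  // two copies today; planned to become selectable (1-3) per stream

    StreamPriority(long retentionMs, int replicationFactor) {
        this.retentionMs = retentionMs;
        this.replicationFactor = replicationFactor;
    }
}
```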
So that's an evolving piece — we'll see how it goes before we move to one copy. There are a lot of producer improvements we need to make first; right now that's where we lose the majority of our events — and by loss we're talking something like 0.0001 percent — but it's from the producer to the broker.

Size-wise, we have over 3,000 brokers running Kafka, spread across three regions and multiple AZs within each of those regions. We're using d2.xl instances; for us it was a balance between performance and cost. Those are newer instance types, but that's the footprint. We do replica assignments that are zone-aware, and this has availability benefits: if we lose an AZ it's okay — we still have other AZs where the data is distributed, so we don't have loss. We can also sustain a host failure: if a host has multiple instances on it, we're still okay. An added benefit is around the cost of maintenance: we now have the ability to remove an entire AZ, upgrade or perform any maintenance, and add it back in. That was a nice side effect of gearing for higher availability. Like I said, it's seamless to our customers; they don't know the difference.

Part of our failover strategy is also in-region: we have the ability to spin up a new cluster and shift traffic there, so if we have a cluster that's having problems we can essentially route traffic to a stand-in cluster, resolve the issue or blow the old one away, and once it's resolved shift traffic back. It's all automated. We have very tight SLAs on how much data we can lose, so if something is happening we'll spin up and route; there is some time it takes to provision. The stand-in also has its own ZooKeeper cluster, but that one is shared among all the failover clusters — each of the eight clusters has its own, and there's an additional ninth one that any failed-over cluster would use. From a risk perspective we're okay with that. That's one of the things we've worked on over the past few months: the ability to seamlessly do in-region failovers.

One of the services we have is a Kafka auditor, which is actually really useful for us. We'll be open sourcing it, probably over the next few months, hopefully a little sooner. It provides broker monitoring, consumer heartbeating, and performance metrics, and it's a standalone service — it's a lot of what we use to see the health of our system.

Current issues: as I mentioned, there's a trade-off between cost and performance using the d2.xls. We've noticed that our performance degrades the more partitions we have, so we try to limit that. We can do topic migrations — we have the ability to automatically migrate topics over, seamlessly and without loss for customers. We also watch replication lag; one of the things we're interested in with one copy is whether our drops actually go down, because we would no longer have the replication lag that causes anomalies within a broker that's being written to.
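To make the zone-aware replica assignment described above concrete: it amounts to spreading each partition's replicas across distinct availability zones, so that losing one AZ (or pulling it out for maintenance) never removes all copies. The following is a simplified sketch of that idea, not Netflix's actual assignment tooling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified zone-aware replica assignment: each partition's replicas land in different AZs.
public class ZoneAwareAssigner {

    /**
     * brokersByZone maps an AZ name to the broker ids in that zone (assumed roughly balanced).
     * Returns partition id -> list of replica broker ids, one replica per distinct zone.
     */
    public static Map<Integer, List<Integer>> assign(
            Map<String, List<Integer>> brokersByZone, int partitions, int replicas) {

        List<String> zones = new ArrayList<>(brokersByZone.keySet());
        Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();

        for (int p = 0; p < partitions; p++) {
            List<Integer> replicaBrokers = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // Rotate the starting zone per partition so leaders spread evenly across AZs.
                String zone = zones.get((p + r) % zones.size());
                List<Integer> zoneBrokers = brokersByZone.get(zone);
                replicaBrokers.add(zoneBrokers.get(p % zoneBrokers.size()));
            }
            assignment.put(p, replicaBrokers);
        }
        return assignment;
    }
}
```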
Routing service — any other questions first? [Question about the instance type:] The problems we had with the d2.xls were that they were new, and initially a lot of the placement meant one physical host would carry a lot of instances, so if something happened to that host we'd see large numbers of brokers going offline at once. The density and the placement have gotten a lot better, and I'm interested to see how it is in VPC. One thing we're looking at is going with much larger instance types — we'd run fewer instances, but bigger ones, so we wouldn't see the noisy-neighbor behavior that sometimes pops up. We've also looked at other instance types, and we're starting to experiment a little with EBS to see what that looks like; preliminary results seem pretty good. So there are a lot of performance improvements we're looking at over the next few months, from instance types to the kind of storage we're using. Part of it is that if we can say we need less retention, or more network bandwidth, we can start playing with those configurations. For us, the d2s were kind of in the middle from a cost/performance standpoint. Amazon has been great with that — they worked with us really closely early on troubleshooting what the issues were, so Amazon has been a great partner for us in that sense.

[Question about running anywhere other than AWS:] Nope, we're all in. All right, I'll do a little dance while you think about the other one. We have a lot of in-house built tools. For those of you using Kafka, there's a lot that we've built; we're focused — and I'll talk about this later — on self-service. These are just common things everyone has to do, whether it's managing your environment or how you do topic creation. So if anyone's interested in collaborating, please reach out to me. The one thing my team is passionate about is working both internally and externally to solve these problems. Like I said, we're not a snowflake; the reason we chose something with a community is that it has a big community, and granted, some of the challenges we have due to scale other people don't have, but you still have to manage Kafka whether it's ten nodes or three thousand. So if people are interested in doing stuff here, I'm glad to collaborate. So — did you remember the question? Did I stall enough? All right.

[Question about failure testing:] In general we routinely fail out of regions as a company; we do exercises all the time where we shift traffic around. We run Chaos Monkey on brokers — they die all the time. One of the reasons we went to isolated ZooKeepers is that we wanted to limit our blast radius. Leader elections happen all the time and for the most part they work fine, but part of it for us is risk mitigation. I talked about our failover strategy; we routinely practice it. I was just having this conversation with my team this morning: the first thing a new on-call does is fail over a cluster, because then you know it works. We fail over at least once a week. We know things are going to fail — the question is how fast we can react, how we mitigate the risk, what the blast radius is, how we communicate to our upstream and downstream partners, what our SLA is, and what visibility we have.
We have a small team — nine engineers — and most folks are relatively new; we built the team out over the past year. The way we test failure and the way we automate is the only way we're able to manage something of this scale, and there's a lot of effort being placed in self-healing systems. If something is going on with a broker, we'll just kill it and replace it, because we have a very tight SLA. Being able to detect those signals and instrument things in such a way that when this happens we automatically kill it and bring up a new one — that's a lot of the effort we've put in, especially over the latter part of the year, to productionize things so that when failures do happen, like the big one we had, we're able to react and mitigate the risk to the business.

[Question about containers:] For Kafka, we're not; for the routing service, we are. Great segue — you should get two bottle openers. Our routing service is based on Samza. I'll skip the topic basics, since most people sound familiar with Kafka and topics and things of that sort. The pieces are Docker, MySQL, Samza, and a checkpointing cluster. We did a significant amount of work on Samza to be able to run it standalone; we run it in containers, and we use MySQL as our source of truth. The way our routing service works is that you have jobs that are assigned partitions, and their job is to read those partitions and move the data to whichever sink — it's per topic, per sink. We run in containers; it's a fairly simple service overall, but with very complex SLAs around what happens, how you launch, and so on. We do have the ability to auto-scale the routing service.

If you look at what we're doing, we're running about 13,000 containers in production on about 1,300 hosts. We actually had fewer containers and more hosts before; we've done some optimizations over the past month. Part of that is that as we were rolling out the service, our focus was on moving from the old service to the new, knowing we could do performance improvements and reduce some of the spend afterwards. So we're up to thirteen thousand containers; most of them are fairly similar and handle between 1 and 12 partitions, and we pack across a box fairly consistently. We don't have a need for any complex scheduling, because our jobs are very much long-running. The breakdown is about 7,000 for S3 — that's all the offline Hadoop piece — about 4,500 for Kafka, and another 1,500 for Elasticsearch. What's interesting is that's about a 55/45 split; when I started at Netflix a year and a half ago we were only doing about 15 to 20 percent on the real-time branch. What's happening is we're starting to see a lot more workloads and a lot more desire to do things in real time — we've seen a huge explosion in the number of Elasticsearch clusters and nodes running — and you see a lot more folks interested in the real-time piece. One of the things we're doing from a self-service perspective is giving people the ability to determine not only their quality of service but also their destination, so it becomes opt-in. Right now everything ends up in S3 whether we want it or not; there are certain events we really just want to process right now, and after that it doesn't matter, but by default everything lands there — that's legacy; if you look at the original pipeline, that was the sink.
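Conceptually, each routing job is a Samza task that reads one topic and writes to one sink. Below is a bare-bones sketch using Samza's low-level StreamTask API; the sink system and stream names are made up, and the real router additionally deals with checkpointing, back pressure, and sink-specific writers (S3, Kafka, Elasticsearch).

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Bare-bones routing task: one topic in, one sink out. Illustrative only.
public class RoutingTask implements StreamTask {

    // Hypothetical destination; per-job configuration would select S3, Kafka, or Elasticsearch.
    private static final SystemStream SINK = new SystemStream("consumer-kafka", "routed-topic");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Forward the message as-is. Offsets are checkpointed only after a successful write,
        // which gives at-least-once delivery (and therefore possible duplicates downstream).
        collector.send(new OutgoingMessageEnvelope(SINK, envelope.getMessage()));
    }
}
```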
[Question about why containers:] Technically we don't need to run in containers. The routing service is the start of our stream processing offering, and while the routing service is very fixed as far as resources — we know what it looks like — when you're doing general stream processing you may have a huge variety of workloads. We want the ability to let people run different workloads across a shared infrastructure, so for the routing service itself it was an investment in where we're going rather than where we're at. Part of it is that if you look at the ability to manage resources — eventually there will be other workloads running here, not just routing — then being able to do resource management properly and use your peaks and troughs makes you a lot more efficient, plus you get the isolation. For all intents and purposes we could have run it on VMs, but for us it was an investment in where we're going versus doing something temporary and having to redo it down the road.

Five minutes? All right, more numbers — I'll post this up so people can grab it. As I talked about, we have isolation per job: each job reads from a topic, a certain number of partitions, and writes to the appropriate destination. (I'll get to your question after, since we have five minutes.) A couple of things there: the isolation is good. One of the problems we had before was a routing job that would write to multiple sinks, so if you got back pressure from Elasticsearch it would cause the other jobs to back up as well; from an availability and balance perspective, we're splitting those out.

Let's talk about at-least-once. Our intent is to offer at-least-once, and once events get into Kafka we have a pretty good ability to do that. However, as I mentioned, our biggest issue right now is from producer to broker. We're putting a lot of work into it; the 0.9 release helps out a lot, so we're currently in the process of upgrading to 0.9 — we have the test environment and we're evaluating — and hopefully that should solve some of the problems we have.

Data loss: we touched on it — basically buffer-full and network errors. From a Kafka perspective, being spread across multiple AZs lets us mitigate some of that risk, but it can happen if you have an unclean leader election, especially since we have an ack of one: if you write to a broker and something happens before it replicates, we do have that potential. We haven't seen it, but it is a risk we carry. At the router, loss is less likely; duplicates are probably more so. From a loss perspective, if something happens to the routing service and you exceed your 8-to-24-hour window, you do have that potential. That's also why, as I mentioned, we've spent a significant amount of time on monitoring and alerting — we have threshold alerting for anything above 0.1, a tenth of a percent. The intent of the fronting Kafka is to buffer, and the intent of the router is to read and move as fast as possible; as long as there's no lag, or very minimal lag, between the two, you shouldn't lose messages. We shouldn't have that problem — but it can happen.

Duplicates are where you could see issues when writing to a sink. We don't commit the offset until after we have written, so if the offset isn't committed and a job dies, we automatically launch a new one, it reads from the previous offset, and you may get duplicates. We do provide a GUID, so you can de-dupe at the sink level.
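Since delivery is at-least-once, sinks that care can de-duplicate on the injected GUID. Below is a minimal sketch of that idea using a bounded in-memory LRU set; a real sink would typically use something more durable, and the class here is purely illustrative.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sink-side de-duplication keyed on the injected event GUID.
public class GuidDeduplicator {

    private final Set<String> seen;

    public GuidDeduplicator(final int maxTracked) {
        // Bounded LRU set: evicts the oldest GUIDs once maxTracked is exceeded.
        this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxTracked;
            }
        });
    }

    /** Returns true the first time a GUID is seen, false for a duplicate. */
    public boolean firstTimeSeen(String guid) {
        return seen.add(guid);
    }
}
```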
Duplicates aren't as big of a problem since, as I mentioned, we can de-dupe. More numbers: average latencies — for our S3 sink it's about three seconds, and about a second for our real-time branch for stream processing. With Elasticsearch there is some back pressure: we're able to deliver messages a lot faster than Elasticsearch can handle, so we do see back pressure there and you're at about 400 seconds. Percentile-wise — like I said, I'll share all of this; I want to get to more questions.

Alerts: if we're dropping more than 1 percent of messages, we'll alert on that. Lag, as I mentioned, is really important from the routing perspective, so that's at 0.1. Consumer stuck on a partition. Dashboarding is very important: we provide multiple views for people. This is our general view for producers and consumers — you can see the data flow; it provides kind of a 10,000-foot view, and you can drill down to see what's going on by topic and by app, what your loss is, and what your throughput rates are. Internally we track slightly different things. One of the things we're building is for people to do self-service data streams; when you get that, you automatically get integrated monitoring and all your dashboarding — so we're really prioritizing visibility. This one is actually pretty interesting: it's a visualization of our environment, and fortunately or unfortunately for this talk, everything was green. The intent is to give us visibility into our environment so we can quickly see the health of our clusters, either in table form or as a chart. For example, the trace-annotations health there was 99.53, and these change color depending on health, so it allows us to quickly go through a large environment and identify things that are misbehaving.

Next steps for what we're doing: self-service tools — I touched on that a bit throughout this talk — and better management. For people who are interested in or running these environments, please reach out; we're always happy to talk with people about what they're doing and how they solve problems. On the routing layer we need a more capable control plane. Avro, which we talked about. The other two big things we're working on this year are messaging as a service — general-purpose messaging as a service for the rest of the company; if you look at the pipeline there are multiple steps, and what we'll end up doing is providing the ability for people to do pub/sub, but as a service — and stream processing as a service. There are a few different technologies out there right now; what we're trying to solve, ultimately, is a multi-tenant system where people can submit their jobs and not have to worry about the underlying technologies. I may want to do a join and a filter, I may want to do some computation — there's a lot of variance in what those workloads look like — so we're going to build a general-purpose stream processing environment for people.

Any time for questions, or did I talk the entire way? All right, any questions? [Question about why Samza:] Samza works really well with Kafka, and part of what we need in a routing service is the ability to checkpoint and handle back pressure well, because we're trying to do at-least-once. We also looked at Spark Streaming.
The problem with Spark Streaming at the time — and that was with earlier versions — is that it didn't handle back pressure, and from a scale perspective, like I said, we're processing a trillion messages. We have worked with the Spark community; I believe either the 1.5 or the 1.6 release was supposed to have the back-pressure work done, and we're evaluating it for stream processing. There's also Mantis, our in-house stream processing framework — there are a lot of use cases running on Mantis right now. But for routing we had very specific latency and SLA requirements, and Samza fit that profile really well. Thank you very much.
Info
Channel: Data Council
Views: 30,098
Rating: 4.7968254 out of 5
Keywords: data engineering, data processing, data pipelines
Id: WuRazsX-MBY
Length: 54min 52sec (3292 seconds)
Published: Tue Mar 08 2016