How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows

Captions
All right, good afternoon, and thank you all for coming out and joining us. We're going to talk about the visible network: how Netflix uses Kinesis Streams to monitor applications and analyze billions of traffic flows. I'm Alan McGinnis, a Solutions Architect at AWS. I'm just the opening act, so I won't be up here for too long; I'll cover a few introductory pieces, and then John Bennett from Netflix will walk you through the specific use case in detail.

Before I hand it off to Bennett, I want to talk about streaming data in general terms, about some of the AWS services many customers are using to process streaming data, and about what we call decision latency. What do I mean by that? On this slide we've got a train — let's assume it's a Hyperloop train, very fast — and we've got a cargo ship: one fast, one slow. In your business, when you're making decisions about anything — how do I optimize the products I'm showing on an e-commerce website, how do I know operationally whether my systems are running successfully — these are questions that, if you can answer them quickly, let you make changes and optimizations that improve your business operations and improve things for your customers. That's what we mean by decision latency.

So when we look at the value of data: what's important, and how does it become more valuable? This chart is from a Forrester white paper called "Perishable Insights"; it's a good paper about the value of data over time. The far right of the chart is pretty traditional batch processing — data you might review on a daily or weekly basis. There's certainly value in that for your business, but it's old. Week-old data isn't going to tell you whether your website, for example, is operating successfully right now. If you have to make time-critical decisions, you want to know what's going on immediately, and that's the left side of the chart: decisions you want to make in seconds, maybe up to minutes. Recent data is valuable if you can act on it in real time, so you have to capture the value of your data while it's fresh.

On AWS we have a set of services that make that pretty simple: Amazon Kinesis. Quick show of hands — how many folks in the room are either familiar with or already using one of the three Kinesis services today? Pretty good — maybe a fifth of the room. So I'm not going to spend too much time on the details of the services; I know you want to hear how Netflix is solving some problems. But to set a little context I'll talk about the three services, because although Bennett is going to talk primarily about Streams, there are two other services that make up the Kinesis suite.

Kinesis Streams is what you'll generally use when you want to apply custom processing to your data: you have streaming data coming in, you apply some custom processing to it, and then you send it further downstream — into a database, into S3 for storage, or maybe onto another stream.
You could do change processing, for example, but basically with Kinesis Streams you're going to write data producers and you're going to write custom code to do the data consumption. That's Streams, and Bennett will talk a lot more about how Netflix uses it.

Firehose is also a streaming service, but the idea behind it is to make things simple. If your end game is just to take your streaming data and persist it somewhere — you're not doing a lot of data transformation, although you can do some of that now with Firehose — then Firehose will deliver it into one of three destinations today: S3, Redshift, and the Elasticsearch Service. If that's all you need, Firehose is the tool for you. There's no code you have to write; the easiest way to think about it is as a managed consumer on a stream. It takes the data, buffers it, puts it into the destination, and has built-in retry mechanisms for failure scenarios.

The third service is Kinesis Analytics. The nice thing about this one is that whether you're streaming your data through Streams or into Firehose, if you want insights into that data right now — say, how many times has this event occurred in the past five minutes — Kinesis Analytics can do that for you very simply with SQL. I'm guessing many of you are familiar with SQL: you write a simple SQL statement, Kinesis Analytics runs it against your streaming data, and you get results as soon as that window closes. So if I've got real-time streaming data and every five minutes I want to count occurrences of some event and spit that out so I can make business decisions about it, that's where Kinesis Analytics fits in. It's simple to use.

With all of this — Kinesis Streams and Kinesis Firehose at your front end — we have the full flow of data: we ingest, we process using an Amazon Kinesis-enabled application (there's a library called the KCL; I won't bore you with the details, but it's a library you can use to process streaming data), we have Lambda and Kinesis Analytics for processing, and then we have good integration with other AWS services. So we have the full flow from ingest, to process, to react, and of course to persist your data. Going from left to right on this slide, there are a number of services where the data can end up: here we have a connected scooter sending its data to a stream, and we might do some real-time processing and land the data in S3, or maybe Redshift so I can do some BI on it. Kinesis acts as the front end for the streaming data, with nice integration with a number of AWS services to help you process, react, and take action so your business can learn what's going on immediately.

So who is using Kinesis today? A number of different customers — I'm not going to talk about all of them in detail, but quickly: Sonos, the connected streaming audio speakers you're probably familiar with, runs real-time streaming analytics on data being sent from all their devices. MLB Advanced Media is another good customer:
they stream 17 petabytes of Major League Baseball game data. And lastly, quickly, Hearst: Hearst analyzes 30 terabytes of clickstream data through Kinesis. If you're not familiar with them, they're a large publishing and media company — they own hundreds of different magazines and websites, TV channels, and newspapers — a very large company with a lot of media properties. You might not realize it, but if you're on a website owned or run by Hearst, the clickstream data is being used to determine which articles are successful and which ones should be promoted, and all of that data is streamed through Kinesis. They're using Kinesis to make decisions in real time about how to improve the content on their websites.

So why do these customers choose Kinesis? There are lots of reasons, and I'm not going to touch on every point, but what I see when I work with customers is, one, lower cost, and two, a combination of increased agility and the ability to scale very simply. There are other products in the industry you could use to stream data, but Kinesis is a managed service — particularly Kinesis Streams. It's very simple: I want to create a system that streams in data from my applications; a couple of clicks, I get a stream, and I can start streaming data into it within about five minutes of starting. No systems, no EC2 instances to manage — a fully managed service. Those are the main reasons we see customers take advantage of it: simplicity, agility, and of course it can also lower your cost.

So, before I hand it over to Bennett: Netflix. They're using Kinesis, as the title says, to stream billions of network traffic flows in real time, because they want real-time insight into what's going on across the Netflix infrastructure running on AWS. They stream a lot of data and get real-time insight into what their applications are doing. To talk about how they do that, and what Netflix is doing about decision latency, I'd like to introduce John Bennett from Netflix.

Thanks, Alan. What I thought I'd talk about for the rest of this session is what kind of problems Netflix faces when it comes to monitoring applications, and the solutions we came up with — they evolved over time. Just like many of you, we do a bunch of trial and error, and most of it doesn't work, so I'm hoping to provide a little guidance from our experience to help you solve similar problems in your own domain. We're going to look at analyzing network traffic as an example, but these techniques are broadly applicable — they could be used in, say, a financial domain or in e-commerce; they work in the general sense. In particular I'm going to talk about how we use Kinesis in almost all of our solutions, and then finally we'll review some of the results.

For folks who don't know: Netflix is big. We have over a hundred million subscribers, we're spread across the entire world, and every day we're getting bigger. Under the hood we use dozens of accounts, we're spread across multiple regions, and there are
hundreds of microservices working in concert to make Netflix better, all of them deploying production changes every day while scaling up and down in response to changes in load. In the end we have over a hundred thousand instances, and that doesn't include all the containers running underneath.

In a complex system like any other, things break. One of the more convenient scapegoats, when a systems engineer or application developer can't figure out what went wrong, is almost always "what's wrong with the network?" Blame the network. It's easy to do, because in a lot of ways we lack the tooling to really tell us whether or not the network was the problem. Even earlier today there was an issue in us-east-1 where we weren't actually able to tell: is there actually a connectivity problem, or is it our own systems? So network engineers talk about reducing this thing called mean time to innocence, which is essentially: how do we tell everyone it's not our fault?

Another question we see is: why is the network so slow? I'm expecting things to happen within a couple of seconds. One of the trickier things to figure out is whether you're actually following the optimal network path. It may seem easy to determine, but in a complex system the problem can be several times more complicated if you're using dependencies you're not familiar with, or dependencies that are changing underneath you and may be taking a suboptimal path. We'll talk about how we ended up monitoring that.

And lastly: "my service can't connect to whatever." It's surprisingly difficult to get any application developer to comprehensively list all the things their service needs to connect to. They may understand "I need to talk to service A and service B, that's it," but maybe there's a bunch of things it needs to connect to during startup, or things making subsequent connections that they have no idea about. In a distributed system we're making decisions about how to enable all of these network paths — whether each one should be allowed or not — and if we don't know what dependencies each application needs, we're shooting in the dark. That's part of the reason we want to analyze network traffic flows: they tell us exactly what applications need to connect to.

The list of challenges is very long when it comes to doing this in the Netflix ecosystem. One, we have no access to the underlying network — that's AWS's job, that's why we pay them. At the same time our traffic volume is huge: network traffic logs end up being gigabytes per second, billions of flows every day. And these logs are just a record between two IP endpoints, which isn't necessarily all that meaningful to us. IP addresses are very ephemeral — they could be attached to an instance for a minute, a day, or a year, and they could be changing every other second. They're randomly assigned; it's unpredictable. Really, what we want to do is map this information to some longer-lived logical group.

Typically, when you want to analyze network traffic and you're in a
data center, you would use something like sFlow or NetFlow. But in AWS they actually provide a great API that tells us what network traffic we're seeing in our account: it's called a VPC flow log. It's good — it covers all the traffic within the VPC, it gives you a single point to read from, and it has all the core information we need, like source and destination IP. But like everything else, it's not perfect. There's a ten-minute capture window, so a flow could happen in your system one second, but maybe you don't see it in the VPC flow logs for ten minutes — that doesn't help us build a very real-time system. And these logs are stateless, so they don't tell you whether this was a request or a response, which is really what we care about. We'll get back to that later.

So again: what we have is a flow between a source and a destination at a given time — IP to IP, and who knows what those IPs mean. What we want to get to is source metadata to destination metadata: I know that application foo — sorry, service A — is talking to service B, this is the account it was in, this is the zone. Now I start to have better insight into what's actually happening in the system.

Our goal was to build a new data source that we could actually use to analyze our network traffic. We knew it needed to involve multiple dimensions like account, region, and zone, but it also needed Netflix-centric metadata: what application is this, what cluster is this. We wanted to be able to do fast aggregations — this is not something where I want to submit a query and check back a couple of hours after lunch. At the same time, we don't know what kinds of queries we'll actually want to run. We can try to predict, but we're almost guaranteed to do that badly; if we make sure the system can handle ad-hoc queries, we can slice and dice the data however we want. In the end, what we want to do is add visibility to the network: we want to go beyond "these bits are traversing between these two IPs" to "these two applications are communicating."

So that's why we built Dredge. It's our internal tool, and what it does is enrich traffic logs and then aggregate them based on the dimensional data I was describing. Dredge has evolved in a lot of ways, partly because there were so many questions we had no idea how to answer: how much data is actually generated by the network? What would it actually take to do this in real time? Is it worth it? What I want to discuss now are some of the patterns we applied — first unknowingly, then knowingly — to tackle this large volume of streaming data and derive meaningful information from it.

Just a quick overview: we start with raw VPC flow log events — the communication between two IPs — they get processed in some way, and what we want to come out of the system is an enriched version of the flow log, where the two IPs now mean something: these are the actual source and destination metadata behind them.
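To make that goal concrete, here is a minimal sketch of the two record shapes being discussed — the handful of fields a raw VPC flow log gives you, and the enriched version Dredge aims to produce. The field and type names are illustrative, not Netflix's actual schema.

```java
// Hypothetical record shapes -- field names are illustrative, not Netflix's schema.
public final class FlowRecords {

    // Roughly what a raw VPC flow log line gives you: two IP endpoints and not much else.
    public record RawFlow(
            String srcAddr,
            String dstAddr,
            int srcPort,
            int dstPort,
            String protocol,
            long bytes,
            long startEpochSeconds,
            long endEpochSeconds,
            String action) {}          // ACCEPT or REJECT

    // The longer-lived logical groups each IP should be mapped to.
    public record EndpointMetadata(
            String application,
            String cluster,
            String account,
            String region,
            String zone) {}

    // What you want after enrichment: the same flow, with both endpoints resolved.
    public record EnrichedFlow(
            RawFlow flow,
            EndpointMetadata source,
            EndpointMetadata destination) {}
}
```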
The first step we tried was batch processing, and Alan touched on this earlier: you can use Kinesis Firehose, and it's a totally reasonable way to start. There's no need to take on the burden of doing things in real time if you don't know whether it's even worth it. Most batch systems work on some interval of data, typically a day, so we started with our interval set at 24 hours. The important characteristic of batch processing is that you're processing a fixed amount of data that isn't changing, and you measure whether the system is working efficiently by how fast you can crunch through it.

There are some limitations, though. As we read these network traffic logs — this IP-to-IP communication — we need to reach out to some data source that tells us what those IPs mean. We may need to reach out to a remote database, or put a cache in front of the database to make it faster, or even bring the database locally to the batch processor to get rid of the network round-trip time. We'll go through these limitations in more detail.

Here's how an architecture might look if you did it in batch. We take in the VPC flow log events and, instead of processing them in real time, we send them directly to storage. At the same time we have a source of metadata changes — in our case we're polling AWS APIs and also using our internal systems — and we send those metadata changes to an external database. When the batch processor kicks in, it grabs whatever recent data is sitting in storage, cross-references it with our metadata, and pumps out enriched flow logs. If you wanted to do this with AWS tools, you could use Kinesis Firehose to take the stream of VPC flow logs and dump it into an S3 bucket, use Lambda to do the batch processing at some interval, and — we ended up using DynamoDB to store the metadata. Again, a totally reasonable way to start.

But like I said, the big characteristic of batch processing is the delay: can you withstand a 24-hour gap between the time something happens and the time you can actually get value from it? That's not acceptable for us, because in 24 hours our systems are completely different — so what use is that?

Here's where it gets more interesting. With stream processing, the delay is much, much lower. In our case we were able to get it down to about five to seven minutes, and really that five to seven minutes is because of the capture window I mentioned earlier — VPC flow logs are delivered roughly every ten minutes. We could actually be doing this in real time if the logs were provided in real time. The big difference between stream processing and batch processing is that now you have no idea how many events will occur within a given window, because you're operating on them as they happen. So instead of measuring throughput, what tells us whether we're operating on the stream efficiently is how far behind we are from the head of the stream: are we keeping up, or are we falling behind? A lot of you will recognize this as the producer/consumer pattern — is the consumer keeping up with the rate at which data is being produced?
Stream processing has the same limitations as batch processing: you can use a remote database, you can use caching, or you can bring the database closer to the processor. Let's dig into that in a little more detail. Instead of batch processing we're going to do stream processing, but one of the big problems when you interact with a remote database is that you now need to bridge the gap between the rate at which your raw data is produced and the rate at which you can query that external database.

Let me step away from the networking example for a moment. Instead of VPC flow logs, say it's sensor data from that internet-connected scooter, and you need to marry it with environmental readings; or say it's a log of financial transactions, and you need to marry it with some rate to figure out what the actual transaction looks like. That's what the flow logs and the metadata represent here. The problem is that the rate of these flow logs can be millions and millions per second, which means for every flow that comes into the system you would have to query the database at that same rate — and that's definitely going to be no bueno, because now you're talking about millions of queries going to a single database. You get resource contention; even if you run your queries in parallel, you're just going to overload the database.

Typically, when you run into this kind of problem — trying to bridge a gap in performance — you put a cache in front of it, and I'll go through that in just a bit. So instead of Kinesis Firehose, we use Kinesis Streams and do custom processing on the stream rather than dumping to S3. At the same time we're still taking those metadata events and pumping them into DynamoDB. We read from the stream and query that remote database, and if things are slow, we can put a cache in front of the metadata database to improve read performance. This is all very typical: if you need more reads or writes, you create an index or a secondary index, or you put some sort of cache in front of the database layer. If you wanted to do this with AWS, you could use ElastiCache — Memcached — and still use DynamoDB.

But caches end up being more trouble than they're worth: how do you manage cache invalidation? If you have this cache sitting in front of the database, you've got to set a TTL on the items in that cache. If it's too low, you're always going to be getting cache misses, so you might as well ditch the cache. If the TTL is too high, you're going to be getting stale data every time you hit the cache, and then it's up to you whether the risk of joining against stale data is worth it. In our case it wasn't.
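To make that tradeoff concrete, here is a minimal sketch of the kind of read-through TTL cache being described, with a hypothetical loader standing in for the remote metadata lookup (for example, a DynamoDB query). Set the TTL low and most lookups fall through to the database anyway; set it high and the join is done against stale metadata.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal read-through cache with a fixed TTL, sitting in front of a slow metadata lookup.
// The loader (e.g. a remote database query) and the TTL are the knobs discussed in the talk.
public final class TtlCache<K, V> {

    private record Entry<V>(V value, Instant expiresAt) {}

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // falls through to the remote database
    private final Duration ttl;

    public TtlCache(Function<K, V> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    public V get(K key) {
        Entry<V> cached = entries.get(key);
        if (cached != null && cached.expiresAt().isAfter(Instant.now())) {
            return cached.value();                       // fast path, but possibly stale
        }
        V fresh = loader.apply(key);                     // slow path: remote round trip
        entries.put(key, new Entry<>(fresh, Instant.now().plus(ttl)));
        return fresh;
    }
}
```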
One of the insights that Martin Kleppmann writes about in some of his blog posts, and in the book he put together, Designing Data-Intensive Applications, is that database indexes, caches, materialized views — these are all derived data that we use to speed up read performance, and they're all hidden away from us. When we have a database, the database is receiving a stream of changes, but when you query it you just get a fixed snapshot in time. Under the covers there is a stream of changes; it's just completely hidden from you. The indexes that come from building out the database can be derived from that original stream. So what he proposes is: why not expose that stream of changes directly to our applications, instead of creating a cache or relying on database indexes? He calls this process change data capture. The kinds of tools you use to do it are log-based message brokers like Kafka or Kinesis: you use them to send these change events and expose the change log as a first-class citizen.

So instead of querying some external database, we consume both streams of data and do the join ourselves. In a way, you're creating an alternate view of your data locally, so you don't have to make a network round trip to an external database. As you read both streams, you update your local view of the change log, optimized for your use case — everybody might do it differently.

Let me make this a little more concrete. We have two streams: the VPC flow logs and the metadata changes. As we read from the stream of traffic, we build a customized data structure. In our case I thought we would need to leverage some exotic data structure — a radix tree or a skip list, I don't even know, I'd have to go back and read a book — but really a hash map and a sorted list worked out totally fine for us. As flow logs come in, we have a map of: given this IP, here is the metadata that is valid during each time interval; and every time we get a change in our metadata, we just update that table. This is all custom, in memory, so we no longer have to go and ask an external database, "I have this IP at this timestamp, can you tell me what it means?" You already have that pre-computed view sitting there locally.

So what the final architecture ended up looking like is this: everything is much simpler. There's a single stream processor, consuming from these two streams and doing the join. (We're joining these small streams; we don't cross the streams — that's a little more dangerous.) In the end we come out with an enriched flow log, where we have no resource contention, no network round-trip time, and everything is optimized for our application. One thing you might notice is that we don't actually use Kinesis for our stream of metadata changes, and I'll explain why that is in just a little bit.
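Here is a minimal sketch of the kind of local view being described — a hash map keyed by IP, where each entry is a time-sorted set of metadata changes, so a flow can be joined against whatever metadata was valid when it occurred. The names are illustrative; the talk only describes the structure as "a hash map and a sorted list," and this sketch assumes a single processing thread per stream processor instance.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Local, in-memory view of the metadata change stream: for each IP, a timeline of
// metadata snapshots keyed by the time the change became effective. Enrichment is
// then a map lookup plus a floor search -- no remote database call per flow.
public final class MetadataTimeline {

    public record EndpointMetadata(String application, String cluster,
                                   String account, String region, String zone) {}

    // ip -> (effective-from epoch seconds -> metadata at that point in time)
    private final Map<String, TreeMap<Long, EndpointMetadata>> byIp = new ConcurrentHashMap<>();

    // Called for every event on the metadata change stream.
    public void applyChange(String ip, long effectiveFromEpochSeconds, EndpointMetadata metadata) {
        byIp.computeIfAbsent(ip, k -> new TreeMap<>())
            .put(effectiveFromEpochSeconds, metadata);
    }

    // Called for every flow log: which metadata was valid for this IP at this timestamp?
    public EndpointMetadata lookup(String ip, long flowEpochSeconds) {
        TreeMap<Long, EndpointMetadata> timeline = byIp.get(ip);
        if (timeline == null) {
            return null;                                  // unknown IP
        }
        Map.Entry<Long, EndpointMetadata> entry = timeline.floorEntry(flowEpochSeconds);
        return entry == null ? null : entry.getValue();
    }
}
```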
You can see that Kinesis is a constant theme through all of these patterns. It's worked out for us in so many ways: it's integrated very strongly with other AWS services, from flow logs all the way to Elasticsearch; it scales really well; it's stable; they provide that great library Alan mentioned earlier, the Kinesis Client Library; and in the end we've actually spent less using Kinesis than we would have using our internal Kafka clusters.

Just to speak to some of the integration points: one of the biggest fears we had taking on this problem was, all right, there's this crazy amount of network traffic — how are we even going to deal with it? Who's going to ship it everywhere? How are we even going to read something like that? VPC flow logs and Kinesis can be used together in a way that removes the need to write any code to produce the data. With a couple of API calls we could set things up so that all of our accounts, in all of our regions, send their traffic logs to a single Kinesis stream. That was a huge win for us: in this whole system there is no code on our part to take traffic logs and put them into the Kinesis stream, and that's the stuff we don't actually care about. What we really want is to read the data; I don't care about writing it.

At the same time, it allowed us to experiment: can we do this with batch? What's involved? It's a very straightforward setup with Kinesis Firehose — we could just fire data into S3 and deal with it later. We even experimented with Elasticsearch and Redshift, and we had to change maybe two or three lines of code. That was really good for us, because at the time we didn't even know whether this was worth it — there's a lot of data, and there's all this complication in the systems being launched to enrich it. So a lot of it was experimenting to find the balance between the cost of processing the data and the value it gives us. And then, when we were ready to do it in real time, we could migrate to Kinesis Streams.

This is a graph of the throughput of our traffic logs over a week, and you can see it's very sinusoidal — diurnal — there's a peak and a trough every day, following the traffic load across our entire system. That pattern can be problematic for other systems like Kafka, but for Kinesis it wasn't a problem. If you look in more detail, every hour things are very spiky, because VPC flow logs are generated in that very defined ten-minute capture window. This can also cause problems for things like Kafka, but again it wasn't a problem for Kinesis.

The Kinesis Client Library was really pleasant to use. It automated all the boring stuff we didn't care about — load balancing, reading records, checkpointing — all of that is handled, because it integrates with DynamoDB, and it provides stream-level and shard-level metrics that help us figure out whether things are actually working properly. The last thing I wanted to mention is that the total cost of ownership is so low that we haven't really thought about Kinesis since we started using it — it handles things on its own, it runs itself, managed entirely by AWS. All we need to make sure of is that there are enough shards for us to do things in parallel.
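As a rough illustration of what using the KCL looks like, here is a sketch of a record processor against the KCL 1.x interface that was current in the talk's timeframe; the KCL worker handles shard leases, load balancing across instances, and checkpoint storage in DynamoDB, and this class only decides what to do with each batch of records. The enrichment callback is a hypothetical stand-in for the join described above, and error handling is trimmed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Consumer;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

// Sketch of a KCL 1.x record processor: the library hands this class batches of records
// from one shard and takes care of everything else (leases, balancing, checkpoint storage).
public class FlowLogRecordProcessor implements IRecordProcessor {

    private final Consumer<String> enricher;  // hypothetical: parses a line and joins it locally

    public FlowLogRecordProcessor(Consumer<String> enricher) {
        this.enricher = enricher;
    }

    @Override
    public void initialize(String shardId) {
        // Called once when this processor is assigned a shard.
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            ByteBuffer data = record.getData();
            byte[] payload = new byte[data.remaining()];
            data.get(payload);
            // The actual payload format depends on how flow logs are delivered to the stream.
            enricher.accept(new String(payload, StandardCharsets.UTF_8));
        }
        try {
            checkpointer.checkpoint();   // persist progress so a restart resumes from here
        } catch (Exception e) {
            // In real code: retry or let the lease expire; omitted in this sketch.
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // Called when the lease is lost or the worker shuts down.
    }
}
```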
There are some use cases where Kinesis isn't the right tool for the job. If you need so many shards that it becomes a problem, Kinesis may not fit, because then you've got to fan out to another stream or increase the number of shards in some unreasonable way. Another limit on Kinesis is that the farthest back you can keep data in the stream is seven days. If you remember the architecture diagram from earlier, we have this stream of metadata changes, and the change for a given IP could have happened yesterday, today, or several months ago. If your stream can only hold seven days of data, you can't go back far enough for changes that happened much earlier. So Kafka ends up providing that system for us, and it has a log compaction mode that clears away old data you no longer need.

Let's talk about some of the results after all that work. We're enriching millions of flow logs — gigabytes of data per second. What we ended up with: we're doing seven million flows per second, and every flow that comes in ends up being about five minutes old. That's generally still useful for us — obviously if we could bring it down to seconds it would be better — and really all we have is one stream and hundreds of shards.

So let's revisit the questions from earlier. What's wrong with the network? This is where Dredge comes in, and it helps reduce that mean time to innocence: is it the network's fault, or is it some application developer's fault? Here's what Dredge can do: it groups things by fault domain. I mentioned earlier that with traffic logs we can break things down by dimensions like account and region; we can group things by fault domain — essentially, everything in this account and this zone is a single fault domain — and we can compare what those things connect to on the other side: which account and zone do those services hit? So if service A can't talk to service D, but everything else in its shared fault domain can still connect on the other side, that's probably a bad code push — you push it back to the developer and say, you go figure it out, I'm going to take off. But if all of the services in a single fault domain share the same degraded-performance problem, that may be an indication of a network outage. Really, really valuable.

Why is the network so slow? My favorite, because it's trickier than it seems. Dredge can identify whether you're taking high-latency paths. One of the things you want to do when you're using AWS is follow what we call zone affinity: a service should strive to communicate with other services within its own zone. That means you'll see less than a millisecond of latency, plus whatever time the application takes. If you go cross-zone — from us-east-1d to another zone in us-east-1 — you're going to add probably a couple of milliseconds, maybe less. Maybe that's not a big deal, but if you've got large volumes of traffic, adding those couple of milliseconds to every transaction may end up being meaningful. And here's the big gotcha: what if you're going across regions? Now we're talking 30 to 300 milliseconds, and that can be very problematic — no one wants that, especially if you want to perform consistently. This could be straightforward if you can identify that service A is talking to service B cross-region, but what happens when you're connecting to service B
in your own zone — you think, I've got it, I'm golden, I'm hitting services within my own zone — but then it fans out to another zone or another region, and you're still eating that high-latency penalty? Dredge helps us understand whether these high-latency paths are being taken, because we're now reading all traffic flows within the network and breaking things down by logical groups like accounts, regions, and zones — so we understand not only the source of the traffic but also its destination. This was a really huge win for us: it helps us understand how to improve the performance of our systems, and it also lets us reduce cost, because network traffic that goes between zones, or even between regions, is several times more expensive than traffic that stays within the same zone. Some initial findings: almost a quarter of our traffic is cross-zone, which is crazy — you would think we had somehow optimized that, but we haven't. In a lot of ways our systems have evolved so quickly that engineers aren't necessarily thinking about the latency added by cross-zone traffic. Even more disconcerting, 14% of that ends up being cross-region. Some of that is intentional, but some of it is definitely shooting yourself in the foot.

Lastly: my service can't connect to its dependencies. Dredge really helps identify what is inbound to a service and what is outbound from it. Some of our existing tools try to do this with distributed tracing — some of you may be familiar with Google's Dapper — which is about understanding what breadcrumbs a request leaves as it makes its way through the system, and it does that through sampling. But it's very JVM-centric, so it doesn't necessarily have a lot of coverage — if it's JVM-centric, how do you do that for, say, Python, Node.js, or Go? It has other problems too, like capturing startup dependencies or protocols other than TCP/IPv4.

To give you an example, one of our service owners says: "service A talks to Cassandra and Memcached — that's what my system does, I know because I wrote it." But if you actually dig into the network traffic, you find that it's not only talking to those; it's talking to discovery, it's talking to another cluster, it's talking to S3, it's talking to SQS. A lot of application developers didn't even know that was happening on their systems, because they were focused on the flows under their control rather than, say, a client library they were using. We found a significant discrepancy between Dredge and the dependencies surfaced by our distributed tracing system: almost 100 percent of the services I sampled showed a delta between the two, and almost always the Dredge data was much larger than the data produced by distributed tracing — it's consistently under-reporting. And we now have high coverage, because this doesn't depend on the runtime, the language, or the framework an application uses: if you use the network, Dredge is going to see it. This was also super helpful for understanding which AWS services our applications talk to, because those calls would never be instrumented in a way that would be surfaced to our systems engineers.
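A minimal sketch of the kind of aggregation described in this part of the talk, assuming flows have already been enriched: group flows by source application to get its observed outbound dependencies, and flag the edges that leave the source's zone or region. Field and type names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Given enriched flows (IPs already mapped to application/account/region/zone),
// derive two of the answers discussed above: each application's observed outbound
// dependencies, and which edges violate zone affinity or cross regions.
public final class FlowAggregations {

    public record Endpoint(String application, String account, String region, String zone) {}
    public record EnrichedFlow(Endpoint source, Endpoint destination) {}

    // application -> set of destination applications actually seen on the network
    public static Map<String, Set<String>> outboundDependencies(List<EnrichedFlow> flows) {
        Map<String, Set<String>> deps = new HashMap<>();
        for (EnrichedFlow f : flows) {
            deps.computeIfAbsent(f.source().application(), k -> new HashSet<>())
                .add(f.destination().application());
        }
        return deps;
    }

    // Edges that leave the source's zone (adds a couple of milliseconds) or its region
    // (30-300 ms, and several times the cost of staying in-zone).
    public static List<EnrichedFlow> crossZoneOrRegion(List<EnrichedFlow> flows) {
        return flows.stream()
                .filter(f -> !f.source().zone().equals(f.destination().zone())
                          || !f.source().region().equals(f.destination().region()))
                .toList();
    }
}
```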
Security has actually been using this data source very heavily, because now they can understand which applications need to talk to each other, and they can reduce the blast radius in case a service is compromised. One of Netflix's long-standing values is freedom and responsibility: we try to give as much freedom as possible to our application developers. Say we open up security groups very wide so everyone can talk to each other — those application developers may not know how to rein that in, but by using Dredge's dependency information we can reduce that scope and make sure systems aren't exposed in suboptimal ways. Flow logs also give us the only source that tells us whether security groups actually work: there's no other way to figure out whether a security group rule actually rejected traffic on some port or protocol. And we can see which applications have an increased risk profile because they're communicating with the general internet, either outbound or inbound.

So in the end, what I wanted to communicate is that by enriching and aggregating this data we put together a really powerful source of information. It can be used for monitoring, for triage, for cost analysis, for security auditing — it takes this invisible network, where we lacked tools, adds visibility, and gives us actionable items we can use to make our systems better. That's it. You can reach me on Twitter if you have any questions, and I think we have a few minutes for questions, unless I was talking so fast that that's pretty much it. [Applause]

Hi — yes, the question was whether I can elaborate on the joining of the Kinesis stream and the Kafka stream. That's happening on a single instance: it's reading from both streams, the equivalent of consumer groups on each, and doing the join. The next question was whether we're using KTables. Not KTables — we're just using an in-memory map. The next question was about our experience using Elasticsearch. When we were using Kinesis Firehose we could send traffic logs to an Elasticsearch cluster, but they wouldn't be enriched, so the data that ended up in Elasticsearch was just IP and port information. It added an intermediate step we didn't think was necessary: you have traffic logs that are IP-to-IP communication, now they're in Elasticsearch and they're still IP-to-IP communication, and then you have to take things out of Elasticsearch, do the enrichment, and ship it off somewhere else. The question was whether you can use Lambda now to pick up from either S3 or Elasticsearch — yes, you can use Lambda to do that.

Good question — the term is a cold start. If you're bringing up a new instance, which is all of our applications, it should be prepared to do that — so how do you handle being able to keep enriching new data? Our instances take five to six minutes to read from the very beginning of our streams and gather all the data they need before they can start enriching new flow logs. And yes, part of the reason we use Kafka for the metadata is that we can keep data from the very beginning, as opposed to only, say, the last few days.
That one I can answer the same way — it's the same question: just our changes of metadata. Sorry, the question was what data lived in Kafka: essentially, given some IP at a given time, this is the metadata connected to it. The enriched data ends up going into a Druid cluster — a tool built by folks from Metamarkets — which lets us do the ad-hoc queries we wanted without forcing us to pre-create indexes and live with a very small set of queries we can ask.

Any other questions? One over here. The question is how we use the KCL, the Kinesis Client Library. We use it out of the box, the way the documentation describes: they outline an interface that your Java or Scala application needs to implement, so that as data records come in it processes the data and does things like checkpointing. It's pretty straightforward. I'd say that was where I was most reluctant, because if you're going to read this large a volume of data from any data source, the client library needs to be able to handle all the failures and the load balancing — and it was actually really helpful for us.

Are there any questions for Alan? That's okay? No? The next question is why we use Druid for ad-hoc and OLAP-style queries instead of, say, Elasticsearch. When I look at Elasticsearch I ask: do I need full-text search? If not, I think it's probably the wrong fit — but it's definitely usable. Anything else? Thumbs up, thumbs down, close it out? Okay — wait, you guys aren't just saying that? Yes — the question was whether you can install Druid in AWS. Yes, but not without pain — it's something like fifteen different subsystems. Another alternative, which I think the folks at Cloudflare use, is ClickHouse, which comes from Yandex. ClickHouse is a much simpler system — fewer moving parts, much easier to reason about — and you don't end up depending on things like ZooKeeper, which is notoriously hard to operate. Any other questions? I'll be outside in case anyone wants to throw things at me, or give me candy. No? Okay, all right.
Info
Channel: Amazon Web Services
Views: 31,232
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, AWS Chicago Summit 2017, Netflix, Kinesis, BDA403
Id: 8tsIqfvizpU
Length: 50min 0sec (3000 seconds)
Published: Wed Aug 02 2017