Evolution of Edge @Netflix

Captions
Good evening. Imagine for a second that you and I start a new company. We've got an idea, we've got some funds, and we've gathered a talented team of engineers to work on our idea. It's a great place to be. We will work hard, for sure, but success is not guaranteed. What is guaranteed is that over time we will evolve our product, our customers, our infrastructure, and our edge. Today I want to talk about the evolution of edge, and I want to use Netflix as a case study. Granted, Netflix is not a startup, but it went through some changes, and with these changes its infrastructure and business evolved.

My name is Vasily, and for the last seven years I have worked on edge services. I started at Apple, where I built ad services for iCloud and iTunes. In 2018 I joined Netflix to work on the API platform, and later on push notifications and API gateways. Yesterday I learned that if a speaker tweets, the speaker is more credible, so last night I wrote my first tweet. Full disclosure: since we are going to cover a lot, I'm not going to go deep; this is an overview session.

When I was preparing for the talk, I went online to the InfoQ website and checked what people were interested in. One very popular question was: what is edge? For the purposes of this talk, let's agree that edge is something that is close to the client, but it's not a category. We will not say something is edge and something is not; let's treat it as a quality that is more or less pronounced for a certain concern. For example, take DNS and take a database: one is edgier than the other. DNS is more edgy than a database, unless you expose the database to the client, which is probably a bad idea.

All right, let's go back to our business. We have money, but not a lot; usually that's the case. So the key thing we optimize for is time to market; we need to move fast. How do we do that? Basically, we do not overcomplicate things: we introduce sane practices and we rely on the good judgment of the people we started the company with. What does that mean? Let's start simple, with a three-tier architecture. Three-tier architecture means we have a client, we have an API application, and we have a database that this API talks to. All the concerns go into this API application. The beauty of this approach is that we don't need to spend time building standards or tooling: whatever we put in the code base of this application is the standard. Our edge right now is our API application; it's the only app, and all we have is edge. We also have a load balancer, which usually terminates encrypted traffic and sends plaintext traffic to the application itself, and our DNS is usually very simple: we just need our clients to be able to find us.

If we flip slides, we see how the Netflix architecture looked in the early days; please try to find the differences. The difference is the name of the application: NCCP, not API. NCCP stands for Netflix Content Control Protocol, and it was the only application exposed to the client; it powered all of the experience. There was a hardware load balancer terminating TLS in front, and there was one domain name, a simple record. That was good enough to start the business, and I hope it's good enough for us to start ours.

Summarizing the three-tier stage: a monolithic application is fine, and our edge concerns are very simple. Our API is the edge, and DNS and a load balancer are all we need; we are good.
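As a toy illustration of that three-tier monolith (not Netflix's actual code), here is a minimal sketch in Java: one API process in front of a data store, with every concern living in the same code base. The endpoint and the in-memory "database" are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tier 1 is the client (a browser, a TV app); this process is tier 2 (the API);
// the map stands in for tier 3 (the database). Every concern -- auth, routing,
// business logic -- would live in this one application.
public class MonolithApi {
    // Hypothetical data store; a real deployment would talk to a database here.
    private static final Map<String, String> DB = new ConcurrentHashMap<>(
            Map.of("movie:1", "The Matrix"));

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/titles/", exchange -> {
            String id = exchange.getRequestURI().getPath().substring("/titles/".length());
            boolean found = DB.containsKey("movie:" + id);
            byte[] body = DB.getOrDefault("movie:" + id, "not found")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(found ? 200 : 404, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        // The load balancer in front would terminate TLS and forward plain HTTP here.
        server.start();
    }
}
```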
If we are successful enough, we get our customers, and now it's time to grow. With the growth of the customer base and of engineering, we add features; with added features we add engineers; with added engineers we add features; money flows. Beautiful. At this point we want to preserve engineering velocity. We are still relatively small, and we don't want to step on each other's toes. How does this manifest in changes to our ecosystem? Whenever we see something big that a lot of people work on, we have a tendency to split it apart, and lately we've been splitting everything into microservices. So let's introduce some microservices.

Good. We took some concerns out of the API application and made them separate apps. They probably have separate databases, and probably dedicated teams working on them. It works well, and our clients didn't notice that we did it, because the API is our edge and it is the point of interaction for all of our clients. Over time, as we build more and more separate applications, we still build the logic to orchestrate them inside the API: a request coming from a client hits the API, and the API executes code that calls the underlying microservices. Over time the amount of this code grows, and while we can say, "Thank you, API team, for making everyone's life easier," we know the API will become a monolith very soon. In a talk a couple of years ago, Josh Evans referred to this problem as the return of the monolith. Well, the monolith never went away; it just grew slower. So over time the API gets bigger and bigger. What do we do with bigger things? We try to split them. So let's split the API. But unlike the services we introduced on the previous slide, the API is an edge service, and it is a contract with our clients. So what do we do? We need to change the clients.

There are at least a couple of ways to do that. One is orchestration on the client side. If we split the edge application, one way is to introduce an additional domain name and say: everything for this concern goes to this domain name, and everything for that concern goes to the other one. The challenge is that the client has to change. A good practice is to introduce another service that tells the client where to go for which concern; in the case of Netflix, it could be "playback goes to this service, discovery goes to that service." The alternative is to introduce an API gateway and route traffic transparently. Transparently routing traffic is essential for splitting our monoliths, and our edge, further.

Let's do a small quiz. Who thinks Netflix did client-side orchestration? Raise your hand. All right, a few of you. Who thinks Netflix did API-gateway orchestration? All right, the majority. Interestingly enough, you are all right; the beauty of these approaches is that they are not mutually exclusive. Initially, Netflix split the edge into two applications using client-side orchestration: NCCP, the Netflix Content Control Protocol, stood there for the playback experience, while a new API application started to handle the discovery experience. There were separate domain names and multiple load balancers. Over time, as the ecosystem evolved, Netflix introduced an API gateway to the picture, as there was a need to split a lot more functionality on the edge, and that paid off.
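As a sketch of the client-side orchestration option mentioned above, the "service that tells the client where to go" could look something like this. The concern names and hostnames are hypothetical, not Netflix's real bootstrap contract.

```java
import java.util.Map;

// Client-side orchestration: before doing real work, the client asks a small
// bootstrap service which host to use for each concern, so the edge can be
// split without hard-coding URLs into every client release.
public class Bootstrap {
    // In production this map would be fetched over HTTP at startup;
    // the hostnames here are illustrative only.
    static Map<String, String> fetchEndpoints() {
        return Map.of(
                "playback", "https://nccp.example.com",   // playback concern
                "discovery", "https://api.example.com");  // browse/discovery concern
    }

    public static void main(String[] args) {
        Map<String, String> endpoints = fetchEndpoints();
        // The client routes each call by concern instead of using a single API host.
        System.out.println("play:   " + endpoints.get("playback") + "/play/123");
        System.out.println("browse: " + endpoints.get("discovery") + "/search?q=matrix");
    }
}
```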
Over time, more and more services were added. On this slide you can see Node.js servers being added; we call them backends for frontends. They allow UI engineers to run their own services on the edge that call other services, get data from them, and form a response in the format their devices expect.

The introduction of the API gateway was a big step forward. What does it help with? First of all, an API gateway reduces coupling between the client and the ecosystem services. How does it do that? By providing two contracts and bridging them together: a contract with the client and a contract with the services.

Let's look at an example, and authentication is a perfect example for an API gateway. At Netflix there are two different use cases. One is streaming clients: when you open the Netflix application, that's streaming. There is also the content engineering use case, which is the enterprise part of the business. A colleague of mine gave a talk yesterday on how authentication works in the streaming world, so I will focus on content engineering, the enterprise world. In the enterprise world we have several types of authentication; to name a few, OAuth and mutual TLS. You can imagine what it would be like to implement authentication in every edge service. And not only that: if you implement it in every edge service and then find a vulnerability, imagine what it takes to rotate, say, 60 services and deploy a security patch.

So the API gateway handles the authentication flow. We call it Zuul, because we have this open source technology called Zuul, so we use the terms interchangeably. In our case it works like this: a user comes in, a redirect is sent, the user authenticates and gets back to the service. Then what happens? Once the user is authenticated, the request goes to the underlying backend service. The backend service doesn't need to know how the user authenticated and doesn't need to know about the flow, but it does need to know about identity. The identity is passed as a header. We craft what we call an end-to-end identity: it's a JWT token that is signed, and the signature can be verified at any layer. We pass this identity token with the request, and the service that receives it can forward the identity with its own requests even further. So the whole invocation chain knows on whose behalf the call is executed, without worrying about the nitty-gritty details of authentication.
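A minimal sketch of that signed end-to-end identity idea, using plain JDK HMAC rather than Netflix's actual token format (which, per the talk, is a signed JWT). All names, the header, and the shared key are hypothetical.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// The gateway authenticates the user once, then mints a signed identity token
// and forwards it as a header; any service down the call chain can verify the
// signature without re-running the authentication flow.
public class EndToEndIdentity {
    private static final byte[] KEY = "demo-shared-secret".getBytes(StandardCharsets.UTF_8);

    static String mint(String userId) throws Exception {
        String payload = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(("sub=" + userId).getBytes(StandardCharsets.UTF_8));
        return payload + "." + sign(payload);
    }

    static boolean verify(String token) throws Exception {
        String[] parts = token.split("\\.");
        return parts.length == 2 && sign(parts[0]).equals(parts[1]);
    }

    private static String sign(String payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KEY, "HmacSHA256"));
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String header = mint("user-42");    // gateway adds e.g. X-Identity: <token>
        System.out.println(verify(header)); // any downstream hop prints: true
    }
}
```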
Another feature that helps a lot is routing. Remember, we want to decouple clients from knowing the underlying structure and shape of our edge; that's why we send everything to the API gateway, Zuul, and let it figure things out. Imagine at some point an engineer on the API team realizes the API is huge and handling too much traffic; it needs to be sharded. So we create a new cluster with the API and decide we will send only traffic from TV devices there. The dotted line on the slide is the potential route for that traffic.

And that's not all; that's just the edge layer. What about the mid-tier services? The API needs to talk to A/B, a service that tells you whether a customer is part of any A/B test. Imagine I'm an engineer on the A/B team and I want to test my bleeding-edge changes. I don't want to merge them to master yet, but I really want to see how the end-to-end flow works. Maybe that's not the ideal way to test, but I really want it. So what do I do? I create my own cluster, name it after myself, and I need to route some traffic there. It's probably not a good idea to route customer traffic to the bleeding-edge application I just built on my own. So I go to Zuul and configure a routing rule saying that this particular customer, who is myself, should be routed to this bleeding edge. When a request comes from the TV I own with my account, Zuul checks which rules have to be applied. It applies the rule to route to the API TV cluster, but it also checks that I am the specific user whitelisted for the bleeding-edge application, and adds a header saying that my request should be sent to the bleeding edge instead of the standard cluster; it's called a deep override rule. Then my traffic is routed to the bleeding edge. Perfect. Now imagine someone else opens their laptop: the route-to-TV-UI-cluster rule is not triggered, so the request goes to the global API cluster, and since this is not a whitelisted customer, the traffic is routed to the standard A/B cluster. This feature is very important, because it decouples routing and sharding and enables everyone at Netflix to manage their own traffic, within the data center and at the edge.

The question is: what can go wrong? Well, routing is configuration. Raise your hand if you ever had to deal with an incident where a wrong config was pushed to production. All right, the majority of you, so you can relate. This is exactly what happened, several times: a config is pushed to production, the route is applied globally, and you can send 100% of the traffic to this bleeding-edge A/B cluster or somewhere else. The problem is that people usually don't know how much traffic they affect. So instead of locking this functionality down and dedicating an operator to configure routes for everyone all day, we decided to educate people. Instead of applying rules right away, we run a job that estimates how much traffic would be affected by the rule, and then a pop-up says: "Hey, you're going to send 100% of your traffic to your local machine; do you want to do that?" You can still do it; freedom and responsibility, Netflix culture. But at least you do it consciously, and after that we will talk.
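Pulling the routing story above into code, here is a toy sketch of the two rule types: a shard rule that picks the edge cluster for this hop, and an override rule that adds a header so a downstream call (here, to the A/B service) is redirected for a single user. The cluster names, header, and whitelisted user are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy version of gateway routing rules: most requests go to a default
// cluster, TV devices are sharded to their own cluster, and one whitelisted
// developer gets a downstream override to a "bleeding edge" build.
public class RoutingRules {
    record Request(String deviceType, String userId, Map<String, String> headers) {}

    static String edgeCluster(Request req) {
        return "tv".equals(req.deviceType()) ? "api-tv" : "api-global";
    }

    static Optional<String> downstreamOverride(Request req) {
        // Only the whitelisted developer is steered to the bleeding-edge A/B build.
        return "user-vasily".equals(req.userId())
                ? Optional.of("ab-bleeding-edge") : Optional.empty();
    }

    public static void main(String[] args) {
        Request fromTv = new Request("tv", "user-vasily", new HashMap<>());
        System.out.println(edgeCluster(fromTv)); // api-tv
        downstreamOverride(fromTv)
                .ifPresent(vip -> fromTv.headers().put("X-Route-Override-ab", vip));
        System.out.println(fromTv.headers());    // {X-Route-Override-ab=ab-bleeding-edge}
    }
}
```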
All right, the second statement: the API gateway provides insights and resiliency; client-perceived resiliency, I would say. How does it do this? The API gateway is centrally located; all the traffic goes through it. It's a choke point, so whatever concern you apply there applies to all of your traffic. It's a perfect place. So what do we do? We report metrics. When we report metrics from one place, they are always consistent: for all the backends, all the domains, all the clients, we have consistent metrics, and all the other teams at the company can build their tooling on these metrics. We can build dashboards, alerts, canary analysis, et cetera; you can do so much with that. We have a system called Atlas, an open source real-time dimensional time-series database, and if you haven't used it, I recommend you check it out.

Metrics are great. So something is happening: we see metrics spike, errors spike. What do we do? We debug. How do we debug? We need to see individual requests. How do we do that? At Netflix there is a system called Raven: a UI where you create a filter saying, whatever request matches this filter, send it to Mantis. Another term: Mantis is an open source platform for building real-time stream-processing applications; very powerful, and I recommend you check it out. So here, for example, I have an outage and I see that errors for iOS devices spiked. What do I do? I create a filter that says: whatever starts with "ios", if the response code is greater than or equal to 500, sample at five percent and send me those requests and responses. This is a way for me to debug without paying the price of logging and indexing everything. So: insights.

Since we already have an integration with Mantis, which is a streaming platform, what else can we do? Of course, we can build an anomaly-detection mechanism. Because all the traffic goes through one single place, we have a uniform picture, and we can alert and react. We stream all the errors to Mantis, and there is a service running, we call it Raju, that calculates an acceptable error rate for every single backend; if we cross the threshold for a long enough period of time, an alert is sent.

Cool, let's quickly talk about resiliency. When you build an API gateway, there are so many features you can put in. We decided to add custom load balancing, and we went with a choice-of-two approach; that helped a lot to mitigate certain issues during deployments of services that didn't go well. Choice-of-two load balancing basically means you randomly choose two instances and then decide which one to send the traffic to, based on criteria that you control. For us it was important to control those criteria, and there is an agreement between the API gateway and the backend service where the backend can send us some of its health information, which we can use if we don't have our own view of that instance. Another thing we do is retry on behalf of clients, and this is why I said client-perceived resiliency: we don't necessarily improve resiliency, but if we chose an instance that is bad, sent it traffic, and got a 500 back, we can retry on behalf of the client. It's not always possible, but more often than not we do retry on behalf of the client.
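A minimal sketch of the choice-of-two balancing just described: pick two instances at random, then send the request to whichever looks better by a criterion you control. Using in-flight request count as that criterion is an illustrative assumption, not the gateway's actual formula.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Choice-of-two load balancing: sample two (possibly equal) random instances
// and route to the one that looks healthier.
public class ChoiceOfTwo {
    record Instance(String id, int inflight) {}

    static Instance pick(List<Instance> instances) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        Instance a = instances.get(rnd.nextInt(instances.size()));
        Instance b = instances.get(rnd.nextInt(instances.size()));
        // The comparison criterion is the part the gateway controls; it could
        // also incorporate health info the backend reports about itself.
        return a.inflight() <= b.inflight() ? a : b;
    }

    public static void main(String[] args) {
        List<Instance> pool = List.of(
                new Instance("i-1", 12), new Instance("i-2", 3), new Instance("i-3", 40));
        System.out.println(pick(pool).id()); // usually a lightly loaded instance
    }
}
```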
Good, let's summarize stage two. We wanted to optimize engineering velocity. We started by introducing microservices, and we said thank you to the API team, which helped us hide the complexity. Then we started splitting the edge and introducing additional services to support that: we introduced additional domains, and we introduced an init service, so that an application can fetch information about what to call before it starts working. We also introduced the concept of an API gateway, with Zuul as one implementation, which reduces coupling between clients and services and is a leverage point for all the cross-cutting concerns: authentication, rate limiting, request enrichment, et cetera.

So what's next? We have a business, a successful one, and we've scaled our engineering organization. Next is resiliency and quality of service. Before we talk about those, I want to put another bold statement on the screen: most incidents are self-inflicted. If our underlying infrastructure promises a lot of nines, it does mean something, but it doesn't mean a lot, because the only way to prevent outages is to not do anything. And we don't want to not do anything; we want to be very agile and deploy very often.

How do we do that? Let's look at the Netflix example and what Netflix did. In 2013 and 2014, Netflix invested a lot of effort into supporting multi-region deployment. Not only did they start deploying in multiple regions, they also staged deployments and made sure that all the regions are active. What does that mean? If one region goes bust, we can simply reroute clients to a different region and they will get service. Not only did active-active data replication have to happen, but a lot of edge concerns had to evolve too. Let's talk about an even edgier concern than the ones so far: we've covered the API gateway; now we get to the DNS layer. So far we only had multiple DNS records. Let's take one and see how Netflix did failover. api.netflix.com is a canonical name for api.geo.netflix.com, which is resolved by a system called UltraDNS based on the physical location of the resolver, which approximates the physical location of the client. So if you are in the US West, you are sent to the US West AWS region; North America East is sent to East; South America is also sent to East; and, more or less, the rest of the world, some parts of Asia excluded, is sent to Europe. That resulted in East being a heavier region, because more traffic goes there, but the introduction of this virtual fourth region also allowed us to split traffic more granularly.

Since I mentioned that US East was the biggest region, let's try to evacuate it; let's see what happens when there is an outage and we decide it's time to evacuate the region. First of all, we change records to point to load balancers in different regions. UltraDNS still returns the CNAME that is specific to your region, but the underlying IPs are not the same: they start pointing to the load balancers in the region you are being routed to. So US East and North America are routed to US West, and South America is routed to Europe. Simultaneously, Zuul, our API gateway, opens HTTP connections to the regions that are healthy. This is very important, because DNS gets cached, and there is this property called TTL on DNS records. We set it at 60 seconds, so we assume that within 60 seconds clients will come and refresh their DNS, and they usually do. But DNS TTL is a myth; it's a myth that is widely accepted, and many resolvers believe in it, but some do not. That's why you still see a little bit of traffic going to the region that was evacuated. We never saw 100% of the traffic leave a region; we always see a trickle, and we cannot punish customers, or our clients, for their resolvers, right? That's why we keep cross-region proxying of that traffic in place. Makes sense? Good.

Let's talk about stage three. We wanted to focus on resiliency, and while focusing on resiliency we also improved latency a little, because now we have three regions and can send clients to the region that is closer to them, so they potentially get a bit better service. Active-active data replication is very complicated, but thanks to the great engineers who did it, it was working. Edge concerns had to evolve: we needed to get to the level of DNS, of geo-DNS traffic steering. It's not just one record, not multiple records; it's a dynamic system now, and it has to be managed carefully. We built tooling around this DNS management. Why? Simply because we introduced so many domain names; imagine doing this failover flip manually.
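A toy model of the geo-DNS failover described above: each client geography has a home region, and evacuating a region re-points its traffic to a designated healthy region (with the gateway additionally proxying any stragglers whose resolvers ignore the 60-second TTL). The mappings are illustrative, not Netflix's real steering tables.

```java
import java.util.Map;

// Geo-steering with an evacuation override: resolve() returns the region a
// geography's DNS answer should point at, given which region (if any) is
// currently being evacuated.
public class RegionSteering {
    static final Map<String, String> HOME = Map.of(
            "north-america-west", "us-west",
            "north-america-east", "us-east",
            "south-america", "us-east",
            "europe", "eu-west");

    // Where each geography goes when its home region is evacuated.
    static final Map<String, String> FAILOVER = Map.of(
            "north-america-east", "us-west",
            "south-america", "eu-west");

    static String resolve(String geo, String evacuatedRegion) {
        String home = HOME.get(geo);
        return home.equals(evacuatedRegion)
                ? FAILOVER.getOrDefault(geo, home) : home;
    }

    public static void main(String[] args) {
        System.out.println(resolve("south-america", "us-east")); // eu-west
        System.out.println(resolve("europe", "us-east"));        // eu-west (unaffected)
    }
}
```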
We really needed a tool. And we implemented cross-region traffic proxying between the gateways to help customers who cannot rely on their resolvers.

Let's recap real quick: we started with a three-tier architecture, a monolith; introduced microservices; split our edge; and worked on our resiliency. What's next? The next concern that bothers us is the speed of light. Guesses why? Exactly: the speed of light is finite, and we haven't found out how to fix that, so we need to work around it. Distance affects the round-trip time between two places on Earth; the latency exists whenever we want to transfer information. Let's get back to our example: a customer in South America trying to connect to AWS in the US. There is latency; let's say it's 100 milliseconds round trip, which is very optimistic, by the way. What happens when you establish a connection to a server? Most communication these days, at least until the QUIC protocol and HTTP/3 arrive, happens over TCP, and thankfully most of it happens over a secure protocol, TLS. So in order to send bytes to a server, I need to establish a TCP connection, and after that I usually need to do a TLS handshake; there are tricks to work around this, but not all clients support them. This is how the TCP and TLS handshakes work: if we assume 100 milliseconds of latency between the client and AWS, we spend 100 milliseconds on the TCP handshake and 200 milliseconds on the TLS handshake, because the client sends ClientHello, the server responds with ServerHello and a certificate, and they finally finish the key exchange. It takes two round trips to do TLS, and only then do we send the request. So for me to send my first request, I need to spend 300 milliseconds, and as I said, that's the very optimistic case.

There are other challenges with a client sitting far away from the data centers. For example, we all use wireless networks, and wireless networks are lossy. Whenever I have a connection problem, it's usually between me and my ISP, but to repair a TCP packet loss on my long, 100-millisecond connection to AWS, I pay quite a bit of time: usually one round trip to detect the loss and one and a half round trips to fix it. Another problem: you've heard the metaphor of the internet as pipes, right? Pipes get congested, and the longer the distance between two points, the higher the chance the pipes will be congested.

All right, real quick: TCP, TLS, lossy connections, congestion; quite a few problems. How do we solve them? We trick the clients. We put an intermediary between the client and our data center; we refer to it as a PoP, a point of presence. Think of a PoP as a proxy that accepts the client's connection, terminates TLS, and then, over a backbone, another concept we've just introduced, sends the request to the primary data center, the AWS region. What is a backbone? Think of it as a private internet connection between your point of presence and AWS. It's like your private highway: everyone else is stuck in traffic, and you are driving on your private highway. So how does this change the interaction and the quality of the client's experience, if we put a point of presence between them? The diagram gets much more complicated, but you can see that the round-trip time between the client and the point of presence is much lower.
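Here is the back-of-the-envelope math behind those numbers: a TCP handshake costs one round trip and a classic TLS handshake two more, so the first request byte leaves after roughly three RTTs. The 30-millisecond client-to-PoP RTT below is an assumption chosen to match the talk's 90 ms vs. 300 ms comparison, not a quoted figure.

```java
// Time before the first request byte can be sent, assuming a cold connection:
// 1 RTT for TCP (SYN / SYN-ACK) plus 2 RTTs for classic TLS
// (ClientHello/ServerHello+certificate, then the key exchange).
public class HandshakeCost {
    static int timeToFirstRequestMs(int rttMs) {
        int tcp = rttMs;      // TCP three-way handshake
        int tls = 2 * rttMs;  // two TLS round trips
        return tcp + tls;
    }

    public static void main(String[] args) {
        System.out.println(timeToFirstRequestMs(100)); // direct to the region: 300 ms
        System.out.println(timeToFirstRequestMs(30));  // to a nearby PoP:       90 ms
        // The PoP keeps a warm, already-scaled connection to the data center,
        // so the long-haul handshake cost is not paid per client.
    }
}
```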
This means the TCP and TLS handshakes happen faster, and therefore, to send the first bytes of my request, I need to wait only 90 milliseconds; compare this to the 300 milliseconds we started with. When we send the request to the point of presence, the point of presence already has an established connection to our primary data center; it's already scaled, and we are ready to reuse it. It's a good idea to also use protocols that allow you to multiplex requests, so that we can send multiple requests over the same TCP connection; one of them is HTTP/2. Not all clients can speak HTTP/2, but here you control the client within the point of presence and you control the server on the other end; in our case it is Zuul, and we control the code base, so it speaks HTTP/2 and the point of presence speaks HTTP/2. So we can take HTTP/1 traffic, funnel it into HTTP/2, and improve the client's connectivity.

Summarizing stage four: we wanted to improve client connectivity by reducing the time spent on TLS and TCP handshakes; congestion avoidance and recovery from TCP packet loss also needed to improve. For that we introduced the concept of a point of presence, and the concept of a backbone, a private internet connection. A backbone can be built, bought, or rented; some CDN providers these days allow a backbone to be rented, and AWS has a service for this called, I think, Global Accelerator, where the idea is pretty much the same. I didn't talk much about steering traffic to the PoP, but there are alternative ways of steering, other than DNS, that you can explore. And because we control the code base in the PoP and the code base in the data center, we can introduce protocols that we cannot roll out to all the clients.

To summarize: in roughly 40 minutes we made a journey that some companies can only go through in several years, so it's quite a success, in my opinion. If there is one takeaway I want to leave you with, it's this statement: a well-designed edge enables the evolution of the business. Think wisely when you make choices that affect your edge. Let's open it up for questions. Thank you.
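A small sketch of the request multiplexing described above, using the JDK's built-in HTTP client, which negotiates HTTP/2 where the server supports it so that several in-flight requests can share one connection. The URL is a placeholder, not a Netflix endpoint.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Fire several requests concurrently; over HTTP/2 they are multiplexed on a
// single TCP connection instead of paying a handshake per request.
public class MultiplexedCalls {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2) // falls back to HTTP/1.1 if unsupported
                .build();

        List<CompletableFuture<HttpResponse<String>>> inFlight = List.of("/a", "/b", "/c")
                .stream()
                .map(path -> HttpRequest.newBuilder(URI.create("https://example.com" + path)).build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString()))
                .toList();

        // All three responses may arrive over the same multiplexed connection.
        inFlight.forEach(f -> System.out.println(f.join().statusCode()));
    }
}
```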
Q: You mentioned that PoPs nowadays can be rented. If that's the case, does TLS still terminate at the PoP, and how secure is it to do that?

A: So the question is: if the PoP is rented, so you don't control it, and you put it in an ISP location, how do you terminate TLS in that case? Your best bet would be TLS sessions: you would offload your TLS termination to something that you control, do the handshake once, issue a TLS session ticket to the client, and then reuse that ticket. I think that's the best approach.

Q: I think I heard you mention earlier that Zuul and the ALB were considered kind of interchangeable, but in some other slides I saw the ALB as a node located between the Zuul below it and something like api.us-east-1.netflix.com above it. Can you maybe elaborate on the difference between the ALB and Zuul?

A: Yes, the ALB is still in the picture. The ALB is used to terminate TLS, because that's not a concern Zuul wants to have, and then to route traffic to a Zuul instance. In some cases that's not what happens: because we need HTTP/2 support and we want to support ALPN, we terminate TLS on Zuul itself, simply because, as of now, Amazon does not support ALPN on the ALB. We do this on Zuul, but most of the traffic goes through the ALB. Does that answer your question?

Q: You were talking earlier about having a routing configuration: when traffic comes into the Zuuls, you look at the routing config to figure out where to send it. I was curious whether that applies only from Zuul to wherever the traffic goes next, or at all the internal layers the traffic might go through.

A: The question is whether we can route only the next hop after Zuul, or further down as well. There are two different use cases. The first use case is steering a lot of traffic from one place to another, and that is mainly done on the first hop. But there is also a rule type we apply at the Zuul layer called CRR, custom request routing: if a request matches certain criteria, I can override the target not only for the next hop but for any call in the chain that wants to call some other service. We have this concept of a VIP at Netflix, so at the Zuul layer we can say: for the A/B application, please override the VIP to this new VIP, and this rule will be honored down the chain of invocation. Does that answer your question? Good questions. All right, thank you, Vasily.
Info
Channel: InfoQ
Views: 10,118
Rating: 4.902041 out of 5
Keywords: Netflix, Software Architecture, Edge, DevOps, Case Study, Operation Management, QCon Plus, InfoQ, Agile, Project Management, Transcripts
Id: k01PHa5YDpQ
Length: 43min 1sec (2581 seconds)
Published: Mon Mar 22 2021