Mastering Chaos - A Netflix Guide to Microservices

Reddit Comments

Nora's ReInvent or Velocity talks are more up-to-date and dive a bit deeper. My Velocity talk from earlier this year dives deeper on minimizing blast radius. I believe all of these are available on YT.

Also, we wrote a (free) book!: http://www.oreilly.com/webops-perf/free/chaos-engineering.csp

— u/aaronblohowiak, 3 points, Dec 11 2017
Captions
All right, wow, full room. I'm just going to jump right in. About 15 years ago my stepmother, and we'll call her Frances for the sake of this conversation, became ill. She had aches through her whole body, she had weakness, she had difficulty standing, and by the time they got her to the hospital she had paralysis in her arms and legs. It was a terrifying experience for her, and obviously for us as well. It turns out she had something called Guillain-Barré syndrome. Anybody heard of this before? I'm just curious. Oh, lots of people. Good - well, not good; hopefully you haven't had first-hand experience with it. It's an autoimmune disorder, and a really interesting one. It's triggered by some kind of external factor, which is fairly common with autoimmune disorders, but what's interesting about this one is that the antibodies directly attack the myelin sheath that wraps around the axon, that long section of the nerve cell, and essentially eat away at it such that the signals become very diffuse. So you can imagine that these symptoms of pain and weakness and paralysis are all quite logical once you understand what's going on. The good news is that it's treatable, either through plasmapheresis, where they filter your blood outside your body, or antibody therapy, and the latter was successful for Frances.

What was also interesting was her reaction to this condition. She became much more disciplined about her health: she started eating right, exercising more, she took up Qi Gong and Tai Chi, and she was committed to never having this kind of event happen to her again. What this event really underscores for me is how amazing the human body is, and how something as simple as the act of breathing or interacting with the world is actually a pretty miraculous thing - an act of bravery, to a certain extent. There are so many forces in the world, so many allergens and bacterial infections and various things that can cause problems for us.

So you might be wondering why the hell we're talking about this at a microservices talk. Just as breathing is a miraculous act of bravery, so is taking traffic in your microservice architecture. You might have traffic spikes, you might have a DDoS attack, you might introduce changes into your own environment that take the entire service down and prevent your customers from accessing it. And so this is why we're here today: we're going to talk about microservice architectures, which have huge benefits, but also about the challenges and the solutions that Netflix has discovered over the last seven years wrestling with a lot of these kinds of failures and conditions. I'll do a bit of an introduction, I'll spend a little bit of time level setting on microservice basics so we're all using the same vocabulary, then we'll spend the majority of our time on the challenges and solutions Netflix has encountered, and then we'll spend a little bit of time on the relationship between organization and architecture and how that's relevant to this discussion.

By way of introduction: hello, I'm Josh Evans. I had a career before this, but the most relevant part is that in 1999 I joined Netflix, about a month before the subscription DVD service launched. I was in the commerce space as an engineer and then a manager, and got to see, from an e-commerce perspective, how we integrated streaming into the existing DVD business. In 2009 I moved right into the heart of streaming, managing a team that today is called playback services - the team that does DRM and manifest delivery and records the telemetry coming back from devices. I also managed this team during the time when we were going international, getting onto every possible device in the world, and during the big project of moving from data center to cloud, so it was quite an interesting and exciting time. For the last three years I managed a team called operations engineering, where we focused on operational excellence: engineering velocity, monitoring and alerting, delivery, chaos engineering - a whole wide variety of functions to help Netflix engineers be successful operating their own services in the cloud. You'll see there's an end date there: I actually left Netflix about a month ago, and today I'm thinking a lot about Arianna Huffington - catching up on sleep for the first time in quite some time, taking some time off, spending time with my family, trying to figure out what this work-life balance thing looks like. Actually it's mostly life right now, which will be a great shift from what I was doing before.

Netflix, as you know, is the leader in subscription internet TV. It produces or licenses Hollywood, independent, and local content, has a growing slate of pretty amazing originals, and at this point has about 86 million members globally and is growing quite rapidly. Netflix is in about a hundred and ninety countries today, has localized into tens of languages - user interface and subtitles - runs on thousands of device platforms, and all of this runs on microservices on AWS.

So let's dig in and talk about microservices in the abstract, and I'd like to start with what microservices are not. I'm going to go back to 2000, my early days at Netflix, when we were a web-based business where people put DVDs in their queue and had them shipped out and returned. We had a pretty simple infrastructure. This was in a data center: a hardware-based load balancer - actually very expensive hardware - in front of Linux hosts running a fairly standard configuration of an Apache reverse proxy and Tomcat, and one application that we called Javaweb, which held pretty much everything in Java that our customers needed to access. This was connected directly to an Oracle database using JDBC, which was then interconnected with other Oracle databases using database links.

The first problem with this architecture was that the code base for Javaweb was monolithic, in the sense that everybody was contributing to one code base that got deployed on a weekly or bi-weekly basis. When a change was introduced that caused a problem, it was difficult to diagnose: we probably spent well over a week troubleshooting a slow-moving memory leak that took about a day to manifest, pulling out pieces of code and running it again to see what would happen, and because so many changes were rolling into that one application, it took an extended period of time. The database was also monolithic, in an even more severe sense: it was one piece of hardware running one big Oracle database that we called the store database, and when it went down, everything went down. Every year, as we got into the holiday peak, we were scrambling to find bigger and bigger hardware so that we could vertically scale. Probably the most painful piece from an engineering perspective, other than the outages, was the lack of agility. Because everything was so deeply interconnected - direct calls into the database, many applications directly referencing table schemas - I can remember adding a column to a table being a big cross-functional project. So this is a great example of how not to build services today, even though it was the common pattern back in the late 90s and early 2000s.

So what is a microservice? Does anybody want to volunteer their understanding or definition? I'm curious... somebody, some brave soul... there we go. Say it again - context bound and data ownership. I like that; that's definitely a key piece. I'm going to give you the Martin Fowler definition, which is a good place to start: "The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API." It's a somewhat abstract definition - technically correct, but it doesn't really give you enough of a flavor of what it means to build microservices.

When I think about it, I think of it as an extreme reaction to the experience I had back in 2000 with monolithic applications. Separation of concerns is probably the most critical thing it encourages: modularity, the ability to encapsulate your data structures so that you don't have to deal with all of that coordination. Scalability: microservices lend themselves to horizontal scaling if you approach them correctly, and to workload partitioning - because it's a distributed system, you can break your work into smaller, more manageable components. And none of this really works well, from my perspective, unless you're running in a virtualized and elastic environment. It is much, much harder to manage microservices otherwise: you need to automate your operations as much as possible, and on-demand provisioning is a huge benefit that I would not want to give up when building this.

Going back to the theme of the human body and biology, you can think of microservices as organs, and organ systems that come together to form the overall organism. So let's look at the Netflix architecture a little and see how that maps. There's a proxy layer behind the ELB called Zuul that does dynamic routing. There's a tier that was our legacy tier, called NCCP, that supported our earlier devices plus fundamental playback capability. And there's the Netflix API, our API gateway, which is really core to our modern architecture, calling out into all of the other services to fulfill requests for customers. This aggregate set we consider our edge services; there are a few auxiliary services as well, like DRM, that are also part of the edge. And then the soup on the right-hand side is a combination of middle-tier and platform services that enable the service to function overall.

To give you a sense of what these organs look like, here are a few examples. We have an A/B testing infrastructure, and there's an A/B service that returns values if you want to know what tests a customer should be in. We have a subscriber service that is called from almost everything to find out information about our members. There's a recommendation system that provides the information necessary to build the lists of movies presented to each customer as a unique experience. And then there are platform services that perform the more fundamental capabilities: routing so the microservices can find each other, dynamic configuration, cryptographic operations, and of course the persistence layers. These are the kinds of objects that live in this ecosystem.

Now I also want to underscore that microservices are an abstraction. We tend to think of them very simplistically: here's my nice horizontally scaled microservice, and people are going to call me. Which is great if it's that simple, but it's almost never that simple. At some point you need data - it might be subscriber information, it might be recommendations - and that data is typically stored in your persistence layer. Then for convenience, and this is really a Netflix approach that many of us have embraced but is definitely specific to Netflix as well, you start providing client libraries - we're mostly Java-based, so client libraries for doing those basic data-access operations. At some point, as you scale, you'll probably need to front this with a cache, because the service plus the database may not perform well enough, and so you'll have cache clients as well. And now you need to think about orchestration: hit the cache first; if that misses, go to the service, which calls the database; return the response; and then backfill the cache so that it's hot the next time you call it, which might be just a few milliseconds later. This client library gets embedded within the applications that want to consume your microservice, so it's important to realize that from their perspective, this entire set of technologies - this whole complex configuration - is your microservice. It's not the very simple stateless thing, which is nice from a purist's perspective; it actually has these complex structures to it.
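To make that orchestration concrete, here is a minimal sketch of what such a client library might do internally - try the cache, fall through to the service (which owns the database), and backfill the cache on a miss. The class and interface names are hypothetical stand-ins, not Netflix's actual client code.

```java
// Hypothetical sketch of the cache -> service -> backfill orchestration
// hidden inside a client library. Names are illustrative only.
public final class SubscriberClient {

    interface Cache {
        Subscriber get(String id);          // returns null on a miss
        void put(String id, Subscriber s);  // backfill after a miss
    }

    interface SubscriberService {
        Subscriber load(String id);         // remote call that ultimately hits the database
    }

    private final Cache cache;
    private final SubscriberService service;

    public SubscriberClient(Cache cache, SubscriberService service) {
        this.cache = cache;
        this.service = service;
    }

    public Subscriber getSubscriber(String id) {
        Subscriber cached = cache.get(id);   // 1. try the cache first
        if (cached != null) {
            return cached;
        }
        Subscriber fresh = service.load(id); // 2. miss: call the service
        cache.put(id, fresh);                // 3. backfill so the next call is hot
        return fresh;
    }

    public static final class Subscriber {
        public final String id;
        public Subscriber(String id) { this.id = id; }
    }
}
```

The consuming application only ever calls getSubscriber(id); the cache client, the service client, and the backfill logic all ship inside the library, which is exactly why the whole bundle is effectively part of the microservice.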
So that's the level set on microservices. Now let's dig in on the challenges that we've encountered over the last seven years and some of the solutions and philosophies behind them. I love junk food, and I love this image, because in many cases the problems and solutions have to do with the habits we have and how we approach microservices - the goal is to get us to eat more vegetables. We're going to break this down into four primary areas, four dimensions in terms of how we address these challenges: dependencies, scale, variance within your architecture, and how you introduce change.

We'll start with dependencies, and I'm going to break this down into four use cases. First, inter-service requests: the call from microservice A to microservice B in order to fulfill some larger request. Just as with the nerve cells and conduction we talked about earlier, everything's great when it's all working, but when it's challenged it can feel like you're crossing a vast chasm. When a service calls another service, you've taken on a huge risk just by going off-process and off your box. You can run into network latency and congestion, you can have hardware failures that prevent routing of your traffic, or the service you're calling might be in bad shape - it might have had a bad deployment with some kind of logical bug, or it might not be properly scaled - so it can simply fail or be very slow, and you might end up timing out when you call it. The disaster scenario, and we've seen this more often than I'd like to admit, is when one service fails and, without proper defenses against that failure, it cascades and takes down your entire service for your members. And god forbid you deployed that bad change to multiple regions, if you have a multi-region strategy, because now you have no place to go to recover - you just have to fix the problem in place.

To deal with this, Hystrix was created, and it has a few really nice properties. It has a structured way of handling timeouts and retries. It has the concept of a fallback: if I can't call service B, can I return some static response, or something that allows the customer to continue using the product instead of simply getting an error? And the other big benefit of Hystrix is isolated thread pools and the concept of circuits: if I keep hammering away at service B and it just keeps failing, maybe I should stop calling it, fail fast, return that fallback, and wait for it to recover. This has been a great innovation for Netflix, and it's been used quite broadly.
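Here is a minimal sketch of that pattern using Hystrix's open-source API: a timeout, a circuit breaker, and a static fallback wrapped around a call to a downstream service. The subscriber lookup, the 500 ms timeout, and the fallback value are invented for illustration and are not actual Netflix settings.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Wraps a call to a downstream service with a timeout, a bulkheaded
// thread pool, a circuit breaker, and a static fallback.
public class GetSubscriberCommand extends HystrixCommand<String> {

    private final String subscriberId;

    public GetSubscriberCommand(String subscriberId) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("SubscriberService"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(500)   // fail slow calls fast
                        .withCircuitBreakerEnabled(true)));        // stop hammering a sick service
        this.subscriberId = subscriberId;
    }

    @Override
    protected String run() throws Exception {
        // The real remote call to service B would go here (hypothetical helper).
        return callSubscriberService(subscriberId);
    }

    @Override
    protected String getFallback() {
        // Static, degraded response so the member can keep using the product.
        return "anonymous-profile";
    }

    private String callSubscriberService(String id) {
        throw new RuntimeException("downstream unavailable"); // placeholder failure
    }

    public static void main(String[] args) {
        // execute() returns the fallback if the call fails, times out,
        // or the circuit is open.
        System.out.println(new GetSubscriberCommand("123").execute());
    }
}
```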
But then the fundamental question comes in: now I've got all my Hystrix settings in place and I think I've got them right, but how do you really know it's going to work - and especially, how do you know it's going to work at scale? The best way to do this, going back to our biology theme, is inoculation: you take a dead version of a virus and inject it to develop the antibodies that defend against the live version. Likewise, fault injection in production accomplishes the same thing, and FIT, the fault injection test framework, was created to do this. You can run synthetic transactions, which are overridden at the account or device level, or you can apply it to a percentage of live traffic: once you've determined that everything works functionally, you want to put it under load and see what happens with real customers. And you want to be able to test it no matter how the service gets called, directly or indirectly, so your requests need to be decorated with the right context so that you can fail them universally - just as if the service were really down in production, without actually taking it down.

This is all great from a point-to-point perspective, but imagine you've got a hundred microservices, each of which might depend on one or more other services. There's a big challenge in constraining the scope of the testing so that you're not testing millions of permutations of services calling each other. This matters even more from an availability perspective. Imagine you've only got 10 services in your entire microservice infrastructure and each one of them is up for four nines of availability; that gives each of them about 53 minutes a year that it can be down. That's great as an availability number, but when you combine them, the aggregate failures over the year leave you with roughly three nines of availability for your overall service - somewhere in the ballpark of 8 to 9 hours a year. A big difference.

To address this, we defined the concept of critical microservices: the ones necessary for basic functionality to work. Can the customer load the app, browse and find something to watch - it might just be a list of the most popular movies - hit play, and have it actually work? We identified those services as a group and then created recipes that essentially blacklist all of the other, non-critical services. This way we can actually test - and we have tested, for short periods of time in production - that the service still functions when all of those dependencies go away. It's a much simpler approach than testing all of the point-to-point interactions, and it has been very successful at Netflix in finding critical errors.

Now let's talk about client libraries, shifting gears completely. When we first started moving to the cloud we had some very heated discussions about client libraries. There were a bunch of folks who had done great work at Yahoo and come to Netflix who were espousing the model of bare-bones REST: just call the service, don't create any client libraries, don't deal with all of that. And yet there's a really compelling argument for building client libraries: if I have common logic and common access patterns for calling my service, and I've got 20 or 30 different dependent teams, do I really want every one of those teams writing the same or slightly different code over and over again, or do I want to consolidate that into common business logic and common access patterns? This was so compelling that it's what we did. The big challenge is that it's a slippery slope back towards a new kind of monolith, where our API gateway, which might be hitting a hundred services, is running a lot of code in process that its team didn't write. That takes us all the way back to 2000: lots of code running in one common code base.

It's a lot like a parasitic infection if you think about it. This nasty little thing here is not the size of Godzilla - it's not going to take down Tokyo - but it will infest your intestines, attach to your blood vessels, and drink your blood like a vampire. It's called a hookworm, and a full-blown infestation can lead to pretty severe anemia and make you weak. Likewise, client libraries can do all kinds of things you have no knowledge of that weaken your service: they might consume more heap than you expect, they might have logical defects that cause failures within your application, and they might have transitive dependencies that pull in conflicting library versions and break your builds. All of this has happened, especially to the API team, because they consume so many libraries from so many teams. There's no cut-and-dried answer here; it has been somewhat controversial even over the last year or so. The general consensus has been to simplify those libraries: there's no desire to move all the way to the bare-bones REST model, but there is a desire to limit the amount of logic and heap consumption happening in them, and to make sure people can make smart, thoughtful decisions on a case-by-case basis. We'll see how it unfolds; it's an ongoing conversation, and mostly I'm raising it so that all of you can be thoughtful about these trade-offs.

Persistence is something I think Netflix got right early on - there's a war story about how we got it wrong, too - and we got it right by starting off thinking about the right constructs and about the CAP theorem. How many people are not familiar with the CAP theorem for distributed systems? OK, we've got a few, so let's level set. The simplest definition that let me get my brain around it: in the presence of a network partition, you must choose between consistency and availability. In this case you might have a service running in network A that wants to write a copy of the same data to databases running in three different networks - in AWS, that might be three different availability zones. The fundamental question is what you do when you can't get to one or more of them: do you fail and give back an error, or do you write to the ones you can reach and fix it up afterwards? Netflix chose the latter and embraced eventual consistency, where we don't expect every write to be immediately readable from every replica we've written the data to. Cassandra does this really well and has lots of flexibility: the client might write to only one node, which then orchestrates writes to multiple nodes, and there's a concept of local quorum where you specify how many nodes must acknowledge that they've committed the change before you consider it written. That could be a single node, if you're willing to take on some durability risk in exchange for very high availability, or you can dial it up the other way and require all of the nodes you're writing to.
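As a rough illustration of that tunable trade-off, here is what dialing consistency down or up looks like with the DataStax Java driver for Cassandra (3.x-style API). The keyspace, table, and contact point are made up for the example.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableWrites {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("membership"); // hypothetical keyspace

        // Low durability, high availability: one replica acknowledging is enough.
        SimpleStatement fast = new SimpleStatement(
                "INSERT INTO subscribers (id, plan) VALUES ('123', 'standard')");
        fast.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(fast);

        // Stronger guarantee: a quorum of replicas in the local datacenter
        // must acknowledge before the write is considered successful.
        SimpleStatement safer = new SimpleStatement(
                "INSERT INTO subscribers (id, plan) VALUES ('123', 'standard')");
        safer.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(safer);

        cluster.close();
    }
}
```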
Let's move on. I'm going to talk only briefly about infrastructure, because it's a whole topic unto itself, but at some point your infrastructure - whether it's AWS or Google or something you built yourself - is going to fail. The point is not that Amazon can't keep their services up; they're actually very, very good at it. The point is that everything fails. If I were going to put blame anywhere for what happened on Christmas Eve of 2012, when the ELB control plane went down, it's that we put all our eggs in one basket: everything was in US-East-1. So when there was a failure - and by the way, we've induced enough of our own to know this is also true - there was no place to go. Netflix therefore developed a multi-region strategy with three AWS regions, such that if any one of them fails completely, we can push all the traffic over to the surviving regions. I did a talk on this earlier in the year, so I'd encourage you to take a look if you want to go really deep into the multi-region strategy and the reasons it evolved the way it did. For now I'm going to put a pin in this and move forward.

Let's talk about scale, and for scale I'm going to give you three cases: the stateless service scenario and the stateful service scenario, which are the two fundamental components, and then the hybrid, similar to the diagram we looked at earlier, where an orchestrated set of things comes together.

OK, another question: what's a stateless service? Anybody want to throw out their definition? OK, that's close, that's interesting. I'd start with: it's not a cache or a database. You're not storing massive amounts of data; you'll frequently have frequently-accessed metadata or configuration information cached in memory. Typically you won't have instance affinity, where you expect a customer to stick to a particular instance repeatedly. And the most important thing is that the loss of a node is essentially a non-event: it's not something we should spend a lot of time worrying about, and it recovers very quickly - you should be able to spin up a new node to replace a bad one relatively easily.

The best strategy here, going back to biology, is replication. Just as with mitosis, where cells are constantly dying and constantly being replenished, we can create instances on demand - and auto scaling accomplishes this. I'm sure people are familiar with auto scaling, but I can't underscore enough how fundamental it is and how much it is table stakes for running microservices in a cloud. You've got your min and your max, you've got a metric you use to determine when to scale up your group, and when you need a new instance spun up, you simply pull an image out of S3 and spin it up. The advantages are several: you get compute efficiency because you're typically using on-demand capacity; your nodes get replaced easily; and most importantly, when you get a traffic spike, or a DDoS attack, or you introduce a performance bug, auto scaling lets you absorb that change while you figure out what actually happened. This has saved us many, many times; I strongly recommend it.

And then of course you want to make sure it always works by applying chaos. Chaos Monkey was our very first chaos tool, and it simply confirmed that when a node dies, everything still continues to work. This has been such a non-issue for Netflix since we implemented Chaos Monkey - and I want to knock on wood as I say it - that losing an individual node just doesn't bring our service down anymore; it is very much the non-event we want it to be.
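The real Chaos Monkey is a Netflix OSS tool with far more safety controls; the toy sketch below, using the AWS SDK for Java (v1), only illustrates the core idea - pick a random instance in an auto scaling group, terminate it, and confirm that auto scaling replaces it and nothing bad happens. The group name is a placeholder.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

import java.util.List;
import java.util.Random;

// Toy "kill one random node and confirm nothing bad happens" exercise.
public class TinyChaosMonkey {
    public static void main(String[] args) {
        String groupName = "subscriber-service-v042"; // placeholder ASG name

        AmazonAutoScaling asg = AmazonAutoScalingClientBuilder.defaultClient();
        List<Instance> instances = asg.describeAutoScalingGroups(
                        new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(groupName))
                .getAutoScalingGroups().get(0).getInstances();

        String victim = instances.get(new Random().nextInt(instances.size())).getInstanceId();
        System.out.println("Terminating " + victim + " - auto scaling should replace it.");

        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
    }
}
```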
Now let's talk about stateful services. They are obviously the opposite of stateless: databases and caches, and sometimes a custom application that holds large amounts of data in internal caches. We had a service tier that did this, and as soon as we went multi-region and tried to come up with generic strategies for replicating data, it was the biggest problem we had - so I strongly recommend avoiding storing your business logic and your state in one application if you can. What's meaningful here is, again, the opposite of stateless: the loss of a node is a notable event. It may take hours to replace that node and spin up a new one, so you need to be much more careful.

I'm going to describe - tipping my hand a little - two different approaches we took to caching to underscore this. As I said, we had a number of people from Yahoo who had experience with squid caches and a pattern of dedicated nodes per customer: a given customer would always hit the same node for the cache, and there was only one copy of that data. The challenge, of course, is that when that node goes down, you've got a single point of failure, and those customers can't access the service. Even worse, because this was in the early days, we didn't have proper Hystrix settings in place, we didn't have the bulkheading and isolation of thread pools, and I can still remember being on a call where one node went down and all of Netflix went down along with it. It took us three and a half hours to pump that cache back up - to wait for it to refill itself - before we could fulfill requests. So that's the anti-pattern: the single point of failure.

Going back to biology, redundancy is fundamental. We have two kidneys so that if one fails we still have another; we have two lungs, same thing - they give us increased capacity, but we can live with only one. Just as the human body does that, Netflix approached caching with a technology called EVCache. EVCache is essentially a wrapper around memcached. It is sharded, similar to the squid caches, but multiple copies are written out to multiple nodes: every write goes to nodes in different availability zones, so the copies are spread across the network partitions. Likewise when we read, reads are local, because you want that local efficiency, but the application can fall back to reading across availability zones if it needs to reach those other nodes. This is a success pattern that has been repeated throughout: EVCache is used by virtually every service at Netflix that needs a cache today, and it has been highly useful to us.

Now the combination of the two: the hybrid scenario we talked about earlier. It's very easy in this case to take EVCache for granted, and let me tell you why. It can handle 30 million requests per second across the clusters we have globally - on the order of two trillion requests a day - it stores hundreds of billions of objects in tens of thousands of memcached instances, and, the biggest win, it consistently scales linearly, so requests come back within a matter of milliseconds no matter what the load is. Obviously you need to add enough nodes, but it scales really well.

We had a scenario several years ago where our subscriber service was leaning on EVCache a little too much, and this is another anti-pattern worth talking about. It was called by almost every service - everybody wants to know about the subscriber: what's their customer ID, how do I access some other piece of information. It had online and offline calls going to the same EVCache cluster: the batch processes doing recommendations and looking up subscriber information, plus the real-time call path. In many cases it was called multiple times, even within the same application, during the lifecycle of a single request; it was treated as if you could freely call the cache as often as you wanted, and at peak we were seeing a load of 800,000 to a million requests per second against that tier. The fallback was a logical one when you think about it from a one-off perspective: I just got a cache miss, let me go call the service, which calls the database. The problem was that when the entire EVCache layer went down, the fallback was still to the service and the database, and the service and the database couldn't possibly handle the load that EVCache had been shouldering. So with that excessive load, we saw EVCache go down, and it took down the entire subscriber service.

The right approach was to fail fast, and the solutions were several. First, stop hammering away at the same set of systems for batch and real-time. Second, do request-level caching, so you're not repeatedly calling the same service as if it were free: make the first hit expensive and the rest of them free throughout the lifecycle of the request. Third, something we haven't done yet but will very likely do, is to embed a secure token within the devices themselves that they pass with their requests, so that if the subscriber service is unavailable you can fall back to the data stored in that encrypted token; it should have enough information to identify the customer and perform the fundamental operations that keep the service up and give that customer some kind of reasonable experience. And then of course you want to put all of this under load using chaos exercises and tools like the ones we've talked about.
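Here is a minimal sketch of that request-level caching idea, assuming a hypothetical lookup API: memoize the subscriber lookup for the lifetime of one request so the first hit is expensive and every later hit is free, and fail fast when the cache layer itself is down instead of letting all that traffic fall through to the service and database. None of this is Netflix code; the names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Per-request memoization: build one of these at the start of a request,
// pass it to everything that needs subscriber data, and throw it away at the end.
public final class RequestScopedSubscriberLookup {

    private final Map<String, String> perRequestCache = new HashMap<>();
    private final Function<String, String> distributedCacheLookup; // e.g. an EVCache-style client
    private final Function<String, String> serviceLookup;          // service + database path

    public RequestScopedSubscriberLookup(Function<String, String> distributedCacheLookup,
                                         Function<String, String> serviceLookup) {
        this.distributedCacheLookup = distributedCacheLookup;
        this.serviceLookup = serviceLookup;
    }

    public String get(String subscriberId) {
        // First hit in this request is expensive; every later hit is free.
        return perRequestCache.computeIfAbsent(subscriberId, this::loadOnce);
    }

    private String loadOnce(String subscriberId) {
        final String fromCache;
        try {
            fromCache = distributedCacheLookup.apply(subscriberId);
        } catch (RuntimeException cacheLayerDown) {
            // When the whole cache layer is down, falling back to the service and
            // database would send them traffic they cannot possibly absorb.
            // Fail fast instead (or fall back to data in a device token, if present).
            throw new IllegalStateException("subscriber cache unavailable - failing fast", cacheLayerDown);
        }
        // An ordinary one-off miss is fine to serve from the service + database.
        return fromCache != null ? fromCache : serviceLookup.apply(subscriberId);
    }
}
```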
Now let's move on and talk about variance - variety in your architecture. The more variance you have, the greater your challenges, because it increases the complexity of the environment you're managing. Let's talk about two use cases: operational drift that happens over time, and the introduction over the last few years of new languages and containers into the architecture.

Operational drift is unintentional - you don't do it on purpose, but it happens quite naturally. Drift over time looks like alert thresholds that need to be maintained because they change over time; timeouts and retry settings that need to change, maybe because you've added a new batch operation that should take longer; throughput that degrades over time unless you're constantly squeeze testing, because new functionality tends to slow things down. You also get drift across services: say you found a great practice for keeping services up and running, but only half of your teams have actually embraced it. The first time we reach out to teams and say, hey, let's get your alerts tuned, let's do some squeeze testing, let's get you tuned up so your service is highly reliable and performs well, we usually get a pretty enthusiastic response. But humans are not very good at repetitive manual work; most people would rather be doing something else, or they need to do their day job - build product for their product managers, roll out the next A/B test. So the next time around, when we say, sorry, you need to do this again, we tend not to get the same level of enthusiasm.

You can take a lesson from biology here in the autonomic nervous system: there are lots of functions your body just takes care of without you thinking about them. You don't have to think about digesting food, and you don't have to think about breathing, or you would die when you fell asleep. Likewise, you want to set up an environment where as many of these best practices as possible become subconscious - not something people have to spend a lot of time thinking about. The way we've done that at Netflix is by building out a cycle of continuous learning and automation. Typically the learning comes from some kind of incident: we have an outage, we get people on a call, we hopefully alleviate customer pain, we do an incident review to make sure we understand what happened, and then we immediately do some kind of remediation. Then we do some analysis: is this a new pattern? Is there a best practice we can derive from it? Is it a recurring issue where a solution would be very high impact? Then you automate that wherever possible and drive adoption to make sure it gets integrated. This is how knowledge becomes code and gets built directly into your microservice architecture.

Over the years we've accumulated a set of these best practices that we call "production ready". It's a checklist and a program within Netflix, and virtually every item has some kind of automation behind it and a continuous improvement model: having a great alerting strategy, using auto scaling, using Chaos Monkey to test your stateless services, doing red-black pushes so you can roll back quickly, and - one of the really important ones - staging your deployments so that you don't push bad code to all regions simultaneously. All of these are automated.

Now, polyglot and containers. This has come about really just in the last few years, and it's an intentional form of variance: people consciously introducing new technologies into the microservice architecture. When I first started managing operations engineering about three years ago, we came up with the construct of the paved road: a set of best-of-breed technologies that work best for Netflix, with automation and integration baked in, so that our developers could be as agile as possible - if they got on the paved road, they were going to have a really efficient experience. We focused on Java and what I'll call bare-bones EC2, a bit of an oxymoron, but basically using EC2 directly as opposed to containers. While we were building that out, and feeling very proud of ourselves for getting it working well, our internal customers - our engineering customers - were going off-road and building their own paths. It started innocuously enough with Python for operational work, which made perfect sense. We had some back-office applications written in Ruby. Then things got interesting when our web team said, we're going to abandon the JVM and rewrite the web application in Node.js. And as we added Docker, things became very challenging. The reasons for doing this were logical - it made a lot of sense to embrace these technologies - but things got real when we started putting them into the critical path for our customers.

And it actually makes a lot of sense to do so; let me tell you why. The API gateway has a capability to host Groovy scripts that act as endpoints for the UI teams, and they can version every one of those scripts, so as they make changes they can deploy a change into production and have devices out in the field sync up with the endpoint running within the API gateway. But this is another example of the monolith: lots of code running in process, with a lot of variety, written by people with different understandings of how that service works. We've had situations where endpoints got deleted, or where a script went rogue, generated too many versions of something, and ate up all the memory available on the API service. Again, a monolithic pattern to be avoided. The logical solution is to push those endpoints out of the API service - in this case the plan is to move them into small Node.js applications running in Docker containers, which then call back into the API service. Now we've got our separation of concerns again, and we can isolate any breakage introduced by those node applications.

This doesn't come without a cost - in fact there's a rather large cost to these kinds of changes, so it's very important to be thoughtful about it. The UI teams that were using the Groovy scripts were used to a very efficient development model: they didn't have to spend time managing infrastructure, they wrote scripts, checked them in, and they were done. Replicating that with a Node.js-and-Docker methodology takes a substantial amount of additional work. The insight and triage capabilities are different: if you're running in a container and asking how much CPU or memory is being consumed, you have to treat that differently, with different tooling, and instrument those applications in different ways. We had a base AMI that was pretty generic and used across all of our applications; now that's being fragmented and specialized. Node management is huge: there is no technology out there today that we can use out of the box to manage these applications the way we want to in the cloud, so an entirely new tier called Titus is being built to do the workload management, the equivalent of auto scaling, node replacement, and all of that - Netflix is making a fairly large investment in that area. And then there's all the work we did over the years in the JVM with our platform code, making people efficient by providing a bunch of services; now we have decisions to make. Do we duplicate them? Do we not provide them, and let the teams running Node write their own direct REST calls and manage all of that themselves? That's being discussed, and there's a certain amount of compromise: some of the platform functionality is going to be written natively in Node, for example. And any time you introduce a new technology into production - we saw this with the move to the cloud, and every time we've done a major re-architecture - you're going to break things, and they will break in interesting and new ways you haven't yet encountered. There's a learning curve before you become good at it.

So rather than one paved road, we now have a proliferation of paved roads, and that's a real challenge for centralized teams - which are finite - trying to support the rest of the engineering organization. We had a big debate about this a few months ago, and the stance we landed on was that the most important thing is to raise awareness of cost, so that when we make these architectural decisions people are well informed and can make good choices. We're going to constrain the amount of support and still focus primarily on the JVM, but obviously this new use case of Node and Docker is critical and there's a lot of energy going into supporting it. Then, logically, we have to prioritize by impact, since there's a finite number of people who can work on these things, and where possible seek reusable solutions. Delivery is relatively generic, so you can probably support a wide variety of languages and platforms with one delivery system; client libraries that are relatively simple can potentially be auto-generated, so you can create a Ruby version, a Python version, and a Java version. We're seeking those kinds of solutions. Again, there's no one cut-and-dried right way to do this; hopefully it's good food for thought if you're dealing with these kinds of situations.

Let's talk about the last element: change. What we do know is that when we are in the office, making changes in production, we break things. This is outages by day of week: lo and behold, on the weekends things tend to break less. And here's a really interesting one, by time of day: nine o'clock in the morning - boom - time to push changes, time to break Netflix. So we know that happens, and the fundamental question is how you achieve velocity with confidence: how do I move as fast as possible without worrying about breaking things all the time? The way we addressed that was by creating a new delivery platform, Spinnaker, which replaced Asgard, our workhorse for many years. Spinnaker is a global cloud management platform but also an automated delivery system, and here's what's really critical: Spinnaker was designed so that best practices can be integrated directly into the delivery pipeline - lessons learned become automated components in the path to production. In the pipeline shown here you can see two things we value highly: automated canary analysis, where you put a trickle of live production traffic onto a new version of the code and determine whether the new code is as good as or better than the old code; and staged deployments, where you deploy one region at a time so that if something breaks you can shift traffic to other regions. You can see the list of other functions that are integrated, and long term, the production-ready checklist we talked about earlier is fodder for a whole variety of things that should be integrated into the delivery pipeline. I'm cheating a little here because of time constraints, but I did a talk last year at re:Invent that may be of interest if you want to dig into how these functions integrate deeply with each other - how production readiness, performance and reliability, and chaos engineering integrate with Spinnaker, continuous delivery, and the monitoring systems - so I'd encourage you to check it out.
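Spinnaker's actual automated canary analysis is statistical and driven by many metrics; the sketch below is only a toy version of the underlying idea - compare a key metric between the baseline and the canary cluster and fail the deployment if the canary looks meaningfully worse. The metric, threshold, and numbers are invented.

```java
import java.util.List;

// Toy canary check: compare average error rate between baseline and canary.
public final class ToyCanaryCheck {

    // Fail the canary if its error rate is more than 20% worse than baseline.
    private static final double TOLERANCE = 1.20;

    public static boolean canaryLooksHealthy(List<Double> baselineErrorRates,
                                             List<Double> canaryErrorRates) {
        double baseline = average(baselineErrorRates);
        double canary = average(canaryErrorRates);
        return canary <= baseline * TOLERANCE;
    }

    private static double average(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        boolean ok = canaryLooksHealthy(List.of(0.010, 0.012, 0.011),
                                        List.of(0.011, 0.013, 0.012));
        System.out.println(ok ? "Promote the canary" : "Roll back");
    }
}
```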
Now I'm going to close with a short story about organization and architecture. In the early days there was a team called electronic delivery - the first version of streaming was actually called electronic delivery; we didn't even have the term streaming back then. Originally we were going to download, and there was a hard drive in some kind of device. The very first version of the Netflix Ready Device Platform looked something like this: it had fundamental capabilities like networking, platform functionality around security, activation, and playback, and then a user interface. The user interface was relatively simple at the time - it used something called a queue reader: you'd go to the website to add something to your queue, then go to the device and see if it showed up. What was also nice is that this was developed under one organization, electronic delivery, so the client team and the server team were one organization with a tight, very collaborative working relationship. The design they - we - developed was XML-based payloads, custom response codes within those XML responses, and versioned firmware releases that went out over long cycles.

In parallel, the Netflix API was created for the DVD business to try to stimulate external applications that would drive traffic back to Netflix - let a thousand flowers bloom. We hoped it would be wildly successful; it really wasn't, and it didn't generate a huge amount of value for Netflix. However, the Netflix API was well poised to help with our UI innovation. It contained content metadata - all the data about what movies are available - and could generate lists, and it had a generalized REST API: a JSON-based schema, HTTP response codes - starting to feel like a more modern architecture - and an OAuth security model, because that's what was required at the time for external apps. That evolved over time, but what matters here is that from a device perspective we now had fragmentation across these two tiers. We had two edge services functioning in very, very different ways: one was REST-based, JSON, OAuth; the other was RPC, XML, with a custom security mechanism for handling tokens. And there was essentially a firewall between the two teams. In fact, because the API originally wasn't scaled as well as NCCP, there was a lot of frustration between the teams - every time the API went down, my team got called - so there was friction there; we really wanted them to get that up and running. But this split - two edge services, two protocols, two schemas, two security models - meant that, god forbid, if you were a client developer who had to span both worlds to get work done, you were switching between completely different contexts. And we had examples where we wanted to do things like return limited-duration licenses along with the lists of movies coming back for the user interface, so that when I clicked through and hit play it was instantaneous instead of requiring another round-trip call for DRM.

Because of this, I had a conversation with one of the most senior engineers at Netflix, a gentleman named Peter Stout, and I asked him: what's the right long-term architecture - can we do an exercise and figure this out? And of course the very first question he had for me, seconds later, was: do you care about the organizational implications? What happens if we have to integrate these things? What happens if it breaks the way we do things? This is very relevant to something called Conway's law - I'm hearing some laughter, so whoever laughed first, tell me what Conway's law is. All right, good.

Here's the more detailed explanation: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." Very abstract. I like this one a little better: "Any piece of software reflects the organizational structure that produced it." And here's my favorite: "If you have four teams working on a compiler, you'll end up with a four-pass compiler." That's what we had, and that's where we were: the tail wagging the dog. This was not solutions first; it was organization first, and it was driving the architecture we had. Going back to the illustration from before: we had our gateway, we had NCCP handling legacy devices plus playback support, we had the API - it was just a mess. The architecture we ended up developing was something we called Blade Runner, because we're talking about edge services. The capabilities of NCCP were decomposed and integrated directly into the Zuul proxy layer and the API gateway, and the appropriate pieces were pushed out into new, smaller microservices supporting more fundamental capabilities like security and playback features like subtitles and related data. The lesson learned: this gave us greater capability and greater agility long term. By unifying these things and thinking about the clients and their experience, we produced something much more powerful, and we ended up refactoring the organization in response - my whole team got folded under the Netflix API team, which is when I moved over to operations engineering, and that was the right thing to do for the business. Lessons learned: solutions first, team second, and be willing to make those organizational changes.

I'm going to briefly recap - I have zero minutes, so I'll go over by just a couple so we can wrap up cleanly. Microservice architectures are complex and organic, and it's best to think about them that way; their health depends on discipline and on injecting chaos into the environment on a regular basis. For dependencies, use circuit breakers and fallbacks and apply chaos; have simple clients, eventual consistency, and a multi-region failover strategy. For scale, embrace auto scaling - please, it's so simple and such a great benefit - reduce single points of failure, partition your workloads, have a failure-driven design like the embedded device token, do request-level caching, and again apply chaos under load to make sure that what you think is true is actually true. For variance, engineer your operations as much as possible to make best practices automatic, understand and socialize the cost of variance, and prioritize support by impact if you have a centralized organization - and most organizations do - so you stay as efficient as possible. For change, you want automated delivery, and you want to integrate your best practices into it on a regular basis; and again, solutions first, team second. There are a lot of technologies supporting these strategies that Netflix has open sourced - if you're not familiar with them, check out Netflix OSS, and also check out the Netflix Tech Blog, where there are regular announcements about how things are done at Netflix at scale and about new open source tools like Vizceral, which is the tool that generated the visuals we've been looking at throughout this talk.

I think we're out of time at this point - do we have time for questions, or should I just wrap up? All right, cool. Well, I pushed the limit. Thank you very much, thank you everybody. [Applause]
Info
Channel: InfoQ
Views: 1,782,257
Rating: 4.9376888 out of 5
Keywords: Netflix, Microservices, QCon, InfoQ, distributed systems
Id: CZ3wIuvmHeM
Length: 53min 13sec (3193 seconds)
Published: Wed Feb 22 2017