AWS re:Invent 2015: A Day in the Life of a Netflix Engineer (DVO203)

Captions
Good afternoon, and welcome to A Day in the Life of a Netflix Engineer, or, Using 37% of the Internet. If this is not the talk you're here for, now would be the time to go.

A little bit about me: my name is Dave Hahn. I'm a senior lots-of-things engineer on the Critical Operations and Response Engineering team at Netflix. What does "lots of things" mean? I'm involved in our operations, involved in our crisis handling, I do some of our cloud architecture, I work on the performance of our stack, I care a lot about reliability, I help people get insight into their running services, I'm concerned about our network performance, I spend a lot of time engaging with partners like Amazon, and I get to do some hardware and software things as well. Overall, though, my job is to make things better. I'm part of the CORE team, the Critical Operations and Response Engineering team. It's important to have a pronounceable acronym for your team's name; it means you're important. Thank you.

My team is responsible for crisis management: you press that play button and you don't quite get what you expect. We provide a lot of the availability reporting, the numbers that drive the rest of the organization and tell them how they're doing. We work on reliability best practices; we're interested in talking with people about how they make reliable services and run them in the cloud. We're primarily the people responsible for our AWS relationship, and we do a lot of operations education to help people understand how best to operate things in the cloud. The team is made up of SREs, service reliability engineers. It's their job to have an excellent understanding of how the different parts of the Netflix ecosystem fit together and how that ecosystem runs in the cloud. We also have program managers on our critical operations team. We have them there because they focus on communication, follow-up, and making sure some of those risk-mitigation kinds of things get done. So we have technical people focusing on technical things and
communication people focusing on communication things. We also have crisis leaders. Crisis leaders have an important role in that it's their job to understand both the technical aspect of a decision as well as the business impact of a decision, so that when we're in one of those really tough moments where you have to make a call one direction or the other, they understand both of those and can lead the charge forward.

The CORE team has some goals that we find to be important. This is one of the first ones: it is our job to protect the customer experience, as we find our customers enjoy the service most when it's available and operational. OK, good, keep up with me, that was a joke. There are more of them, and they don't get a whole lot better. If any of you have ever monitored social networking like Twitter during that occasional Netflix outage, you'll notice some people believe they're going to die. I wanted to let you know: we've checked, and nobody has actually died. See, I warned you they weren't going to get a lot better.

So, protecting the customer experience. We want to focus on technologies, practices, and behaviors that protect our customer experience: things like graceful degradation, failover, failback, that whole category of things where something is better than nothing. Those of you that have used the service (I assume there are a couple of Netflix users in here; what great, wonderful support, thank you, I had a couple of hands, I appreciate that) have probably seen something like this. This is where you can surf around a bit and see what kind of content is available. We call this the LOLOMO, the list of lists of movies. There you go, there's your secret Netflix word of the day, worth the cost of admission right there. I keep telling you, they're not going to get better. So, the LOLOMO, your list of lists of movies. Now, as you can imagine, there's a service that provides this list of lists of movies for you. We try to customize and
personalize each one of our customers' experiences, so your LOLOMO is unique. Now imagine with me, if you will, that that service isn't having such a great day. I have a couple of options on how I can design for that. Here's one possible experience. Glad you understand. I have, however, put up the argument with the teams that drive the UI that we should replace it with this; that way, if we're having problems, we're at least making healthy life recommendations. And if we go back to this LOLOMO: imagine that that service, again, is not behaving quite correctly. I can give you that can't-connect-to-Netflix message, which I think we can all agree is suboptimal. But could I do something different? Could I maybe give you recommendations for people that I perceive to be like you, or maybe in your geographic area? Or, if nothing else, at least show you something, because something is better than nothing. This pervades the way we talk about architecting software and architecting our footprint in the cloud. We always want to have, when possible, the best possible experience for our customers; however, knowing that things will go wrong, we want to make sure we're still protecting that customer experience.

Here's another example. I tried to use some technical terms here, so I'll walk you through them. US East: a bad thing happened. OK, no questions on that? Good. An automated reaction took over and moved that traffic over to our implementation in US-West until US-
East could stabilize, and then we moved the traffic back. Protecting the customer experience: the number one goal of the team.

The other thing that we do: unique failures. Why not have no failures, right? No hands in the air for that one? OK, I'll keep going. Netflix prizes what we call the velocity of innovation. I want software developers developing new features and getting them out there for our customers to use. I want our user-experience designers trying new things; I want UI designers finding new ways for you to engage with the service. That velocity of innovation is very, very important at Netflix. So we know that things will go wrong, and we are willing to trade some availability to maintain that velocity of innovation. Instead of assuming that failures aren't going to happen, we know they're going to happen, but I want each thing to only happen once. I want each failure to be unique and interesting and new. So how do we do that? We'll talk a little bit more later, but it takes things like good incident review processes: getting everyone together in an environment where you have safe, honest, open feedback, and you can talk about what actually happened and why. This is particularly important for what we call getting to the real root cause.

To give you an example of what I mean: we had a failure a little while ago, a little problem with the service, and we got the people together later on and asked, what happened? What caused that thing to go in that direction? And one of the engineers put his hand up and said, I put the wrong number into a thing and hit a button, and it very quickly did the thing I asked it to do, and our entire service turned off. It would be very easy to say, well, human error, right? How do we fix human error? Please don't do that again. But we took the time to actually dig in. We walked through the tool he was using, understanding what he was trying to figure out, and we took the time to dig in.
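The automated regional reaction described a moment ago (US-East has a bad day, traffic shifts to US-West, then fails back when health recovers) can be sketched roughly like this. This is an illustrative toy, not Netflix's actual failover tooling; the threshold, the health signal, and the function names are all invented:

```python
# Illustrative sketch of an automated regional reaction (not real Netflix
# tooling): shift traffic away from unhealthy regions, and fail back
# automatically once their health signal recovers.

HEALTHY_THRESHOLD = 0.95  # hypothetical minimum success rate to take traffic

def route_traffic(regions, health):
    """Return the share of traffic each region should serve.

    regions: list of region names, e.g. ["us-east-1", "us-west-2"]
    health:  dict mapping region -> recent request success rate (0.0 to 1.0)
    """
    healthy = [r for r in regions if health.get(r, 0.0) >= HEALTHY_THRESHOLD]
    if not healthy:
        # Everything looks unhealthy: degrade rather than serve nothing.
        healthy = regions
    share = 1.0 / len(healthy)
    return {r: (share if r in healthy else 0.0) for r in regions}
```

Because the function is stateless, failback is automatic: once the degraded region's health signal climbs back over the threshold, the next evaluation splits traffic across both regions again, with no one logging in to turn a dial.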
By finding the real root cause, we learned that we had a piece of tooling in the environment that made it too easy for people to do the wrong thing. Had we not done that, had we just said "human error" and moved along, there's a good chance somebody else would have come along, used that tool, and done the wrong thing, and I would not be keeping with my goal of having unique failures. So we spend a lot of time making sure we understand why things fail, what contributed to those failures, and how we make sure they don't happen again, so that we meet our goals of both protecting the customer experience and having unique failures.

Constant improvement, I think, is kind of a natural outcome of that: if you have an environment where you're regularly talking about what happened and how you got there, you're going to continue to improve. But we highlight it because I think sometimes we forget about it. Those of us that live in the operations world, making sure these things work globally all the time, can get into a mode where you're just putting out the next fire, and the next fire, and the next fire. Remembering that part of our job is constant improvement helps make that a bit better, because we know we're moving along. How do we do that? I'm a strong believer that you can't change things you don't measure, so at Netflix we measure everything; we'll talk about that a bit more.

A little bit about Netflix. Netflix is a media entertainment company that has the simple goal to delight our customers and win moments of truth. What is a moment of truth? Well, this is a moment: this is when you, or someone like you, or this honest-to-god really, really real family here, sits down and decides what to do with the time they have for entertainment. And when they pick Netflix, we win. So the goal of the company is to win those moments of truth. How do we do that? The first thing is engaging and compelling
content. It would not matter if I could run a service with 100% perfect availability, the best possible performance I could squeeze out of it, and the smallest possible Amazon bill, if nobody wants to watch what I have. So compelling and engaging entertainment is our first ingredient; having the service be available is still important. Once we have that, we find ways for people to better engage with what we have and find what they're looking for. There's that LOLOMO example again (special Netflix word). We'll try different things. Here's a different view, with a little bit more detail about this particular show, as opposed to the smaller amount of detail before. You'll notice that there are changes there, differences. We have an enormous A/B testing infrastructure. We are constantly testing; we're very data-driven, and we'll try different things, and whichever one works out the best for our customers will win. It's also interesting to note that this means your Netflix experience is probably very unique compared to anyone else's; our customers are typically in about 13 or so tests at any given time.

So once you get good engagement, you start to play with things. Here we're mixing media a little bit. Here's our friend the LOLOMO again; as somebody is scrolling through, we've taken one of the lines and made it a cinematic experience, breaking up that video across those different panes. This is for Beasts of No Nation, a Netflix original film available on October 16th, and this may be how you see it on the surface. So now we have good compelling content, and we have good ways for people to interface with the service. We also try to tweak that a little bit for our audiences. This is one I find particularly interesting. I have kids at home, and this is our kids interface. This came out of a supposition by one of our engineers that kids think differently about entertainment than adults do. Whereas we might think about movie titles or television shows or
seasons or particular episodes, kids have characters that they want to see. So we built an interface specifically so kids could pick a character. It doesn't matter what movie or TV show they're in; they can watch Toothless, or King Julien, or the Penguins of Madagascar, or Peppa Pig, or Barbie, very simply, by interfacing with the service.

So how did we get here? The Netflix cloud journey. Netflix started as a DVD-by-mail company in 1997. About ten years later, in 2007, we started streaming out of the Netflix data centers. In 2008, there was an inconvenient fire in one of the Netflix data centers. Fires, by the way: not good for availability. But it caused us to reconsider what we should be doing. Do we build more data centers? Do we try to get better fire suppression? What kinds of things do we do? And the decision was made that becoming an excellent data center operations company does not necessarily help us meet the goal of winning moments of truth. So we started our move to the cloud. We started that move in 2009; by 2010, we had the first devices streaming from our new Amazon infrastructure in us-east-1. By 2011, we'd stood up our service in eu-west-1 for our EU customers. 2012, December 24th: ELB, as a service, melts down in us-east-1. Anybody remember that one? A few hands, yeah. I was on call; that one's burned into my brain. It again caused us to change the way we think, and we decided we needed a better multi-regional strategy. So by 2013 we were also operating in us-west-2, and we could run customers out of either region. By 2015, our cloud migration was complete. For those of you good at math: that's a long time.

So what is the Netflix architecture on top of Amazon Web Services that makes us able to send all this out to our customers? The first bit is Open Connect. A few years ago, Netflix started building our own purpose-driven, single-purpose CDN. All of your video bits stream to you from an Open Connect appliance installed in one of our hundreds of peering locations around the
world. We're in lots of IXs, and better yet, we're even embedded within ISP networks, so those video bits get to you even faster. Everything else that is Netflix, the metadata systems, customer information, compute, algorithms, front ends, all of those kinds of things, runs off of our Amazon Web Services infrastructure. So what does that look like? We have a global deployment. We currently operate in three regions, and as of recently, we can serve any customer anywhere in the world from any one of those regions. So it doesn't matter if there's a problem in one region; I can handle you from anywhere.

The Netflix architecture itself: Netflix is a service-oriented architecture composed of loosely coupled elements that have bounded contexts. Anybody recognize that as an Adrian Cockcroft definition? OK, I'll move on. There are some important bits there. Loosely coupled: the services are independent; they can be coded, tested, upgraded, and managed completely independently. And because they have bounded contexts, those services are completely self-contained. You may recognize this as a microservices architecture; that's the way Netflix is put together. We have hundreds of microservices: everything from one service that handles video metadata, to another that's about customer information, to another that makes that LOLOMO stuff we talked about. We're currently at a little over 700 microservices that make up the Netflix service as it is.

It looks like this. Those of you taking notes, I'll give you a moment. This is not particularly helpful, and it's missing a few things. Not only is it confusing; Netflix is an ever-changing environment. That advantage of microservices, which allows a service team to make changes and deploy to the production environment whenever they see fit, means that this is a moving target. So we've developed a few internal tools to help us understand our architecture, because as soon as you were to, say, write something down
in documentation, it's going to suffer from bit rot that instant and be incorrect, because three other service teams have already pushed out some new changes. So, for instance, here's an internal tool we have called Salt. Salt works off of the actual calls made from one service to another, so I don't have to try to understand what the dependencies are for a particular service; I can see them. It's also self-documenting: since it's built off of live calls that are actually made, it's always accurate. So, for instance, you'll see there's that LOLOMO thing again. There really is a service that makes the LOLOMO, and there are all the things it has to talk to in order to give you that list of lists of movies. We can even drill in a little bit further, and it gets a little more legible, but you get the idea: we have a self-documenting system.

The Netflix ecosystem itself, as we discussed, is made up of hundreds of microservices. There are thousands of daily production changes. Anybody just have a little shiver when I said that? We run tens of thousands of instances, and we'll cycle out anywhere from 15 to 20 percent of those instances on a normal day. We have hundreds of thousands of customer interactions per minute, we have millions of customers, we have billions of metrics, and as of last quarter, we provided over 10 billion hours of entertainment to our customers. We have tens of operations people. Here's the other one that's kind of fun: we also have no NOC, and we also don't have anything cleverly renamed that is a NOC but that we don't call a NOC.

How do we pull this off? Netflix has a DevOps culture. Now, normally, any time you say DevOps you have to spend about 20 minutes explaining what you mean by DevOps, right? I've given you a preview of the next 20 minutes. Or, to borrow another phrase: in order to understand DevOps, one must first understand DevOps. In the Netflix DevOps culture, we have a 100% ownership culture. The teams that are responsible for a microservice make the decisions about
what language to write it in, what storage systems they're going to use, their data models, their caching architecture. They code it, they test it, they deploy it, they run it, and they support it; they're also on call 24/7. In order to do that, we don't have the traditional, or maybe-traditional, over-the-fence-to-operations kind of architecture. When software is created at Netflix, there's no software team that creates it and runs it through a release engineer, who then actually, you know, graces it and pushes it out to the production environment, with an operations group that's responsible for operating the thing, particularly when it fails. Oftentimes in those scenarios, being the operations person, you hope there's good and accurate documentation that's up to date. Maybe there's a runbook that tells you what to do, and a lot of times it'll say, go look at this log file. Great: this runs on six hundred instances. Which one do I pick? Let's assume I pick the right one. I get to the log file, and invariably there will be one of those error entries that says, well, this shouldn't happen. If you recall my goals from earlier on, protecting the customer experience: that does not protect the customer experience. Why would I take someone whose expertise is in operations and try to make them, you know, the software engineer who happened to write this piece of software and understands what it's supposed to do, when I actually have the software developers who wrote this piece of software and understand what it's supposed to do? So we engage with the service teams that own that software when there are problems, so I have the world experts on how the thing is supposed to operate right there with me, protecting my customer experience.

What else do we do? We talked a little bit about incident reviews earlier, with the velocity of innovation. I bring it up again because it's extremely, extremely important that you have good, healthy incident reviews, or post-mortems, or
whatever phrase we're using in the DevOps world this week to say: we're going to get everybody together, talk about what happened, and figure out why. Otherwise these other things start to fall apart. Having software engineers on call 24/7 doesn't work out well if all we're doing is pushing a page to them and leaving them by themselves. Healthy incident reviews are very, very important. Honest and open feedback, again, is part of that. I bring it up because things will go wrong. Some new piece of hotness will go out there and break all the old stuff; without fail, that will go wrong. Somebody will try something new, they'll make a bad judgment call, and it will go wrong. But if you don't have a place where they can talk about that, review it, and figure it out, we're not going to have unique failures, and we won't be protecting our customer experience. This is a bit about the Netflix culture as well. We have the Netflix culture deck; perhaps some of you have heard about that. It's a slide presentation that talks about the ingredients that make up the Netflix culture and what we think is important. You can find it at jobs.netflix.com if you want to peruse a few of the other things that are important in our culture.

The second thing on how we do this with a DevOps culture: we want to make sure we have easy ownership. What do I mean by that? I've now taken a group of software engineers that are good at writing a software service and said, oh, by the way, I also need you to be a successful release engineer, I need you to be a good operations engineer, and you really need to be a really good service reliability engineer, so, you know, just don't screw that up, OK? That's not reasonable. We want them to own it, and we want them to be successful owning it, so we try to make it as easy as possible for them to be successful owning their service. We've created a set of tools that our software engineers use that we think make ownership easier. So, for instance, one category: service discovery.
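The register-and-lookup idea behind service discovery can be sketched as a toy registry. To be clear, this is invented for illustration: the real Eureka API, data model, and protocol differ, and the class and method names here are made up:

```python
# A toy service registry in the spirit of the talk's description: instances
# register themselves, and callers ask "who can I talk to, and how?"
# (The real Eureka is far richer: heartbeats, leases, replication, etc.)

class Registry:
    def __init__(self):
        self._services = {}  # service name -> list of instance records

    def register(self, service, host, port, protocol="http"):
        """An instance announces itself under a service name."""
        record = {"host": host, "port": port, "protocol": protocol}
        self._services.setdefault(service, []).append(record)

    def lookup(self, service):
        """Answer back: these instances would love to talk to you,
        and here's how you do it (host, port, protocol)."""
        return list(self._services.get(service, []))
```

The point of the pattern is that a caller makes one lookup call instead of hard-coding hosts and ports, so when the owning team changes its deployment, nothing breaks downstream.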
If we go back to that microservices picture again, the really clear blue one: assume that you're a service owner and you've created this new service, and you need to go get membership information, or LOLOMO information, or you need to publish a metric, or you need to look something up from data storage. What instances do that? How do I talk to them? How do I make sure that when that service team changes something, my stuff isn't going to break? So we created service discovery software called Eureka. Eureka maintains that information: not only which instances do what, but how to talk to them, what ports to talk to them on, what protocol to talk to them on. So now all you have to do is make a single call and say, I would like to push a metric, and Eureka answers back: these instances would love to talk to you about that, and here's how you do it. We have another piece of software called Edda. There are times when you need information about your objects inside of AWS, and having 800 engineers constantly calling an EC2 API gets you throttled. If you've ever run into that: it's happened to us a few times. So we created Edda as kind of a cache, so that all of our objects are kept there, along with their history. So we also get the benefit of history, as opposed to just current information. That's one example of easy ownership: we've made service discovery easy.

Next, solid communication. We have microservices talking to each other across a network that none of us have seen, through routers we haven't configured, potentially through peering connections we know nothing about. Occasionally, things are going to go wrong. So now, in order to meet my goal of protecting my customer experience, I also tell my software developers: I'd really like it if you were a great network engineer as well. Again, probably not going to make me successful. So we have another set of tools that they can use. We have a tool called Ribbon, and Ribbon handles all of that communication in
between these different services: how do I talk to them, how long should I try to talk to them, how do I do my exponential backoff, all those details that are important but vary from service to service. Ribbon handles that for all of my software developers. We have another tool called Hystrix. Hystrix helps to isolate network faults and protect the application. Here's kind of a visualization of what happens. If we start over there at the left-hand side, during that green area we have an excellent set of communications from one service calling another one. At some point, some of the instances of the service being called start to become unhealthy. Well, the calling service is notified via Hystrix to go into a fallback mode and start providing some other experience to my customers. You see it gets worse; the blue almost covers everything; that entire service is almost gone. However, after a little while, the service recovers, and we go green again. Now, the entire time, my customers were not impacted at all. So I'm protecting the customer experience, and I'm also making ownership easy, because this was all automated just by using Hystrix. Nobody got woken up, nobody had to log on to anything and, you know, turn a dial or flip a bit or do something different. This all happened, both the failure and the recovery, in an automated fashion. So now I've removed, to a degree, the pain and difficulty of operating in a large network environment for my software developers; the right thing happens easily.

Continuous deployment: this is tough, and it's tough to do well. I joke that the goal of my software developers is to put the happy thing in the cloudy thing; they really like it when I describe their job that way. All of us who've done deployments in Amazon understand there are actually a lot of pieces there that are both important and complicated: we're picking instances and scaling rules, potentially attaching ELBs and security groups. There are lots of different pieces.
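Stepping back to the Hystrix discussion for a moment: the fail-and-recover behavior in that visualization is the circuit breaker pattern, which can be sketched in a few lines. This is only the idea, not Hystrix itself (the real library is Java, with thread pools, timeouts, and a half-open probing state that this toy omits); the names and the failure threshold are invented:

```python
# Circuit breaker sketch, in the spirit of Hystrix: after enough consecutive
# failures, calls go straight to a fallback (something is better than
# nothing); a success closes the circuit again. A real breaker would also
# "half-open" after a timeout to probe for recovery; omitted for brevity.

class CircuitBreaker:
    def __init__(self, call, fallback, max_failures=3):
        self.call = call              # the risky remote call
        self.fallback = fallback      # the degraded-but-safe experience
        self.max_failures = max_failures
        self.failures = 0

    def invoke(self, *args):
        if self.failures >= self.max_failures:
            # Circuit open: don't even attempt the unhealthy dependency.
            return self.fallback(*args)
        try:
            result = self.call(*args)
            self.failures = 0         # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)
```

In LOLOMO terms, the fallback might return an unpersonalized list of lists instead of an error page; the caller's customers never see the dependency's bad day.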
There is a lot we need to do around the code to make this a healthy, running service, and I want to pull that away, abstract it away, remove it as much as I can from my software developers' requirements. Those of you that have used Asgard, one of our open source projects: we've sunset Asgard, and this is a view of its replacement, Spinnaker. Spinnaker is a continuous deployment and integration tool. One of the primary goals of Spinnaker is that we should be able to look at this screen and quickly understand a lot about the running application, with no training. We can see we're looking at an API proxy, big font up there in the left-hand corner. We can see it's made up of a good number of instances; a few of those aren't healthy, and a few are in a little bit of an odd state. I see an icon telling me there are ELBs around. I can see how many instances there should be, and I can see what the launch configuration was. You can see all that too, and we didn't have to do any training on it. So we've made it easy, once that software is up and running, for them to get important information about how their software is running.

How about getting there? We use deployment pipelines in Spinnaker. We want deployment pipelines to do a couple of things. We want them to be easy, so a deployment pipeline can be kicked off by something like a commit to a repository. We also want to make sure it meets those other goals, like protecting the customer experience. You can see in some of the older deployments down there toward the bottom of the screen that a set of code failed some tests. It could have failed smoke tests, failed compatibility tests, failed squeeze tests, failed performance tests; and if anyone's heard about our automated canary analysis, that's making sure your new code is as good as your old code. All of those things are evaluated before that code gets out into the environment, and if it fails, Spinnaker stops it from getting out there. If it's successful, Spinnaker handles all the rest of it: spinning up new instances and whatever other associations need to be made.
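The gating behavior just described can be sketched as a pipeline of stages, each of which must pass before the next runs. This is a minimal sketch of the idea, not Spinnaker's actual pipeline model; the stage names and the build dictionary are invented for illustration:

```python
# Deployment pipeline sketch: each stage gates the next, and any failure
# stops the code from reaching production (the Spinnaker-described behavior).

def run_pipeline(build, stages):
    """build:  dict of facts about the candidate build (invented fields).
    stages: ordered list of (name, check) pairs; check(build) -> bool.
    Returns ("deployed", None) or ("stopped", failing_stage_name)."""
    for name, check in stages:
        if not check(build):
            return ("stopped", name)   # halt: bad code never ships
    return ("deployed", None)

# Example stages, including a toy canary gate: new code must be at least
# as good as the old code before it is allowed out.
stages = [
    ("smoke tests", lambda b: b["smoke_ok"]),
    ("canary analysis", lambda b: b["canary_error_rate"] <= b["baseline_error_rate"]),
]
```

A commit kicks the pipeline off, and a human only gets involved when a stage reports "stopped", which is exactly the easy-ownership goal.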
Everything else happens automatically, making ownership and deployment easy.

Data persistence: this is another one that's hard. There are different data storage engines, and their data models, and TTLs for things, and caching systems, and a lot of different ways to do things. So we set these up as services for our software developers, so that in addition to running the service they're responsible for, they do not also have to be database administrators, just like they did not have to be network engineers. We have software available that makes data persistence in the cloud easier: things like EVCache and Dynomite, our versions of memcached and Redis that we think run best at scale and cover our replication needs. We also have Astyanax and Dyno, which are client libraries that make accessing cloud-based data very simple. Similar to Hystrix, they take my understanding of that entire data storage system, and who to talk to, and where to go get things, and make it so I don't have to worry about it: I make a simple call, all of the hard operation is handled for me, and my data is handed back to me.

Insight. Insight into the running operations, into the services, into the microservices, is extremely important for meeting those goals. Frankly, insight is also a hard thing to do. There are lots of different places where we're looking for effective insight: it might be metrics on a particular running application, or an instance that's behaving badly, but it turns into a hard problem to make sure you have all the insight you need when you need it. So again, this is something we try to set up and make easy, for easy ownership. So we have a piece of software called Atlas. Atlas is our large, dimensional time-series data storage system for near-real-time operational insight. I had to practice that one a few times; I still think I got it right. Netflix runs a lot of metrics.
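The word "dimensional" in that description is the interesting part: instead of one opaque counter name per thing you measure, each measurement carries tags, and queries can slice by any tag. This toy sketch shows only that idea; Atlas's actual storage, query language, and scale are far beyond it, and every name here is invented:

```python
# Dimensional metrics, sketched: each data point carries tags, and a query
# can filter on any combination of them (the core idea behind Atlas-style
# time-series systems; this toy ignores time, rollups, and scale entirely).

class Metrics:
    def __init__(self):
        self.points = []  # list of (name, tags, value)

    def record(self, name, value, **tags):
        """Publish one measurement, e.g. record("requests", 5, region="us-east-1")."""
        self.points.append((name, tags, value))

    def total(self, name, **where):
        """Sum a metric over all points whose tags match the given filters."""
        return sum(
            v for (n, tags, v) in self.points
            if n == name and all(tags.get(k) == w for k, w in where.items())
        )
```

With tags in place, "requests in us-east-1" and "requests that returned 500s" are the same stored data queried two ways, rather than two separately maintained counters.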
As I mentioned earlier, we're a data-driven company, not only in picking content, providing content, and recommending content, but also in getting insight into our operations. We looked at lots of different options that were available for time-series databases, and we struggled to find something that was both large enough and fast enough for our needs, so we developed a piece of software called Atlas. Atlas currently handles a little under 2.5 billion metrics per minute, every hour of every day, for Netflix. Great, we have metrics; now you have to find 2.5 billion metrics. The manager of the insight team did an experiment where he worked out that if he took retina displays, with one pixel for every metric that we have, and stacked enough 15-inch displays about 90 feet high, he could have one pixel for every metric. So now we have a different problem: the metrics exist, and we need to make finding them and engaging with them easy. This is part of Atlas: it makes it easy, with a couple of clicks, for me to see what metrics are available, what they look like, how they might fit together, and what I can do with them. We then make it easy to turn that into a dashboard where I can relate information. However, again, with 2.5 billion metrics, having people stare at dashboards doesn't work; we don't have a screen that big. So our insight team makes sure we go a step further: how do we garner the correct attention for something that's having a problem, a signal that's demonstrating something interesting, out of those 2.5 billion metrics? So, for instance, here we see the signal as the blue line. We have a predictor of what the blue line should not go below; that's the red line below it, and you'll see it's kind of tracking the signal. And then there are the green bars: the green bars show any time that signal went below the predictor. So we can see what really fell off the cliff; it went below the predictor. So the goal of the insight team is to make publication of complex metrics simple.
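The predictor idea from that chart can be sketched simply: compare a signal to a lower bound derived from its recent history, and flag the windows that fall below it, instead of making humans stare at dashboards. This is a deliberately naive stand-in for whatever Atlas actually uses; the window size and tolerance knob are made up:

```python
# Predictor-based alerting, sketched: flag samples that drop below a lower
# bound predicted from the recent moving average. (A stand-in for the real
# predictor in the chart; real systems use far better forecasting.)

def below_predictor(signal, window=3, tolerance=0.7):
    """Return indices where the value drops below tolerance * recent average.

    signal:    list of numeric samples, oldest first
    window:    how many recent samples feed the predictor (invented knob)
    tolerance: fraction of the recent average used as the lower bound
    """
    alerts = []
    for i in range(window, len(signal)):
        recent = signal[i - window:i]
        predictor = tolerance * (sum(recent) / window)
        if signal[i] < predictor:
            alerts.append(i)
    return alerts
```

Automation built on a check like this is what lets computers, which react faster than we do, raise the alarm out of billions of metrics.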
To make retrieving complex metrics simple. To make visualization and analysis of complex metrics simple. And to make building automation on top of metrics simple, which is what you need to do with 2.5 billion metrics, and to protect your customer experience, because computers react faster than we do. You might have noticed a pattern there in the goals of the insight team.

Metrics only tell you part of the story, though; like I'm saying, we have a lot of these, so how do I turn interesting data into useful information for my software teams? We have a few other tools. This is a tool called Mogul. Mogul helps service owners introspect their particular service and see how it's being impacted by dependencies: dependency network calls, downstream errors, those kinds of things. Again, the tool is designed to be simple, so that anyone can use it to introspect their service and figure out what's going on. This is another piece of software, called Slalom. What's common in the Netflix infrastructure is that as I'm building a microservice, I may pull in a client library that allows me to easily access another microservice, and I may pull in two or three of those. All of a sudden, I've created dependencies on other services; I may not even be aware that I'm dependent on them, or of how much I'm dependent on them. Slalom allows service owners to easily see that dependency chain, and gives them an idea, based on relative size, of how dependent they are. This is Vector. Vector allows us to easily introspect the behavior of a specific instance: high-resolution, highly granular data about how a particular piece of software is behaving on a particular instance, to get an idea of how it's using memory or how the CPU is behaving.

Another important thing in how we answered this "how" question is what we call cloud thinking. For those of you that were born in the cloud and molded by the cloud, this next little bit will feel very natural. Those of you that have physical data centers
and are accustomed to managing physical equipment, this part is important: you cannot import all of your data center or physical hardware thinking into your cloud model. And by that I don't mean you just tweak it a little bit; you don't change it a little, you don't dust a little off the top, you have to change the way you're thinking completely. I'll give you an example. I've done lots of years in the data center, and I've made the cloud transition myself, and I have the gray hair, or remaining gray hair, to prove it. I told you the jokes weren't getting better. In the data center we tried to avoid failure by identifying single points of failure and engineering around them. We'd have two power connections going to a particular machine; those connections would go to different power buses; those power buses would go to UPSes; and the UPSes would be backed up by generators. If it was really important information, we might have another layer of generators, and we might even try to feed those from two different points on the power grid. We did the same thing with our network connectivity: we'd have more than one physical connection out to different switches, to different network paths, so that a failure in a certain place wouldn't cut us off. We put disks in servers, lots of disks, and we put RAID controllers in there to build logical volumes out of how those disks worked together, and multiple RAID controllers in case one of those failed, so that we weren't going to have a failure. We tried to avoid failure by getting around those single points of failure. Interestingly enough, and the reason I chose this picture: we always had the lights. We had drive lights, drive error lights, network connectivity lights, network activity lights, power indicator lights, RAID lights; lights on lights on lights. You got accustomed to looking down that row, seeing all those lights, recognizing patterns, maybe a color that's off, and using that to get an idea of where your failure might be coming at you. In the cloud, you never get to see the lights. I call this having verbs, not nouns. As part of that cloud transition, you no longer have load balancers; you have the advantage of load balancing. I no longer have networks and switches and routers; I have a fabric that delivers my packets for me. It's an important way to start to think, because it will change the metrics you look at, it will change the way you architect applications, and it will change all the assumptions you've made about how you operate. Once you start to think that way, you're no longer thinking about a data-center- or network-centric implementation; you're thinking about simply a business implementation on top of an infrastructure that you don't have to think about. One of the last ones on "how": remove surprises. Sure, everybody's going, wouldn't that be nice; I would like zero surprises, please. So how do we do that? There are certain guarantees, or promises, that the cloud makes for you: your instances will die. One of my teammates has a quote he likes quite a bit: instances are cattle, they're not pets; you don't get to name them. There are lots of reasons those instances might be going away, and some of them are very good reasons. You may be auto-scaling; I mentioned earlier we autoscale those tens of thousands and thousands and thousands of instances, and some of them come, some of them go. You may get one of those occasional emails from Amazon that says, hey, that piece of hardware your instance is running on? It's not so good. Or there's that rare occasion where human error may get rid of an instance for you. The important thing to adjust your thinking about is that the impact is the same: that instance is going away. It's been one of the stated requirements at Netflix that the loss of a single instance should not affect your running service. I would go as far as to say to you that if the
loss of a single instance impacts your service, you're doing the cloud wrong, because you're guaranteed those instances are going to go away. One may live for minutes, it may live for hours, it may live for months, it may live for years, but those instances are going to go away. So how do we adjust our thinking around that? Your favorite instance dies; how do we prevent that from being a problem? Netflix talks a lot about stateless applications. As an instance comes up and becomes part of your application or part of your service, there should be nothing special about it; as it leaves, there should be nothing special about it. There should not be that one special instance in a cluster that has the one file, the key, the piece of information that, if it goes away, brings the whole thing down. But you are going to have to store some data, so we talk about high data spread and redundancy. We use Cassandra for storing a lot of our data. I mentioned earlier that we are in three Amazon regions, and we always run in three availability zones. For our Cassandra rings, that means I'm going to have one instance in each availability zone that has my row of data, and I'm going to have that across each of my regions, in a three-by-three fashion, so I have nine copies of my data. There's my redundancy: it means I'm going to have to lose a lot of instances before I lose any data. High data spread and redundancy, so your applications are truly stateless. Production failure injection: this is a fun one, at least for me. How many of you have seen this motley crew before? I saw some hands; thank you, those of you that really like them. We do have stickers at the booth; we've learned it's very important we bring stickers with us to these things. The Simian Army does a few very specific things for Netflix. For instance, there in the dead center on the bottom is Chaos Monkey, the short one double-fisting the guns. Chaos Monkey does one very simple thing, and I'm going to teach you how to make your own.
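A minimal sketch of that one very simple thing, with the termination call injected so the snippet stays self-contained. This is illustrative only; the real Chaos Monkey talks to the AWS APIs and respects schedules and opt-outs:

```python
import random

def chaos_monkey(instances, terminate):
    """Pick one random instance from the list and kill it, twice,
    just to be sure it is really gone."""
    victim = random.choice(instances)
    terminate(victim)
    terminate(victim)  # shoot it in the head twice
    return victim

# Stand-in for a real terminate call (e.g. an EC2 client): just record it.
killed = []
victim = chaos_monkey(["i-aaa111", "i-bbb222", "i-ccc333"], killed.append)
print(victim in killed)  # -> True
```

The point is not the five lines of code; it is what running them continuously in production does for the organization.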
Chaos Monkey, in about five lines of insert-your-favorite-programming-language-here: pick a random list of instances; shoot them in the head, twice. That's all Chaos Monkey does functionally, but let's understand what Chaos Monkey does for your organization. I said earlier the loss of a single instance should not affect your running service; Chaos Monkey guarantees that for Netflix. We know those instances are going to go away, so every hour of every day, in the production environment, Chaos Monkey is killing instances for us. You know, I saw a few people shiver when I said "production environment." It's been over two years since the activity of Chaos Monkey has caused any kind of problem our customers have noticed. We are not impacted by the loss of a single instance. Now, we wanted to take that further. You'll notice he has a bit of a big brother there: Chaos Gorilla, the guy with the bazooka. We decided we should be able to lose an availability zone and not have it impact our services, so we made sure our software is architected that way, and we started running Chaos Gorilla and knocking out a zone. Then it gets even bigger. The big ghostly guy there in the back, that's Chaos Kong. You notice the pattern; you may kind of see where I'm progressing to. Chaos Kong kills a region, about every four weeks or so. We kill a region. I'll give you a moment to let that sink in. What does that look like? This is a real-time traffic visualization tool we have called Flux. You'll see we're talking to three regions around the world. US-West-2 is going to start to become unhealthy; it gets a little unhealthy, and then it gets a little bit more unhealthy. Remember, I want to protect my customer experience, so I start redirecting that traffic from US-West-2 to US-
East-1. We do that at a proxy layer we have called Zuul that sits right behind our ELBs, so at this point there's no activity in US-West-2 except redirection activity. Then I update my DNS records so that US-West-2 is no longer a valid DNS record for anything in Netflix, and I'll see that traffic drain away completely. Chaos Kong is the tool we use to do this. US-West-2 will start to become healthy again, and you'll see the process reverse. The important thing is that we do this at least once a month, and that our customers don't notice. Occasionally things haven't gone well, as some of you might have noticed, but the vast majority of the time that we run this exercise, our customers don't notice. You see we've gone back to a steady state. This allows us to have problems in Amazon that are regionally based and still protect my customer experience, whether that was, remember, that ELB thing from 2012, or the recent DynamoDB challenge that happened in US-East, which some of you may know about: we used this tool, the Chaos Kong tool, visualized with Flux, to protect our customer experience. A few other cloud guarantees to take seriously. You're going to share resources with other people, right? This is a virtualized environment; I hope it's not a surprise if I tell you that some of your neighbors don't always behave in the best fashion. However, if we take that seriously, and we think about some of those tools we talked about earlier, like Ribbon and Hystrix, that help us isolate unhealthy portions of our system, it doesn't matter. You don't have to aggressively worry about a situation where your co-tenant is somebody who isn't behaving well: that instance will be pulled away, nobody else will talk to it, and it will not affect your service. The architecture will change out from underneath you. I think I'm probably not alone if I say I've poked around the Amazon infrastructure before, trying to figure out where my stuff lives and what I'm next to. Has anybody else ever done that? A few hands; so some of you
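The evacuation sequence described here, redirecting traffic at the proxy layer and then pulling the region out of DNS, can be sketched as a toy simulation. Zuul and Chaos Kong are real Netflix tools, but this code is an invented illustration, not theirs:

```python
def evacuate(region, dns_records, traffic, failover_region):
    """Drain a region: redirect its traffic to the failover region
    (the Zuul-style proxy step), then remove it from DNS entirely."""
    traffic[failover_region] += traffic.pop(region)
    dns_records.discard(region)  # no longer a valid record for anything
    return traffic

dns = {"us-east-1", "us-west-2", "eu-west-1"}
traffic = {"us-east-1": 40, "us-west-2": 35, "eu-west-1": 25}
evacuate("us-west-2", dns, traffic, "us-east-1")
print(traffic)             # -> {'us-east-1': 75, 'eu-west-1': 25}
print("us-west-2" in dns)  # -> False
```

Doing the proxy redirect before the DNS change matters: DNS caches mean clients keep arriving at the old region for a while, and the proxy keeps them served in the meantime.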
have tried that. The architecture will change out from under you for lots of good reasons: network upgrades, droplet changes, all of those kinds of things. Stop thinking about where you are in the infrastructure. Stop trying to take that data center thinking and map it onto your cloud solution. Write your applications in such a way that it doesn't matter where you are in the architecture. And a reminder: you're never going to see the lights. So with all of that kind of stuff taking care of failure for me, "what is it you'd say you do here?" "Well, Bob..." I'm so glad I didn't have to explain that reference; I've given that one before and people go, "Who's Bob?" Netflix it, it's a movie. I do participate in some of our crisis handling, although honestly that's less than 10 percent of my time. Remember, we don't have that many operations engineers; that CORE team is relatively small. We have 800 engineers or so at Netflix. Less than 10% of those people are in a group we call operations engineering, and less than 10% of the people in that group bear a crisis response, operations engineering, or SRE title. So, for those of you doing math, that means less than 1% of the Netflix engineering organization has anything to do with operations or crisis handling directly. That's because of the other things we've talked about. We do do a bit of crisis handling. I spend a lot of time on engagement: I work with Amazon to talk about what Netflix is doing and what we're thinking about, and I engage with other teams to learn about what they're doing. We have this big, changing ecosystem, and I need to keep up to date on it. I do get to make some things; I think all engineers are mostly tinkerers at the core. So here are some of the things that I've gotten to make. Yes, they're all Ethernet cables; yes, they're all sharks; there's a theme. So: Jaws. We've had challenges in the past where, oftentimes, our customer service centers, where we
have these large labs full of devices (Netflix is available on over a thousand different devices), are the ones who can reproduce a problem. And what does the engineer need? The engineer needs a traffic capture. So we tell a customer service person: what I'd like you to do is go find a laptop, three extra cables, and the one hub still in existence in Northern California; hook it up to that device; then I want you to start Wireshark, and I want you to fill out the filtering appropriately so we have the right information; then I'd like you to reproduce the problem; and then I'd like you to email it to me. Thank you. Not feeling a whole lot of success with that plan. So we built a little hardware device that makes it easy for anybody to replace an Ethernet cable with this little device and get a traffic capture. The Hammerhead devices: these are network quality test devices. We send these out to our network partners around the world; they install them in the last mile of their network, and they start testing the Netflix experience from inside that network. That's really important: it means that with this tool in place, we can iron out a lot of those bugs before we have customer number one coming off that network experiencing Netflix. Shark Tank is the system that gathers all the information from the Hammerheads together and makes sense out of it. Kaiju: this is our most recent one; we even have a few of our Kaiju aficionados in the audience today. We do a lot of testing of things at Netflix, and being able to create certain network situations is an important part of that testing. So if we have somebody testing a new piece of software on an Xbox or a PlayStation or a Roku, or insert your favorite device here, there are times when they want to be able to create a network situation: things like, I really need the traffic to pop out in the UK, I need to be on a five-meg DSL line, but I need three percent packet loss so I can run this test. It's been possible in the past, but you had to chain together all of these different pieces of hardware and hope you didn't get it wrong.
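As a hedged illustration of what asking for such a network condition as a service might look like, here is a sketch; the function and field names are invented for this example and are not Netflix's actual Kaiju API:

```python
def network_profile(egress_country, bandwidth_kbps, packet_loss_pct):
    """Describe a simulated network condition, e.g. traffic popping
    out in the UK on a 5 Mb DSL line with 3% packet loss."""
    if not 0 <= packet_loss_pct <= 100:
        raise ValueError("packet loss must be a percentage")
    return {
        "egress_country": egress_country,
        "bandwidth_kbps": bandwidth_kbps,
        "packet_loss_pct": packet_loss_pct,
    }

# The tester's request from the talk: UK egress, 5 Mb DSL, 3% loss.
req = network_profile("UK", 5000, 3)
print(req["packet_loss_pct"])  # -> 3
```

The value of the service model is that this whole description becomes one request, instead of a rack of hand-wired shaping hardware.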
Kaiju is a system we developed where people can now access this as a service: they make an API call, their device gets all of this network configuration, and they can do whatever they want. So, those are a few of the things that I've been able to make. We spend a lot of time in CORE providing education to other people around the company. Again, I said we don't have lots of operations engineers, but having some good operational thinking that you can apply, some things that you can understand, is extremely important for me to provide to them, to help them protect the customer experience. So we spend a lot of time on education. As to what I don't do: as stated, I'm, you know, one of the guys with the pager, but I don't spend time tailing log files, I don't spend time reading other people's code, and I don't spend time looking at runbooks. But the biggest thing that I don't do, and this is my favorite, is that I never feel like I'm by myself. That's important in a healthy DevOps environment. I've had times in the past when I've been the lone-wolf guy: it doesn't matter whose code it is, go fix it. I don't have to deal with that in the environment that I have. In our healthy DevOps environment, I know I have lots of people there who are experts in their particular systems, just waiting to help me out. What I do do: it's my job to make things better, and hopefully I can make a few things better for you. I mentioned a few pieces of software as we were going through the tools that Netflix uses: we talked about Eureka and Ribbon and Hystrix and Edda and Spinnaker, and some of the things under Spinnaker like Aminator; some of you may be familiar with Staash, Dyno, Dynomite, Servo, Atlas, Vector, all of those kinds of things. These are all available for you to use at the Netflix OSS GitHub site, netflix.github.io. Those projects, along with quite a few other things we've built that help us with operations or orchestration or data persistence, that we have found to
be beneficial to us, we like to give away to everybody else, because if it's helped us, there's no reason it can't help you. Again, if you're curious about more of the Netflix culture, and maybe those jobs that might be available at Netflix, we have a website where you can find that: jobs.netflix.com. We have a few more speakers tomorrow: "Spark and Presto on the Netflix Big Data Platform," and also "Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud." Those are Thursday morning at 11 a.m. Also, all of our speakers, myself included, will be at our booth to spend time answering questions. I'm Dave Hahn from the CORE team at Netflix, and these are the places you can find me on the internet. If you enjoyed the presentation, remember to complete your evaluations; if you did not, those evaluations are not for you.
Info
Channel: Amazon Web Services
Views: 78,086
Keywords: aws-reinvent, reinvent2015, aws, cloud, cloud computing, amazon web services, aws cloud, Media & Entertainment, DevOps, DVO203, Dave Hahn - Netflix, Introductory (200 level), Netflix Engineer, Scaling in the Cloud, Netflix, cloud computing event
Id: -mL3zT1iIKw
Length: 50min 45sec (3045 seconds)
Published: Mon Oct 12 2015