[MUSIC PLAYING] SETH VARGO: Today
I'm going to talk to you about DevOps versus SRE. This is a hot and
contentious topic. Which one's better? Which one's worse? Are they the same thing? Are they different, competing
standards or friends? For those of you that don't
know me, my name is Seth. I work on the Developer
Relations team at Google. I've been at Google about
a year and a half now. Prior to that, I
worked at companies like HashiCorp
and Chef Software, which are kind of regarded as
leaders in the DevOps space, similar to other tools like
Puppet or Ansible or Salt. And I've been involved
in this DevOps space for quite some time. I've written popular tools. I've contributed to
popular projects. And I've watched this movement
kind of from the beginning, really since about 2007, 2008. And I'd like to share
my perspective today on what I think DevOps is and
how it relates to SRE or site reliability engineering. So with that, let's go
ahead and get started. In the beginning, there
were two groups of people. On the left, we have developers. Developers are
concerned with agility. They want to build features,
build software, and ship it as quickly as possible to get
it in the hands of customers and users. On the right hand side,
we have operators. Operators are concerned
with stability. It's not broken. Please don't touch it. And this wasn't just
a personality thing. This was driven by the business. As an operator, it was
your responsibility to make sure that the
system never went down. And when the system went down,
you got a phone call or a page in the middle of the night. And it was your
responsibility to fix it. And if you didn't do
it in a timely manner, you might be fired. But then on the flip
side, the developers have all of these roadmaps
and Agile and Jira tickets that they have to complete. And if they don't complete
them and they don't get them into production, they're
not delivering value to the business. And the business is suffering. And then they're at
risk of getting fired. And these are two
competing ideas. As we introduce new features
and new functionality into a system, we also
introduce instability. Every new line of code we
write has the potential to have a bug, has the
potential to have a performance regression. So these are directly
competing ideas. We have developers
who are trying to move quickly and introduce
instability and operators who are trying to slow things
down as much as possible because it's not broken,
please don't touch it. But here's where things get
a little bit interesting. Developers were closer
to the business, both physically
and metaphorically. Developers often sat
in the same building as the directors and the
vice presidents and the CEOs. They were physically located
closer to the business decision makers. Operators, on the other hand,
were often in a data center. Their desk might be hundreds
or even thousands of miles away from the corporate office. They often felt disconnected. They felt like their
ideas weren't being heard. Educationally, these
two groups of people often came from
different backgrounds. Developers traditionally
had a software engineering background, computer
science, information systems, computer engineering, something
along those lines from a two or four-year degree or more. Operators tended to come more
from a practical background. They might have, like,
an associate's degree or a practicum in something
like network engineering. So there's a skills
gap on both sides. Developers are really good
at algorithms and writing software. Operators are really good
at understanding network topologies and failure
scenarios and how redundant does a SATA drive actually
need to be in order to have this many
nines of availability? But because developers were
closer to the business, oftentimes they would
just write their code. And they would throw it over
the wall to the operators. That animation was nice, right? Yeah, I worked
really hard on that. So these developers would
throw their code over to the operators. And they'd be like,
here's my PHP. Please go run it for me. Thanks, have a great day. And these operators,
remember, they don't have a
traditional background in software engineering. They may have never worked
with these languages before. In addition to the
responsibilities of keeping the network
up and running, making sure hard
drives are not filling, making sure that servers
are not faulting, they now have to understand
bugs in the application, because when that
application goes down, they're not paging
the developer. They're getting paged. They're getting woken up
in the middle of the night for a bug in the
application code. But because
developers were closer to the business, when
operators would complain, no one would listen. So the DevOps movement
really in its purest form is about breaking down that
wall between developers and operators. I know, that was
another nice animation. By breaking down the
barriers between developers and operators and aligning
the incentives between them, we can deliver software
better and faster and more safely for our end users. What are some ways that
we can do this, though? Well, one of the easiest
ways that you can break down the barriers between
developers and operators is to put them in the
same physical room. This has been shown
to work successfully time and time again. It's a time tested pattern. Instead of having your operators
go sit in the data center hundreds of miles away,
put them in the same room as the developers. Make them attend stand-ups. Make the developers
listen to the operators. And all of a
sudden, you'll start to have these amazing
conversations where developers will try to write some algorithm or solve some distributed systems problem. And they'll have
this great thing. And it works great in theory. Then you have an operator
that steps in and says, so that's great. But you're telling me that
you need 40 gigabits per second of network traffic. And our data center is running Cat5e cable, which maxes out at, like, 10 gigabits over short distances and 5 in reality. So that thing that you're trying
to do works great on paper and works great on
your local laptop. But it will never
work in production unless we upgrade the
cabling in our data center. And just that small
collaboration saves the company millions or even billions of
dollars investing in a software project that cannot
be successful because the underlying
hardware won't support it. And this is just
one example of how putting developers and operators
physically in the same space helps improve the
software delivery cycle. If you look at the
DevOps manifesto, there's kind of
five key categories that DevOps is broken into. The first is reducing
organizational silos. And I've already talked
a little bit about this. How do we reduce the silos
that exist between developers, the people writing
code, and operators, the people making sure
that code continues to run? What we quickly found,
though, is that that was just the biggest mountain. After you reduce the
silos between developers and operators, who knows what
the next giant mountain is? Security, legal
review, marketing, PR. All of a sudden,
those little things become the new mountains. And this is where the DevOps
movement in its purest form-- if you're a DevOps
purist, you're like, no, it's just developers
and operators. But in order to actually
do this successfully, you'll see that it has to
involve cross-functional teams. The same way that we
want to involve operators in the development
lifecycle, we also want to involve
the security team, because if we involve the
security team early, instead of our security and privacy review
taking six months for someone who has no idea what
our product does, who has no understanding
of our core business goals, we instead have
someone who's regularly attending our stand-ups. They're regularly contributing
code and reviewing code. And then when it comes time
to actually do the review, they may be able to complete
it in weeks instead of months, ultimately delivering software
faster, but also safer, because they're far less likely
to miss something, right? It's much harder to spot a bug when reviewing 100,000 lines of code than when reviewing 1,000, or 100, or one. The second piece of
the DevOps manifesto is that we have to
accept failure as normal. If you recall earlier, I talked
about fire, being fired a lot. Developers are worried
about being fired if they don't ship features. Operators are worried about
being fired if they don't deliver 100% availability. This is not OK. This doesn't create
a culture in which people, humans, can thrive. We have to accept
failure as normal. Any system that humans build
is inherently unreliable. And in fact, I would
challenge anyone to find a system in nature
that is 100% reliable. Most systems fail
given large scale. So given that anecdote
or that lemma, we have to accept
failure as normal. It has to be built into
the core of our business. We can't just fire people every
time the system goes down. Instead, we need to
plan for it in advance. So as a concrete
example, let's say we're doing a
database migration. We're about to roll out
a database migration. Before we do that,
let's plan for failure. Let's accept that failure
is going to happen. So before we actually roll
out that database migration, we're going to plan a rollback. We're going to write a script
that rolls back our deployment and rolls back the
changes to the data model. Well, why would we invest
that effort in advance? It might just work. And you're right. It may totally work and
that was wasted effort. But the problem is if it
fails, if that deployment or that database
migration fails. Now your phones are ringing. Your pagers are going off. Social media is blaring. Your boss is yelling at you. Your boss's boss
is yelling at you. The site is down. You're losing money. And you're trying
to build a plan. It's not the best
time to build a plan. The best time to
build a plan is when there's not a lot of pressure
and you can think clearly. So by accepting
failure as normal, we understand that bad
things are going to happen. There's going to be bad deploys. There's going to be bad
database migrations. How do we recover from them? We need to think about that before we deploy them.
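To make that concrete, here's a minimal sketch of what "plan the rollback in advance" can look like. The table name, the column name, and the get_connection() helper are all made up for illustration; the point is that the down path is written and reviewed before the up path ever runs in production:

```python
# Sketch only: a migration that ships with its rollback, written in advance.
# The table/column names and get_connection() are hypothetical.
import sqlite3
import sys


def get_connection():
    # Stand-in for however your application obtains a database handle.
    return sqlite3.connect("app.db")


def migrate_up(conn):
    """Apply the change: add a column for the new feature."""
    conn.execute("ALTER TABLE users ADD COLUMN preferred_locale TEXT")
    conn.commit()


def migrate_down(conn):
    """The pre-planned rollback, reviewed before the deploy ever happens."""
    # DROP COLUMN needs SQLite 3.35+; other engines have their own equivalents.
    conn.execute("ALTER TABLE users DROP COLUMN preferred_locale")
    conn.commit()


if __name__ == "__main__":
    conn = get_connection()
    if len(sys.argv) > 1 and sys.argv[1] == "down":
        migrate_down(conn)  # the calm, pre-written path for a very loud night
    else:
        migrate_up(conn)
```

If the migration works, fine, the down path was cheap insurance. If it doesn't, nobody is writing rollback SQL at 2:00 AM with the site down.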
On a similar note, the next piece of the DevOps movement is this idea of implementing gradual change. If you work in a waterfall
software development methodology, you may deploy
once a year, twice per year. And the problem
is if you're doing that, you're deploying hundreds
of thousands or even millions of lines of code at one time. And the chance that there are
no bugs in those million lines of code is effectively zero. It is very, very small
that there are zero bugs in a million lines of code. I don't care how good of a
software engineer you are or how rigorous
your discipline is. There's going to be a bug. So you deploy the software. And some users
start complaining. Well, now you only have to
search through a million lines of code to find the bug. It's not that long. It might take another
year, versus deploying small incremental changes. If we deploy, say, 10 or
100 lines of code at a time, if all of a sudden our
monitoring starts failing, our users are yelling
at us on social media, our boss is telling us
that something is broken, or our internal users are
saying, hey, this is slow now, we know exactly where
to look for the problem. The smaller that
change, the easier it is for us to identify the
problem and the faster it is for us to fix that
bug or roll back the change. Rolling back a
million lines of code is a lot harder than
rolling back 10. Now on the flip side-- and this is my
soapbox for a moment-- I don't like when people
measure DevOps success in deploys per day. I think deploys per day
is actually a bad metric for success in this industry. You'll see it a lot where people
are like, oh, we deploy 6,428 times per day. And I'm like, great, I can
add a comma to some Java 6,428 times per day. That's not very exciting to me. But the frequency
at which you deploy relative to your business
and relative to your industry is a signal. It does tell you how well you're
practicing these methodologies. If once per week is
standard for your industry-- maybe you're in something
like fintech or banking where there's
regulatory requirements and that's the industry
standard, that's fine. You don't have to catch
up to some startup that's deploying 100 times a day. What's important is that you're
implementing gradual change with respect to your
business and your industry. End soapbox. The next piece of
the DevOps movement is this idea of leveraging
tooling and automation. And this is where you'll
often see people confuse tools like Chef, Puppet,
Ansible, Salt, Terraform as DevOps tools. And that's because those
tools largely guided the DevOps movement. They largely supported
that DevOps movement. And they coincided with
the DevOps movement. So a lot of people are
like, I'm a DevOps engineer. And I'm like, what do you do? And they're like, I
write Terraform configs. I'm like, OK, maybe not
the appropriate job title. That's fine. But leverage tooling
and automation is a key piece of
the DevOps movement. What we found is that after
breaking down those barriers and after accepting
failure as normal and implementing gradual
change, it turns out there's just a lot of
work, like creating users, installing packages,
building Docker containers, monitoring, logging, alerting. All of those things take time. And if you're a company
that has 100,000 VMs and you need to roll
out an OpenSSL patch, you can't have people sit at
a keyboard, SSH into them, and run yum update. It doesn't scale.
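As a rough sketch of what even naive automation looks like (the hostnames, the key-based SSH access, and the concurrency level are assumptions, not a real tool), a few lines of scripting already beats 100,000 humans at keyboards:

```python
# Sketch only: patch a fleet in parallel instead of hand-typing commands.
# Hostnames, key-based SSH access, and the concurrency level are assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"web-{i:05d}.example.com" for i in range(100_000)]
COMMAND = "sudo yum update -y openssl"


def patch(host):
    try:
        # BatchMode avoids hanging on password prompts; failures are recorded, not fatal.
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, COMMAND],
            capture_output=True,
            timeout=300,
        )
        return host, result.returncode
    except subprocess.TimeoutExpired:
        return host, -1


with ThreadPoolExecutor(max_workers=200) as pool:
    failures = [host for host, code in pool.map(patch, HOSTS) if code != 0]

print(f"{len(failures)} hosts need follow-up")
```

In practice you'd reach for Chef, Puppet, Ansible, Salt, or whatever your fleet already standardizes on; the sketch is just the difference between a repeatable pattern and a person chasing their tail.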
And we quickly learned this. We quickly learned that we have to have tooling, we have to
have automation in order to successfully
implement DevOps, because otherwise you're just
constantly chasing your tail. You're running around putting
out fires when instead we need to be leveraging
automation and tooling to make things repeatable and
to make a pattern out of these. Another key point
is that humans are inherently very bad at doing
the same thing over and over and over again. We get bored. We get distracted--
oh, look, a butterfly-- whereas computers are really
good at doing the same thing over and over and
over and over again. So we should leverage computers
for doing the same thing over and over and over again. The last piece of
the DevOps movement is that we have to
measure everything. It doesn't matter if you
do all of these things. If at the end of the day
your boss comes to you and says how much more
successful is the business and you say people
are happier, that's not a business justification. And I'm not saying
that we should be justifying everything we do
with money or deploys per day. But we have to have
numbers to support the efforts that we're driving. If today you don't
have any metrics and you implement
all of these DevOps things and everyone
feels better and you know that the
business is better, but when your manager
comes to you and says, hey, we've invested $2 million
over the course of a year. We hired a bunch of people. We bought these
software packages. We invested in this tooling. What do you have to show for it? And you don't have a tangible
metric to be able to point to, it's unlikely that the
effort will continue. And on the flip side, if you
don't measure everything, how will you know if
you're actually successful? You have to set clear
metrics for success. And that includes at the
organizational level, but also at the
application level. Now, there's also a difference
between measuring everything, monitoring everything, and
alerting on everything. These are very different things. You may measure CPU usage,
memory usage, available disk space. You may monitor available disk space. But you may only alert if no one can use your checkout page, because if a disk
is full, that sucks. But hopefully you're running
in high availability mode. And some other
service can take over. And a human can clean that out
during normal business hours. All too often I see
people adopting DevOps. And they jump right in. And they start alerting
on every metric. And they don't understand
why their phone battery keeps dying. CPU usage is at 30%. Oh my god! 30%. Why? We need to measure
and alert on things that matter to our users. Is the checkout page working? Can people actually buy things
from our e-commerce site? Can our internal users run their
analytics and their reports? That's what matters. That's what we alert on.
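Here's a minimal sketch of that split (the metric names and the 99% checkout threshold are invented for illustration): record everything, but only page on the symptom users actually feel:

```python
# Sketch only: measure everything, alert on what users feel.
# Metric names and thresholds are invented for illustration.
metrics = {
    "cpu_utilization": 0.30,         # measured and graphed; never pages anyone
    "disk_free_fraction": 0.05,      # monitored; a human cleans it up in business hours
    "checkout_success_rate": 0.982,  # the user-facing indicator
}

CHECKOUT_TARGET = 0.99  # page only when real purchases are failing


def should_page(m):
    return m["checkout_success_rate"] < CHECKOUT_TARGET


if should_page(metrics):
    print("Page on-call: users cannot check out.")
else:
    print("No page. Everything else is just data for debugging later.")
```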
Then we measure the other things so that when we get that alert, we can find the root
cause or the root causes that
triggered that alert. So what you'll notice
here is that these are all abstract ideas, though. I know this is shocking. You had no idea this
slide was coming. These are all abstract ideas. I gave you some examples of
how you might, say, reduce organizational silos-- put people in the same room. I gave you some examples
of measuring versus monitoring versus alerting. But they're abstract
ideas, right? The way your company might
accept failure as normal is very different than the
way another company might accept failure as normal. And what we found is
that a lot of companies don't like abstract ideas. They want more concrete
implementations. So story time. Independently of DevOps,
around the same time, though, Google started
this discipline called SRE, which stands for
Site Reliability Engineering. It was an engineering-based
discipline-- so kind of on the
engineering ladder, the engineering pay
scale, if you will-- of people who build and
maintain reliable systems. Their job was to
keep, at the time, things like search and ads up
and running all of the time, or as much of the
time as possible. So SRE is kind of like a very
prescriptive way to do DevOps. And this is why
you might hear us say this phrase every so
often, which is that class SRE implements DevOps. Let me explain what
I mean by that. So SRE evolved
independently from DevOps. Google was really in its own
little bubble at that time. And they arrived
at SRE as the way to build and maintain and run
production systems at scale. And DevOps was kind of
built by the community. And SRE was kind of built
by Google at the time. Very recently, we've
learned that SRE should be shared with the world. For quite some
time, Google thought that SRE was like our
secret sauce, if you will. It was the thing that
differentiated us from our competitors. It was the thing that
differentiated us from other cloud providers. We shouldn't talk about it. It's a secret. We didn't even post
job postings for it. But very quickly, we learned
that there's a language to SRE. There's nomenclature. There's ways to think
about production systems that as a cloud provider,
when we go to a customer and we start saying these
words and we start talking about these concepts,
they have no idea what we're talking about. So we go to a customer
and we say, hey, this service has three
nines of availability. And you're depending
on it, which means you can never have
more than three nines of availability. And our customers just look at
us like, nines, question mark, like a German shepherd with
its head tilted sideways. And this is when
we decided that SRE doesn't have to be a secret. It doesn't have to
be a secret sauce. And in fact, other
companies can practice SRE. And then Google stepped
outside of its bubble and realized that
there's already a practitioner-based community,
the DevOps community that is trying to do this. They're doing it
successfully in some cases, unsuccessfully in other cases. And that's because
it's not prescriptive. DevOps is like an abstract class
or an interface in programming. It says here are the
things you should do. Go figure it out, whereas SRE
is a concrete implementation of that class. It says here's how you
reduce organizational silos. Here's how you accept
failure as normal. Here is how you
measure everything. And this is why we say
class SRE implements DevOps. Now just like in
programming, you may have an abstract
class or an implementation of an abstract class that
has other methods that are not in the interface. You may have additional
methods, additional functions that aren't in the interface. And SRE's the same way. There are things in
the SRE discipline that aren't really part
of the DevOps interface. But SRE does satisfy
the DevOps interface. And let me show you how.
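If you want the analogy spelled out in (entirely hypothetical) code, it looks something like this: the DevOps interface declares what has to happen, and SRE is one concrete class that implements it, with a few extra methods of its own:

```python
# Sketch only: the "class SRE implements DevOps" analogy, using Python's abc module.
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The interface: what you should do, not how."""

    @abstractmethod
    def reduce_organizational_silos(self): ...

    @abstractmethod
    def accept_failure_as_normal(self): ...

    @abstractmethod
    def implement_gradual_change(self): ...

    @abstractmethod
    def leverage_tooling_and_automation(self): ...

    @abstractmethod
    def measure_everything(self): ...


class SRE(DevOps):
    """One concrete implementation, plus methods the interface never mentions."""

    def reduce_organizational_silos(self):
        return "share ownership and tooling with developers"

    def accept_failure_as_normal(self):
        return "SLOs, error budgets, blameless postmortems"

    def implement_gradual_change(self):
        return "small releases that are cheap to roll back"

    def leverage_tooling_and_automation(self):
        return "automate this year's job away; cap toil"

    def measure_everything(self):
        return "SLIs, toil, reliability"

    def hand_back_the_pager(self):  # not part of the DevOps interface
        return "an SRE-specific practice"


print(SRE().accept_failure_as_normal())
```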
So if we go back to the previous slide where we kind of talked about
the five key areas of DevOps, how does SRE reduce
organizational silos? Well, the first way
that we do this is we share ownership with developers. So SREs at Google share
ownership with the developers by using the same
shared set of tooling across the organization. So there's a single set of tools
that both software engineers and SREs use for
production systems. And by leveraging the
same set of tools, you have software
engineers contributing. You have site reliability
engineers contributing. And you start to build
high-performance, very reliable tooling that helps people
get their job done. On the other side
of the organization, SRE has very prescriptive
ways for which we determine the availability of a system. And the way that
we do that results in almost forced conversation
and forced collaboration between product teams,
developers, site reliability engineers, and even sales
and post-sales organizations. How does SRE accept
failure as normal? Well, one of the ways in which
we encourage that collaboration I just talked about is
through these things called SLOs, service
level objectives which I'll talk about
more in detail in a bit. But these service level
objectives-- in addition to kind of forcing collaboration
between these different groups in the organization, they also
force us to admit how reliable or how unreliable
our system can be. And by having that
conversation, we immediately admit that our service
is going to have faults. And as a developer
and as a product owner and as a site
reliability engineer, I can determine how
much fault I actually want my product to have. If I'm building a product that
is up against some competition, it needs to hit the
market very quickly, I may accept more risk so
that I can deploy faster, I can deploy riskier
changes, et cetera. If I'm building a product
that is targeting, say, the health care space
or the aviation space, that needs more nines
of availability. That has to work
all of the time. We can't have hours
and hours of downtime. We have to have redundancy
and availability. And that may slow down
our development velocity. But this is a conversation
that occurs between the product owners, the developers, and
the site reliability engineers. Additionally, as
you might expect, the SRE discipline
strongly encourages blameless postmortems, which
are popular among the DevOps community as well. When an outage occurs,
it's nobody's fault. And instead, we try to find
ways in which we can improve the system moving forward. Something that SRE does
that DevOps doesn't do is we then generate metadata
about our postmortems globally across the company. So for example,
at Google we know that the vast
majority of our outages come from bad configuration changes. And that's a publicly
shared statistic. And the reason we know
that is that in addition to doing an isolated postmortem,
postmortem about an incident, we then categorize and catalog
all of that information into a database where we
can run reports and gain statistics about what is
causing these overall outages. And that can help you
build kind of meta-tooling and better improve
the system overall.
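A sketch of what that cataloging could look like (the record fields and the cause categories here are hypothetical): once every postmortem carries a little structured metadata, the company-wide statistics are a one-liner:

```python
# Sketch only: aggregate postmortem metadata to see what actually causes outages.
# The fields and cause categories are hypothetical.
from collections import Counter

postmortems = [
    {"id": "2023-014", "cause": "config_change", "minutes_down": 11},
    {"id": "2023-015", "cause": "bad_binary_rollout", "minutes_down": 4},
    {"id": "2023-016", "cause": "config_change", "minutes_down": 7},
]

by_cause = Counter(p["cause"] for p in postmortems)
for cause, count in by_cause.most_common():
    print(f"{cause}: {count} incidents")
```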
SREs implement gradual change by moving fast to reduce the cost of failure. Again, small
iterative deployments are strongly encouraged
over large, multi-hundred-thousand-line changes. Tooling and automation is really
a key piece of the SRE culture. In fact, SREs have this
thing called toil-- T-O-I-L. I'll talk about
it more in a little bit. But toil is this
idea that you should be spending time on things that bring
long-term value to the system. So you should be
investing your time and investing your resources
in things that bring value to the system in the long run. SSHing into a system
and restarting a service doesn't bring long-term value. It brings immediate-term value. But that's something that we
should fix in the long term. So SREs invest heavily
in tooling and automation in order to automate
this year's job away. That's kind of an unofficial tagline among a lot of SREs at Google: the manual tasks that I did this year, I shouldn't
have to do again next year. I should automate those away. And lastly, we
measure everything. We measure systems
level metrics. But we also measure things
like toil, reliability. And we'll talk more about
reliability in a bit. But hopefully this is
clear as to why we say class SRE implements DevOps. SRE is a very prescriptive,
very concrete way to satisfy the DevOps interface. So given that, let's jump into
a few key areas that I think are important to discuss. The first is SLIs,
SLOs, and SLAs, oh my. An SLI is a service
level indicator, something like request
latency or requests per second or failures per request. They're a point in time
or an aggregate point in time of a particular
metric about a system. Basically an SLI tells you
at any given moment yes or no for a metric in a system. Is it healthy or is it not? And you define that. So for your system,
healthy may be that the ratio of failed requests to total requests is less than 1%. So 99% of your requests
are successful. So at any point in time, if
you imagine a point in time, that's a binary
operation, yes or no. At this moment in time, are
we up or down per that metric? Then an SLO is a binding target
for a collection of those SLIs. So while an SLI is
like a point in time, if you imagine from
calculus, if you were to integrate all of those
points over a time period, like a quarter or
a half or a year, that's where you
get your SLO from. So the SLI measures up or down. And then the SLO says
how much up or down can we have in a particular
time period-- again, like a quarter or a half.
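A tiny numeric sketch of that relationship (the request counts and the 99.9% target are made-up numbers): the SLI is the measured good-to-total ratio, and the SLO is the target that ratio has to meet over the whole window:

```python
# Sketch only: an SLI is the measured ratio; the SLO is the target over a window.
# Request counts and the 99.9% target are made-up numbers.

# (good_requests, total_requests) for each day in the evaluation window
daily_counts = [(99_950, 100_000), (99_990, 100_000), (99_400, 100_000)]

SLO_TARGET = 0.999  # 99.9% of requests succeed over the window

good = sum(g for g, _ in daily_counts)
total = sum(t for _, t in daily_counts)
sli = good / total  # the aggregate indicator over the window

print(f"SLI over window: {sli:.4%}")
print("SLO met" if sli >= SLO_TARGET else "SLO violated -- error budget is spent")
```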
Then there's the one that you probably heard of before, which is an SLA. An SLA is a service
level agreement. This is a business
agreement that happens between a customer
or a consumer and a service provider. The SLA is typically
based on the SLO. Ideally, you want to break
your own internal SLO before you break an external
SLA, because violating an SLA typically means
you have to give money or credits or
reparations in exchange for violating that SLA. So SLIs drive the
SLOs, because remember the SLO is basically an
integral of the SLIs over time. And then those SLOs
inform those SLAs. But to give you a
better understanding of who is involved
in this process, I built this handy slide. So product, SRE, and
software engineering work together to build
those SLIs up or down. Then the SREs and the
product work directly to determine what that looks
like over a period of time. And again, this is a function
of how fast do we want to move, what are we up against in
the market, how much risk are we willing to accept,
what's our target market. And then SLAs are
generally built by the sales teams
and the customers maybe with some
negotiations from product. But they have to be looser-- less strict-- than the SLO, because again, you want your
SLO to break before your SLA breaks. And sometimes SLAs are also used
as part of a sales engagement where you can buy
more availability. So that's not really part
of the SRE conversation other than the SLO informs
kind of the minimum baseline for the SLA. So I often get this
question, which is like, OK, so you have this SLI thing
and then this SLO thing. And then what happens
when you go over? You're like, OK, system
needs to be 99.99% reliable. That gives me, like,
12 and 1/2 minutes of downtime per quarter. What happens when I'm out? Like, can I just keep pushing? Do I fire my developers? Like, what happens next? Well, this is where
error budgets come in. So one of the things
that's important to note is that it is nearly
impossible to find a system that is 100% reliable. And it is often cost
prohibitive to build a system that is 100%
reliable, especially when you're relying on
third party components. Take, for example, this
two-dimensional rendering of a phone on the screen. This phone has 99% reliability with its cellular
network carrier. This is actually pretty
standard in most countries. That's your SLA with your
mobile carrier provider. So your mobile
provider is saying that we will service
99% of your requests. So 1% of your
requests will fail. So let's say you have
some back-end service. And you have an unlimited
amount of money. And you say, we want
100% availability. So in order to do
that, you would have to run your own fiber
connections with redundancy to every cell phone tower in
the country of your choosing in order to be able to deliver
maybe 100% availability. And then you'd need your own
kind of internet backbone to run that. And you would need on-call
people pretty much constantly. Even if you did all of
that and invested millions, if not billions of
dollars, the user would only experience your
app at 99% availability, because they are governed by
the least reliable components of their system. So even if you have 100%
connectivity between your data centers and the cellular
network towers, that user, that end user will only
experience 99% availability, because they are governed by
the least reliable component in the system.
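A quick worked number shows why (the 99.999% figure for your own backend is an assumption; the 99% carrier figure is the one from the example): availabilities of components in series multiply, so the user can never see better than the weakest link:

```python
# Sketch only: availabilities of components in series multiply.
# The backend number is an assumption; the 99% carrier figure is from the example above.
backend = 0.99999  # your service, no matter how much you spend on it
carrier = 0.99     # the user's mobile network

end_to_end = backend * carrier
print(f"User-visible availability: {end_to_end:.3%}")  # ~98.999%, i.e. still about 99%
```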
And this is a key point: optimizing for 100% availability isn't just difficult. Oftentimes it's irresponsible. It is not in your
company's best interest to optimize for
100% availability. Instead, you should be
accepting failure as normal and understanding
how much failure is available for your system. So like I said, that user only
experiences 99% availability. So how do you determine how much
risk your service can tolerate? How do I know how risky
my service can be? Well, there are many
factors to consider, like fault tolerance,
availability, competition in the market, how fast
you're trying to deliver, whether there's a giant
conference that you need to launch at. That was a joke. But your acceptable
risk dictates your SLO. So if you have a product that
has really critical market timing and you know that
you need to deliver features quickly, you may
say, hey, we're only going to offer one
nine, 90% availability, because we need to be
able to move quickly. And we don't want to have to
focus on reliability right now. We want to focus on shipping
new product and new features. But again, if you're in a
different industry like health care or aviation
where reliability is super important, you
may say, like, hey look, we only want to focus on
reliability right now. We need to improve our
nines of availability so that our customers
can trust us. So as long as your
SLOs are met, you can continue pushing new
features and new product. But what happens if
you violate that SLO? What happens if you've
exceeded your error budget, which is the amount of failure
you can have within your SLO? Well then everyone just
plays ping pong, right? That's how this works? No. You can continue deploying. Your developers can
continue building features. But everything has to
focus on reliability. They can't ship new
features until we improve the reliability. So the development
efforts, the focus shifts from building
new features and delivering new features
to improving reliability and improving the
availability of the system until the budget is replenished. So just like a
bank account where you get a paycheck
and the money goes in and then you buy
some Pokemon cards and then the money goes out
and then you get more money, the error budget
works the same way. It's that after we've exhausted
our bank account of error, we can only focus
on reliability. We can only deploy features that
focus on reliability or bug fixes that improve reliability.
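As a back-of-the-envelope sketch (the 99.99% target, the 90-day quarter, and the incident durations are assumptions), the budget really is just arithmetic: the SLO defines how many bad minutes you're allowed, and every incident spends them:

```python
# Sketch only: an error budget is the downtime your SLO allows, minus what you've spent.
# The 99.99% target, 90-day quarter, and incident durations are assumptions.
SLO = 0.9999
QUARTER_MINUTES = 90 * 24 * 60

budget_minutes = (1 - SLO) * QUARTER_MINUTES  # roughly 13 minutes per quarter
incident_minutes = [4.0, 6.5]                 # downtime spent so far this quarter

remaining = budget_minutes - sum(incident_minutes)
print(f"Budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")

if remaining <= 0:
    print("Budget exhausted: ship only reliability work until it replenishes.")
else:
    print("Budget available: keep shipping features.")
```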
So what does that actually look like? Well, here's kind of
a pretty dumb version of an internal graph. So on the top, we
have the health of the system,
which is measuring the number of requests that
are under 300 milliseconds at any point in time. So this is like an SLI. In the middle in the green,
we have the compliance, the up or down. Are the sum of the
requests under 300 milliseconds? So you'll notice that that's
more of a step function, because it's binary. It's yes or no. You're either above
or below the line. And then at the
very bottom, we have the budget of non-compliant
requests in our SLO budget. So how many requests
are remaining? What you'll notice there is
pretty far in the right hand side, there's a drastic
increase in our latency. So you'll notice the-- it's a little bit confusing,
because the chart goes down, which is a decrease in
the number of requests under 300 milliseconds,
which is an increase in the number of requests
over 300 milliseconds. So we see a drastic
increase in latency that obviously
causes our compliance to drop significantly. And then you can
start to see that just like my bank
account, all of that starts to drop in our budget. We start to see a massive drop,
because we have a prolonged violation of our SLO. And it's a steady decline
until we're back in compliance. And then we gradually gain a
little bit more error budget, because time is moving on. And then we start
declining again, because we have a regression. And these types of charts help
inform how much availability your system actually has. And when we're at the
bottom of that budget, we have to focus on reliability. So you might ask yourself, well,
what just prevents developers from saying, well, like,
I know I'm at the budget. But this is an
important feature. Why can't I deploy it? Well, they can. But they might lose
their SRE support. So remember that
SRE is a discipline. It's a separate organization
that partners with the software and product teams. And if the product team
and the software team are not willing to be adequate
partners, the SREs will gladly hand you the pager and
walk away, at which point now developers are not only
responsible for building new features, they're also
responsible for the reliability of the system. And I guarantee
you they'll start improving the reliability of the
system that night at 2:00 AM. Another thing I
want to talk about, because I get a lot of
questions about it, is toil. It's a fun word, toil. It's like foil,
but with a T. Toil is best described
as what it's not. It's one of those weird things. Toil is not email. It's not expense reports. It's not meetings. It's not traveling. These are all things we
call overhead, things that are required to
do your job that pretty much everyone in
the organization needs to accomplish. Toil is actually something
that is manual, repetitive. Most importantly, it's
devoid of long-term value. It's often incredibly tactical
and highly automatable. Classic example of toil is
like SSHing into a system and restarting a service
because it's out of memory or it's spiking CPU
usage or something crazy. That's incredibly tactical. Right now, doing that,
graphs go back down. Everyone's happy. But it is devoid
of long-term value, because that service is likely
going to run out of memory again. And you're going to
have to do it again. And it's going to be repetitive. And it's manual. But it is very easy to automate.
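Just to show how automatable that example is, here's a sketch (the service name, the memory threshold, and the assumption that the box runs systemd with memory accounting enabled are all made up): this only automates the band-aid, it doesn't fix the leak, but it's exactly the kind of thing a computer should be doing instead of a human:

```python
# Sketch only: automate the "SSH in and restart it" band-aid.
# Service name and threshold are hypothetical; assumes systemd with memory accounting on.
import subprocess
import time

SERVICE = "example-api"
MEMORY_LIMIT_MB = 4096


def resident_memory_mb(service):
    # `systemctl show -p MemoryCurrent --value` prints the unit's memory usage in bytes.
    out = subprocess.run(
        ["systemctl", "show", "-p", "MemoryCurrent", "--value", service],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) / (1024 * 1024)


while True:
    if resident_memory_mb(SERVICE) > MEMORY_LIMIT_MB:
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
    time.sleep(60)
```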
That's a classic example of something that's very toily. And in the SRE discipline,
we measure toil and we talk about
toil because it's a very negative
consequence of the job. If you're constantly
working with toil, this can lead to
career stagnation. No one got promoted
for restarting servers. How many people got
promoted because they restarted a server once? Right, exactly. No one. And at the same time, a little
bit of toil is also good. So there's a careful
balance here. And every year, we kind of send
out a survey to all of the SREs at Google. And most SREs aim somewhere
between 10% to 20% of toil in their job. So you might ask yourself,
where's toil good? It sounds like this is terrible. I should automate everything. Well, if once per
year you have to do some very complex operation,
like some aggregate report that spans multiple systems
and it's very complex and it would take you
20 hours to automate it, but it just takes you 15
minutes to do it manually and you only need to
do it once per year, it is not a good
return on investment for you to spend
time automating that. Instead, you should make sure
that other people on the team know how to do it so that you're
not one person, documented, et cetera. But you shouldn't
waste more time automating something than
it takes to do it over time. It would take you something like 80 years to reclaim that time. It's just not worth it.
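The arithmetic behind that, as a one-liner sketch (numbers from the example above):

```python
# Sketch only: break-even math for automating a once-a-year, 15-minute task.
automation_cost_hours = 20
manual_minutes_per_run = 15
runs_per_year = 1

hours_saved_per_year = manual_minutes_per_run * runs_per_year / 60
breakeven_years = automation_cost_hours / hours_saved_per_year
print(f"Break-even after {breakeven_years:.0f} years")  # 80 years: don't automate this one
```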
Toil is also an excellent way for newcomers, like interns or people new to
the team, to learn the system. There's no-- I mean,
there are better ways. But one of the best ways to
learn a production system is to explore it. And being able to poke
around and understand systems and understand how
they work with one another-- that's very toily. But it's also a great
learning opportunity for newcomers to the team. Another value for toil is that-- how many people have
ever had a bad day? Wow, that's like-- you are
all so positive people. How many people have ever had a
straight six-hour meeting day? I know that you've all had that. Yeah, right? And then at the
end of six hours, you have 475 unread emails. You basically feel like
you've accomplished nothing for the day. And then you get a page
that this server's out of disk space. And you can fix it. Then you can go
home for the day. And you feel like you've done
something with your life. Toil can actually satisfy that. Toil provides that
instant gratification, but in small doses. At a large, grand scale, we want
to eliminate toil as much as possible. We want SREs to be
focused on improving the reliability of the
system and the availability of the system, not
performing toily tasks unless absolutely necessary. So I've given you kind
of a brief overview. You may ask yourself,
where can I learn more? There's this happy link at
Google.com/SRE where you can find all of this information
and a bunch more, including two free books-- that's free e-books. The first is the "Site
Reliability Engineering" book. I like to call this the
theoretical calculus book. And then there's the "Site
Reliability Workbook," which is like the
problem set, if you will. They're both excellent books. The SRE book talks
a little bit more about the theory
and the history. The workbook is a little
bit more practical, a little bit more hands on. And there's an entire section
about how DevOps and SRE relate to one another. Both of these are free. Did I mention that they're free? And you can download
them for free. If you would like
a print version, you can purchase them from
your favorite book vendor. So to kind of conclude
here, is SRE DevOps 2.0? No, it's not trying to be. I also hate when people
put 2.0 on things. Is SRE trying to
overtake DevOps? No, lol. And I learned yesterday
that I'm the only person who pronounces lol. I thought people
pronounced it, lol. Can I adopt both DevOps and SRE? Yes. As I said before, if
you implement DevOps, if you follow the SRE
workbook and SRE book, you will be satisfying
the DevOps interface. On the flip side, if you
satisfy the DevOps interface, you might not be doing SRE. There are a number
of other disciplines, like VictorOps has a
thing called ProdOps. There are other disciplines that
satisfy the DevOps interface, but are not SRE. Is DevOps dead? I get this question a lot. No one really talks
about DevOps anymore. I think that's
because it's not dead. It's just assumed now. It's assumed that
you're following some or all of these principles. Is my talk over? Yes. We have about five
minutes for questions. There are two microphones
in the middle of the room. If anyone has questions,
please line up at the microphone
for the recording. While folks are
lining up, if you have a question that you
don't feel comfortable asking on the microphone
or on the recording, that is my Twitter handle. Feel free to tweet
at me at any time. My direct messages
are also open. So if you have a
question that you want to ask that you don't feel
comfortable asking in public, feel free to send
me a direct message. I will get back to you
as soon as possible. Thank you all so much. [MUSIC PLAYING] [APPLAUSE]