[MUSIC PLAYING] SETH VARGO: Today
I'm going to talk to you about DevOps versus SRE. This is a hot and
contentious topic. Which one's better? Which one's worse? Are they the same thing? Are they different, competing
standards or friends? For those of you that don't
know me, my name is Seth. I work on the Developer
Relations team at Google. I've been at Google about
a year and a half now. Prior to that, I
worked at companies like HashiCorp
and Chef Software, which are kind of regarded as
leaders in the DevOps space, similar to other tools like
Puppet or Ansible or Salt. And I've been involved
in this DevOps space for quite some time. I've written popular tools. I've contributed to
popular projects. And I've watched this movement
kind of from the beginning, really since about 2007, 2008. And I'd like to share
my perspective today on what I think DevOps is and
how it relates to SRE or site reliability engineering. So with that, let's go
ahead and get started. In the beginning, there
were two groups of people. On the left, we have developers. Developers are
concerned with agility. They want to build features,
build software, and ship it as quickly as possible to get
it in the hands of customers and users. On the right hand side,
we have operators. Operators are concerned
with stability. It's not broken. Please don't touch it. And this wasn't just
a personality thing. This was driven by the business. As an operator, it was
your responsibility to make sure that the
system never went down. And when the system went down,
you got a phone call or a page in the middle of the night. And it was your
responsibility to fix it. And if you didn't do
it in a timely manner, you might be fired. But then on the flip
side, the developers have all of these roadmaps
and Agile and Jira tickets that they have to complete. And if they don't complete
them and they don't get them into production, they're
not delivering value to the business. And the business is suffering. And then they're at
risk of getting fired. And these are two
competing ideas. As we introduce new features
and new functionality into a system, we also
introduce instability. Every new line of code we
write has the potential to have a bug, has the
potential to have a performance regression. So these are directly
competing ideas. We have developers
who are trying to move quickly and introduce
instability and operators who are trying to slow things
down as much as possible because it's not broken,
please don't touch it. But here's where things get
a little bit interesting. Developers were closer
to the business, both physically
and metaphorically. Developers often sat
in the same building as the directors and the
vice presidents and the CEOs. They were physically located
closer to the business decision makers. Operators, on the other hand,
were often in a data center. Their desk might be hundreds
or even thousands of miles away from the corporate office. They often felt disconnected. They felt like their
ideas weren't being heard. Educationally, these
two groups of people often came from
different backgrounds. Developers traditionally
had a software engineering background, computer
science, information systems, computer engineering, something
along those lines from a two or four-year degree or more. Operators tended to come more
from a practical background. They might have, like,
an associate's degree or a practicum in something
like network engineering. So there's a skills
gap on both sides. Developers are really good
at algorithms and writing software. Operators are really good
at understanding network topologies and failure
scenarios and how redundant does a SATA drive actually
need to be in order to have this many
nines of availability? But because developers were
closer to the business, oftentimes they would
just write their code. And they would throw it over
the wall to the operators. That animation was nice, right? Yeah, I worked
really hard on that. So these developers would
throw their code over to the operators. And they'd be like,
here's my PHP. Please go run it for me. Thanks, have a great day. And these operators,
remember, they don't have a
traditional background in software engineering. They may have never worked
with these languages before. In addition to the
responsibilities of keeping the network
up and running, making sure hard
drives are not filling, making sure that servers
are not faulting, they now have to understand
bugs in the application, because when that
application goes down, they're not paging
the developer. They're getting paged. They're getting woken up
in the middle of the night for a bug in the
application code. But because
developers were closer to the business, when
operators would complain, no one would listen. So the DevOps movement
really in its purest form is about breaking down that
wall between developers and operators. I know, that was
another nice animation. By breaking down the
barriers between developers and operators and aligning
the incentives between them, we can deliver software
better and faster and more safely for our end users. What are some ways that
we can do this, though? Well, one of the easiest
ways that you can break down the barriers between
developers and operators is to put them in the
same physical room. This has been shown
to work successfully time and time again. It's a time tested pattern. Instead of having your operators
go sit in the data center hundreds of miles away,
put them in the same room as the developers. Make them attend stand-ups. Make the developers
listen to the operators. And all of a
sudden, you'll start to have these amazing
conversations where developers will try to write some algorithm or solve some distributed systems problem. And they'll have
this great thing. And it works great in theory. Then you have an operator
that steps in and says, so that's great. But you're telling me that
you need 40 gigabits per second of network traffic. And our data center is running Cat5e cable, which maxes out at, like, 10 gigabits over short distances and 5 in reality. So that thing that you're trying
to do works great on paper and works great on
your local laptop. But it will never
work in production unless we upgrade the
cabling in our data center. And just that small
collaboration saves the company millions or even billions of
dollars investing in a software project that cannot
be successful because the underlying
hardware won't support it. And this is just
one example of how putting developers and operators
physically in the same space helps improve the
software delivery cycle. If you look at the
DevOps manifesto, there's kind of
five key categories that DevOps is broken into. The first is reducing
organizational silos. And I've already talked
a little bit about this. How do we reduce the silos
that exist between developers, the people writing
code, and operators, the people making sure
that code continues to run? What we quickly found,
though, is that that was just the biggest mountain. After you reduce the
silos between developers and operators, who knows what
the next giant mountain is? Security, legal
review, marketing, PR. All of a sudden,
those little things become the new mountains. And this is where the DevOps
movement in its purest form-- if you're a DevOps
purist, you're like, no, it's just developers
and operators. But in order to actually
do this successfully, you'll see that it has to
involve cross-functional teams. The same way that we
want to involve operators in the development
lifecycle, we also want to involve
the security team, because if we involve the
security team early, instead of our security and privacy review
taking six months for someone who has no idea what
our product does, who has no understanding
of our core business goals, we instead have
someone who's regularly attending our stand-ups. They're regularly contributing
code and reviewing code. And then when it comes time
to actually do the review, they may be able to complete
it in weeks instead of months, ultimately delivering software
faster, but also safer, because they're far less likely
to miss something, right? It's much harder to spot a bug when reviewing 100,000 lines of code than when reviewing 1,000, or 100, or one. The second piece of
the DevOps manifesto is that we have to
accept failure as normal. If you recall earlier, I talked
about fire, being fired a lot. Developers are worried
about being fired if they don't ship features. Operators are worried about
being fired if they don't deliver 100% availability. This is not OK. This doesn't create
a culture in which people, humans, can thrive. We have to accept
failure as normal. Any system that humans build
is inherently unreliable. And in fact, I would
challenge anyone to find a system in nature
that is 100% reliable. Most systems fail
given large scale. So given that anecdote
or that lemma, we have to accept
failure as normal. It has to be built into
the core of our business. We can't just fire people every
time the system goes down. Instead, we need to
plan for it in advance. So as a concrete
example, let's say we're doing a
database migration. We're about to roll out
a database migration. Before we do that,
let's plan for failure. Let's accept that failure
is going to happen. So before we actually roll
out that database migration, we're going to plan a rollback. We're going to write a script
that rolls back our deployment and rolls back the
changes to the data model. Well, why would we invest
that effort in advance? It might just work. And you're right. It may totally work and
that was wasted effort. But the problem is if it
fails, if that deployment or that database
migration fails. Now your phones are ringing. Your pagers are going off. Social media is blaring. Your boss is yelling at you. Your boss's boss
is yelling at you. The site is down. You're losing money. And you're trying
to build a plan. It's not the best
time to build a plan. The best time to
build a plan is when there's not a lot of pressure
and you can think clearly. So by accepting
failure as normal, we understand that bad
things are going to happen. There's going to be bad deploys. There's going to be bad
database migrations. How do we recover from them? We need to think about that before we deploy them.
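To make that concrete, here's a minimal sketch of what "plan the rollback in advance" can look like. The table name, the column name, and the get_connection() helper are all made up for illustration; the point is that the down path is written and reviewed before the up path ever runs in production:

```python
# Sketch only: a migration that ships with its rollback, written in advance.
# The table/column names and get_connection() are hypothetical.
import sqlite3
import sys


def get_connection():
    # Stand-in for however your application obtains a database handle.
    return sqlite3.connect("app.db")


def migrate_up(conn):
    """Apply the change: add a column for the new feature."""
    conn.execute("ALTER TABLE users ADD COLUMN preferred_locale TEXT")
    conn.commit()


def migrate_down(conn):
    """The pre-planned rollback, reviewed before the deploy ever happens."""
    # DROP COLUMN needs SQLite 3.35+; other engines have their own equivalents.
    conn.execute("ALTER TABLE users DROP COLUMN preferred_locale")
    conn.commit()


if __name__ == "__main__":
    conn = get_connection()
    if len(sys.argv) > 1 and sys.argv[1] == "down":
        migrate_down(conn)  # the calm, pre-written path for a very loud night
    else:
        migrate_up(conn)
```

If the migration works, fine, the down path was cheap insurance. If it doesn't, nobody is writing rollback SQL at 2:00 AM with the site down.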
On a similar note, the next piece of the DevOps movement is this idea of implementing gradual change. If you work in a waterfall
software development methodology, you may deploy
once a year, twice per year. And the problem
is if you're doing that, you're deploying hundreds
of thousands or even millions of lines of code at one time. And the chance that there are
no bugs in those million lines of code is effectively zero. It is very, very small
that there are zero bugs in a million lines of code. I don't care how good of a
software engineer you are or how rigorous
your discipline is. There's going to be a bug. So you deploy the software. And some users
start complaining. Well, now you only have to
search through a million lines of code to find the bug. It's not that long. It might take another
year, versus deploying small incremental changes. If we deploy, say, 10 or
100 lines of code at a time, if all of a sudden our
monitoring starts failing, our users are yelling
at us on social media, our boss is telling us
that something is broken, or our internal users are
saying, hey, this is slow now, we know exactly where
to look for the problem. The smaller that
change, the easier it is for us to identify the
problem and the faster it is for us to fix that
bug or roll back the change. Rolling back a
million lines of code is a lot harder than
rolling back 10. Now on the flip side-- and this is my
soapbox for a moment-- I don't like when people
measure DevOps success in deploys per day. I think deploys per day
is actually a bad metric for success in this industry. You'll see it a lot where people
are like, oh, we deploy 6,428 times per day. And I'm like, great, I can
add a comma to some Java 6,428 times per day. That's not very exciting to me. But the frequency
at which you deploy relative to your business
and relative to your industry is a signal. It does tell you how well you're
practicing these methodologies. If once per week is
standard for your industry-- maybe you're in something
like fintech or banking where there's
regulatory requirements and that's the industry
standard, that's fine. You don't have to catch
up to some startup that's deploying 100 times a day. What's important is that you're
implementing gradual change with respect to your
business and your industry. End soapbox. The next piece of
the DevOps movement is this idea of leveraging
tooling and automation. And this is where you'll
often see people confuse tools like Chef, Puppet,
Ansible, Salt, Terraform as DevOps tools. And that's because those
tools largely guided the DevOps movement. They largely supported
that DevOps movement. And they coincided with
the DevOps movement. So a lot of people are
like, I'm a DevOps engineer. And I'm like, what do you do? And they're like, I
write Terraform configs. I'm like, OK, maybe not
the appropriate job title. That's fine. But leverage tooling
and automation is a key piece of
the DevOps movement. What we found is that after
breaking down those barriers and after accepting
failure as normal and implementing gradual
change, it turns out there's just a lot of
work, like creating users, installing packages,
building Docker containers, monitoring, logging, alerting. All of those things take time. And if you're a company
that has 100,000 VMs and you need to roll
out an OpenSSL patch, you can't have people sit at
a keyboard, SSH into them, and run yum update. It doesn't scale.
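As a rough sketch of what even naive automation looks like (the hostnames, the key-based SSH access, and the concurrency level are assumptions, not a real tool), a few lines of scripting already beats 100,000 humans at keyboards:

```python
# Sketch only: patch a fleet in parallel instead of hand-typing commands.
# Hostnames, key-based SSH access, and the concurrency level are assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"web-{i:05d}.example.com" for i in range(100_000)]
COMMAND = "sudo yum update -y openssl"


def patch(host):
    try:
        # BatchMode avoids hanging on password prompts; failures are recorded, not fatal.
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, COMMAND],
            capture_output=True,
            timeout=300,
        )
        return host, result.returncode
    except subprocess.TimeoutExpired:
        return host, -1


with ThreadPoolExecutor(max_workers=200) as pool:
    failures = [host for host, code in pool.map(patch, HOSTS) if code != 0]

print(f"{len(failures)} hosts need follow-up")
```

In practice you'd reach for Chef, Puppet, Ansible, Salt, or whatever your fleet already standardizes on; the sketch is just the difference between a repeatable pattern and a person chasing their tail.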
And we quickly learned this. We quickly learned that we have to have tooling, we have to
have automation in order to successfully
implement DevOps, because otherwise you're just
constantly chasing your tail. You're running around putting
out fires when instead we need to be leveraging
automation and tooling to make things repeatable and
to make a pattern out of these. Another key point
is that humans are inherently very bad at doing
the same thing over and over and over again. We get bored. We get distracted--
oh, look, a butterfly-- whereas computers are really
good at doing the same thing over and over and
over and over again. So we should leverage computers
for doing the same thing over and over and over again. The last piece of
the DevOps movement is that we have to
measure everything. It doesn't matter if you
do all of these things. If at the end of the day
your boss comes to you and says how much more
successful is the business and you say people
are happier, that's not a business justification. And I'm not saying
that we should be justifying everything we do
with money or deploys per day. But we have to have
numbers to support the efforts that we're driving. If today you don't
have any metrics and you implement
all of these DevOps things and everyone
feels better and you know that the
business is better, but when your manager
comes to you and says, hey, we've invested $2 million
over the course of a year. We hired a bunch of people. We bought these
software packages. We invested in this tooling. What do you have to show for it? And you don't have a tangible
metric to be able to point to, it's unlikely that the
effort will continue. And on the flip side, if you
don't measure everything, how will you know if
you're actually successful? You have to set clear
metrics for success. And that includes at the
organizational level, but also at the
application level. Now, there's also a difference
between measuring everything, monitoring everything, and
alerting on everything. These are very different things. You may measure CPU usage,
memory usage, available disk space. You may monitor available disk space. But you may only alert if no one can use your checkout page, because if a disk
is full, that sucks. But hopefully you're running
in high availability mode. And some other
service can take over. And a human can clean that out
during normal business hours. All too often I see
people adopting DevOps. And they jump right in. And they start alerting
on every metric. And they don't understand
why their phone battery keeps dying. CPU usage is at 30%. Oh my god! 30%. Why? We need to measure
and alert on things that matter to our users. Is the checkout page working? Can people actually buy things
from our e-commerce site? Can our internal users run their
analytics and their reports? That's what matters. That's what we alert on.
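Here's a minimal sketch of that split (the metric names and the 99% checkout threshold are invented for illustration): record everything, but only page on the symptom users actually feel:

```python
# Sketch only: measure everything, alert on what users feel.
# Metric names and thresholds are invented for illustration.
metrics = {
    "cpu_utilization": 0.30,         # measured and graphed; never pages anyone
    "disk_free_fraction": 0.05,      # monitored; a human cleans it up in business hours
    "checkout_success_rate": 0.982,  # the user-facing indicator
}

CHECKOUT_TARGET = 0.99  # page only when real purchases are failing


def should_page(m):
    return m["checkout_success_rate"] < CHECKOUT_TARGET


if should_page(metrics):
    print("Page on-call: users cannot check out.")
else:
    print("No page. Everything else is just data for debugging later.")
```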
Then we measure the other things so that when we get that alert, we can find the root
cause or the root causes that
triggered that alert. So what you'll notice
here is that these are all abstract ideas, though. I know this is shocking. You had no idea this
slide was coming. These are all abstract ideas. I gave you some examples of
how you might, say, reduce organizational silos-- put people in the same room. I gave you some examples
of measuring versus monitoring versus alerting. But they're abstract
ideas, right? The way your company might
accept failure as normal is very different than the
way another company might accept failure as normal. And what we found is
that a lot of companies don't like abstract ideas. They want more concrete
implementations. So story time. Independently of DevOps,
around the same time, though, Google started
this discipline called SRE, which stands for
Site Reliability Engineering. It was an engineering-based
discipline-- so kind of on the
engineering ladder, the engineering pay
scale, if you will-- of people who build and
maintain reliable systems. Their job was to
keep, at the time, things like search and ads up
and running all of the time, or as much of the
time as possible. So SRE is kind of like a very
prescriptive way to do DevOps. And this is why
you might hear us say this phrase every so
often, which is that class SRE implements DevOps. Let me explain what
I mean by that. So SRE evolved
independently from DevOps. Google was really in its own
little bubble at that time. And they arrived
at SRE as the way to build and maintain and run
production systems at scale. And DevOps was kind of
built by the community. And SRE was kind of built
by Google at the time. Very recently, we've
learned that SRE should be shared with the world. For quite some
time, Google thought that SRE was like our
secret sauce, if you will. It was the thing that
differentiated us from our competitors. It was the thing that
differentiated us from other cloud providers. We shouldn't talk about it. It's a secret. We didn't even post
job postings for it. But very quickly, we learned
that there's a language to SRE. There's nomenclature. There's ways to think
about production systems that as a cloud provider,
when we go to a customer and we start saying these
words and we start talking about these concepts,
they have no idea what we're talking about. So we go to a customer
and we say, hey, this service has three
nines of availability. And you're depending
on it, which means you can never have
more than three nines of availability. And our customers just look at
us like, nines, question mark, like a German shepherd with
its head tilted sideways. And this is when
we decided that SRE doesn't have to be a secret. It doesn't have to
be a secret sauce. And in fact, other
companies can practice SRE. And then Google stepped
outside of its bubble and realized that
there's already a practitioner-based community,
the DevOps community that is trying to do this. They're doing it
successfully in some cases, unsuccessfully in other cases. And that's because
it's not prescriptive. DevOps is like an abstract class
or an interface in programming. It says here are the
things you should do. Go figure it out, whereas SRE
is a concrete implementation of that class. It says here's how you
reduce organizational silos. Here's how you accept
failure as normal. Here is how you
measure everything. And this is why we say
class SRE implements DevOps. Now just like in
programming, you may have an abstract
class or an implementation of an abstract class that
has other methods that are not in the interface. You may have additional
methods, additional functions that aren't in the interface. And SRE's the same way. There are things in
the SRE discipline that aren't really part
of the DevOps interface. But SRE does satisfy
the DevOps interface. And let me show you how.
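If you want the analogy spelled out in (entirely hypothetical) code, it looks something like this: the DevOps interface declares what has to happen, and SRE is one concrete class that implements it, with a few extra methods of its own:

```python
# Sketch only: the "class SRE implements DevOps" analogy, using Python's abc module.
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The interface: what you should do, not how."""

    @abstractmethod
    def reduce_organizational_silos(self): ...

    @abstractmethod
    def accept_failure_as_normal(self): ...

    @abstractmethod
    def implement_gradual_change(self): ...

    @abstractmethod
    def leverage_tooling_and_automation(self): ...

    @abstractmethod
    def measure_everything(self): ...


class SRE(DevOps):
    """One concrete implementation, plus methods the interface never mentions."""

    def reduce_organizational_silos(self):
        return "share ownership and tooling with developers"

    def accept_failure_as_normal(self):
        return "SLOs, error budgets, blameless postmortems"

    def implement_gradual_change(self):
        return "small releases that are cheap to roll back"

    def leverage_tooling_and_automation(self):
        return "automate this year's job away; cap toil"

    def measure_everything(self):
        return "SLIs, toil, reliability"

    def hand_back_the_pager(self):  # not part of the DevOps interface
        return "an SRE-specific practice"


print(SRE().accept_failure_as_normal())
```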
So if we go back to the previous slide where we kind of talked about
the five key areas of DevOps, how does SRE reduce
organizational silos? Well, the first way
that we do this is we share ownership with developers. So SREs at Google share
ownership with the developers by using the same
shared set of tooling across the organization. So there's a single set of tools
that both software engineers and SREs use for
production systems. And by leveraging the
same set of tools, you have software
engineers contributing. You have site reliability
engineers contributing. And you start to build
high-performance, very reliable tooling that helps people
get their job done. On the other side
of the organization, SRE has very prescriptive
ways for which we determine the availability of a system. And the way that
we do that results in almost forced conversation
and forced collaboration between product teams,
developers, site reliability engineers, and even sales
and post-sales organizations. How does SRE accept
failure as normal? Well, one of the ways in which
we encourage that collaboration I just talked about is
through these things called SLOs, service
level objectives which I'll talk about
more in detail in a bit. But these service level
objectives-- in addition to kind of forcing collaboration
between these different groups in the organization, they also
force us to admit how reliable or how unreliable
our system can be. And by having that
conversation, we immediately admit that our service
is going to have faults. And as a developer
and as a product owner and as a site
reliability engineer, I can determine how
much fault I actually want my product to have. If I'm building a product that
is up against some competition, it needs to hit the
market very quickly, I may accept more risk so
that I can deploy faster, I can deploy riskier
changes, et cetera. If I'm building a product
that is targeting, say, the health care space
or the aviation space, that needs more nines
of availability. That has to work
all of the time. We can't have hours
and hours of downtime. We have to have redundancy
and availability. And that may slow down
our development velocity. But this is a conversation
that occurs between the product owners, the developers, and
the site reliability engineers. Additionally, as
you might expect, the SRE discipline
strongly encourages blameless postmortems, which
are popular among the DevOps community as well. When an outage occurs,
it's nobody's fault. And instead, we try to find
ways in which we can improve the system moving forward. Something that SRE does
that DevOps doesn't do is we then generate metadata
about our postmortems globally across the company. So for example,
at Google we know that the vast
majority of our outages come from bad configuration changes. And that's a publicly
shared statistic. And the reason we know
that is that in addition to doing an isolated postmortem,
postmortem about an incident, we then categorize and catalog
all of that information into a database where we
can run reports and gain statistics about what is
causing these overall outages. And that can help you
build kind of meta-tooling and better improve
the system overall.
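A sketch of what that cataloging could look like (the record fields and the cause categories here are hypothetical): once every postmortem carries a little structured metadata, the company-wide statistics are a one-liner:

```python
# Sketch only: aggregate postmortem metadata to see what actually causes outages.
# The fields and cause categories are hypothetical.
from collections import Counter

postmortems = [
    {"id": "2023-014", "cause": "config_change", "minutes_down": 11},
    {"id": "2023-015", "cause": "bad_binary_rollout", "minutes_down": 4},
    {"id": "2023-016", "cause": "config_change", "minutes_down": 7},
]

by_cause = Counter(p["cause"] for p in postmortems)
for cause, count in by_cause.most_common():
    print(f"{cause}: {count} incidents")
```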
SREs implement gradual change by moving fast to reduce the cost of failure. Again, small
iterative deployments are strongly encouraged
over large, multi-hundred-thousand-line changes. Tooling and automation is really
a key piece of the SRE culture. In fact, SREs have this
thing called toil-- T-O-I-L. I'll talk about
it more in a little bit. But toil is this
idea that you should be spending time on things that bring
long-term value to the system. So you should be
investing your time and investing your resources
in things that bring value to the system in the long run. SSHing into a system
and restarting a service doesn't bring long-term value. It brings immediate-term value. But that's something that we
should fix in the long term. So SREs invest heavily
in tooling and automation in order to automate
this year's job away. That's kind of an unofficial tagline among a lot of SREs at Google: the manual tasks that I did this year, I shouldn't
have to do again next year. I should automate those away. And lastly, we
measure everything. We measure systems
level metrics. But we also measure things
like toil, reliability. And we'll talk more about
reliability in a bit. But hopefully this is
clear as to why we say class SRE implements DevOps. SRE is a very prescriptive,
very concrete way to satisfy the DevOps interface. So given that, let's jump into
a few key areas that I think are important to discuss. The first is SLIs,
SLOs, and SLAs, oh my. An SLI is a service
level indicator, something like request
latency or requests per second or failures per request. They're a point in time
or an aggregate point in time of a particular
metric about a system. Basically an SLI tells you
at any given moment yes or no for a metric in a system. Is it healthy or is it not? And you define that. So for your system,
healthy may be that the ratio of failed requests to total requests is less than 1%. So 99% of your requests
are successful. So at any point in time, if
you imagine a point in time, that's a binary
operation, yes or no. At this moment in time, are
we up or down per that metric? Then an SLO is a binding target
for a collection of those SLIs. So while an SLI is
like a point in time, if you imagine from
calculus, if you were to integrate all of those
points over a time period, like a quarter or
a half or a year, that's where you
get your SLO from. So the SLI measures up or down. And then the SLO says
how much up or down can we have in a particular
time period-- again, like a quarter or a half.
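A tiny numeric sketch of that relationship (the request counts and the 99.9% target are made-up numbers): the SLI is the measured good-to-total ratio, and the SLO is the target that ratio has to meet over the whole window:

```python
# Sketch only: an SLI is the measured ratio; the SLO is the target over a window.
# Request counts and the 99.9% target are made-up numbers.

# (good_requests, total_requests) for each day in the evaluation window
daily_counts = [(99_950, 100_000), (99_990, 100_000), (99_400, 100_000)]

SLO_TARGET = 0.999  # 99.9% of requests succeed over the window

good = sum(g for g, _ in daily_counts)
total = sum(t for _, t in daily_counts)
sli = good / total  # the aggregate indicator over the window

print(f"SLI over window: {sli:.4%}")
print("SLO met" if sli >= SLO_TARGET else "SLO violated -- error budget is spent")
```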
Then there's the one that you probably heard of before, which is an SLA. An SLA is a service
level agreement. This is a business
agreement that happens between a customer
or a consumer and a service provider. The SLA is typically
based on the SLO. Ideally, you want to break
your own internal SLO before you break an external
SLA, because violating an SLA typically means
you have to give money or credits or
reparations in exchange for violating that SLA. So SLIs drive the
SLOs, because remember the SLO is basically an
integral of the SLIs over time. And then those SLOs
inform those SLAs. But to give you a
better understanding of who is involved
in this process, I built this handy slide. So product, SRE, and
software engineering work together to build
those SLIs up or down. Then the SREs and the
product work directly to determine what that looks
like over a period of time. And again, this is a function
of how fast do we want to move, what are we up against in
the market, how much risk are we willing to accept,
what's our target market. And then SLAs are
generally built by the sales teams
and the customers maybe with some
negotiations from product. But they have to be looser-- less strict-- than the SLO, because again, you want your
SLO to break before your SLA breaks. And sometimes SLAs are also used
as part of a sales engagement where you can buy
more availability. So that's not really part
of the SRE conversation other than the SLO informs
kind of the minimum baseline for the SLA. So I often get this
question, which is like, OK, so you have this SLI thing
and then this SLO thing. And then what happens
when you go over? You're like, OK, system
needs to be 99.99% reliable. That gives me, like,
12 and 1/2 minutes of downtime per quarter. What happens when I'm out? Like, can I just keep pushing? Do I fire my developers? Like, what happens next? Well, this is where
error budgets come in. So one of the things
that's important to note is that it is nearly
impossible to find a system that is 100% reliable. And it is often cost
prohibitive to build a system that is 100%
reliable, especially when you're relying on
third party components. Take, for example, this
two-dimensional rendering of a phone on the screen. This phone has 99% reliability with its cellular
network carrier. This is actually pretty
standard in most countries. That's your SLA with your
mobile carrier provider. So your mobile
provider is saying that we will service
99% of your requests. So 1% of your
requests will fail. So let's say you have
some back-end service. And you have an unlimited
amount of money. And you say, we want
100% availability. So in order to do
that, you would have to run your own fiber
connections with redundancy to every cell phone tower in
the country of your choosing in order to be able to deliver
maybe 100% availability. And then you'd need your own
kind of internet backbone to run that. And you would need on-call
people pretty much constantly. Even if you did all of
that and invested millions, if not billions of
dollars, the user would only experience your
app at 99% availability, because they are governed by
the least reliable components of their system. So even if you have 100%
connectivity between your data centers and the cellular
network towers, that user, that end user will only
experience 99% availability, because they are governed by
the least reliable component in the system.
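A quick worked number shows why (the 99.999% figure for your own backend is an assumption; the 99% carrier figure is the one from the example): availabilities of components in series multiply, so the user can never see better than the weakest link:

```python
# Sketch only: availabilities of components in series multiply.
# The backend number is an assumption; the 99% carrier figure is from the example above.
backend = 0.99999  # your service, no matter how much you spend on it
carrier = 0.99     # the user's mobile network

end_to_end = backend * carrier
print(f"User-visible availability: {end_to_end:.3%}")  # ~98.999%, i.e. still about 99%
```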
And this is a key point: optimizing for 100% availability isn't just difficult. Oftentimes it's irresponsible. It is not in your
company's best interest to optimize for
100% availability. Instead, you should be
accepting failure as normal and understanding
how much failure is available for your system. So like I said, that user only
experiences 99% availability. So how do you determine how much
risk your service can tolerate? How do I know how risky
my service can be? Well, there are many
factors to consider, like fault tolerance,
availability, competition in the market, how fast
you're trying to deliver, whether there's a giant
conference that you need to launch at. That was a joke. But your acceptable
risk dictates your SLO. So if you have a product that
has really critical market timing and you know that
you need to deliver features quickly, you may
say, hey, we're only going to offer one
nine, 90% availability, because we need to be
able to move quickly. And we don't want to have to
focus on reliability right now. We want to focus on shipping
new product and new features. But again, if you're in a
different industry like health care or aviation
where reliability is super important, you
may say, like, hey look, we only want to focus on
reliability right now. We need to improve our
nines of availability so that our customers
can trust us. So as long as your
SLOs are met, you can continue pushing new
features and new product. But what happens if
you violate that SLO? What happens if you've
exceeded your error budget, which is the amount of failure
you can have within your SLO? Well then everyone just
plays ping pong, right? That's how this works? No. You can continue deploying. Your developers can
continue building features. But everything has to
focus on reliability. They can't ship new
features until we improve the reliability. So the development
efforts, the focus shifts from building
new features and delivering new features
to improving reliability and improving the
availability of the system until the budget is replenished. So just like a
bank account where you get a paycheck
and the money goes in and then you buy
some Pokemon cards and then the money goes out
and then you get more money, the error budget
works the same way. It's that after we've exhausted
our bank account of error, we can only focus
on reliability. We can only deploy features that
focus on reliability or bug fixes that improve reliability.
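As a back-of-the-envelope sketch (the 99.99% target, the 90-day quarter, and the incident durations are assumptions), the budget really is just arithmetic: the SLO defines how many bad minutes you're allowed, and every incident spends them:

```python
# Sketch only: an error budget is the downtime your SLO allows, minus what you've spent.
# The 99.99% target, 90-day quarter, and incident durations are assumptions.
SLO = 0.9999
QUARTER_MINUTES = 90 * 24 * 60

budget_minutes = (1 - SLO) * QUARTER_MINUTES  # roughly 13 minutes per quarter
incident_minutes = [4.0, 6.5]                 # downtime spent so far this quarter

remaining = budget_minutes - sum(incident_minutes)
print(f"Budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")

if remaining <= 0:
    print("Budget exhausted: ship only reliability work until it replenishes.")
else:
    print("Budget available: keep shipping features.")
```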
So what does that actually look like? Well, here's kind of
a pretty dumb version of an internal graph. So on the top, we
have the health of the system,
which is measuring the number of requests that
are under 300 milliseconds at any point in time. So this is like an SLI. In the middle in the green,
we have the compliance, the up or down. Are the sum of the
requests under 300 milliseconds? So you'll notice that that's
more of a step function, because it's binary. It's yes or no. You're either above
or below the line. And then at the
very bottom, we have the budget of non-compliant
requests in our SLO budget. So how many requests
are remaining? What you'll notice there is
pretty far in the right hand side, there's a drastic
increase in our latency. So you'll notice the-- it's a little bit confusing,
because the chart goes down, which is a decrease in
the number of requests under 300 milliseconds,
which is an increase in the number of requests
over 300 milliseconds. So we see a drastic
increase in latency that obviously
causes our compliance to drop significantly. And then you can
start to see that just like my bank
account, all of that starts to drop in our budget. We start to see a massive drop,
because we have a prolonged violation of our SLO. And it's a steady decline
until we're back in compliance. And then we gradually gain a
little bit more error budget, because time is moving on. And then we start
declining again, because we have a regression. And these types of charts help
inform how much availability your system actually has. And when we're at the
bottom of that budget, we have to focus on reliability. So you might ask yourself, well,
what just prevents developers from saying, well, like,
I know I'm at the budget. But this is an
important feature. Why can't I deploy it? Well, they can. But they might lose
their SRE support. So remember that
SRE is a discipline. It's a separate organization
that partners with the software and product teams. And if the product team
and the software team are not willing to be adequate
partners, the SREs will gladly hand you the pager and
walk away, at which point now developers are not only
responsible for building new features, they're also
responsible for the reliability of the system. And I guarantee
you they'll start improving the reliability of the
system that night at 2:00 AM. Another thing I
want to talk about, because I get a lot of
questions about it, is toil. It's a fun word, toil. It's like foil,
but with a T. Toil is best described
as what it's not. It's one of those weird things. Toil is not email. It's not expense reports. It's not meetings. It's not traveling. These are all things we
call overhead, things that are required to
do your job that pretty much everyone in
the organization needs to accomplish. Toil is actually something
that is manual, repetitive. Most importantly, it's
devoid of long-term value. It's often incredibly tactical
and highly automatable. Classic example of toil is
like SSHing into a system and restarting a service
because it's out of memory or it's spiking CPU
usage or something crazy. That's incredibly tactical. Right now, doing that,
graphs go back down. Everyone's happy. But it is devoid
of long-term value, because that service is likely
going to run out of memory again. And you're going to
have to do it again. And it's going to be repetitive. And it's manual. But it is very easy to automate.
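Just to show how automatable that example is, here's a sketch (the service name, the memory threshold, and the assumption that the box runs systemd with memory accounting enabled are all made up): this only automates the band-aid, it doesn't fix the leak, but it's exactly the kind of thing a computer should be doing instead of a human:

```python
# Sketch only: automate the "SSH in and restart it" band-aid.
# Service name and threshold are hypothetical; assumes systemd with memory accounting on.
import subprocess
import time

SERVICE = "example-api"
MEMORY_LIMIT_MB = 4096


def resident_memory_mb(service):
    # `systemctl show -p MemoryCurrent --value` prints the unit's memory usage in bytes.
    out = subprocess.run(
        ["systemctl", "show", "-p", "MemoryCurrent", "--value", service],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out) / (1024 * 1024)


while True:
    if resident_memory_mb(SERVICE) > MEMORY_LIMIT_MB:
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
    time.sleep(60)
```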
That's a classic example of something that's very toily. And in the SRE discipline,
we measure toil and we talk about
toil because it's a very negative
consequence of the job. If you're constantly
working with toil, this can lead to
career stagnation. No one got promoted
for restarting servers. How many people got
promoted because they restarted a server once? Right, exactly. No one. And at the same time, a little
bit of toil is also good. So there's a careful
balance here. And every year, we kind of send
out a survey to all of the SREs at Google. And most SREs aim somewhere
between 10% to 20% of toil in their job. So you might ask yourself,
where's toil good? It sounds like this is terrible. I should automate everything. Well, if once per
year you have to do some very complex operation,
like some aggregate report that spans multiple systems
and it's very complex and it would take you
20 hours to automate it, but it just takes you 15
minutes to do it manually and you only need to
do it once per year, it is not a good
return on investment for you to spend
time automating that. Instead, you should make sure
that other people on the team know how to do it so that you're
not one person, documented, et cetera. But you shouldn't
waste more time automating something than
it takes to do it over time. It would take you something like 80 years to reclaim that time. It's just not worth it.
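The arithmetic behind that, as a one-liner sketch (numbers from the example above):

```python
# Sketch only: break-even math for automating a once-a-year, 15-minute task.
automation_cost_hours = 20
manual_minutes_per_run = 15
runs_per_year = 1

hours_saved_per_year = manual_minutes_per_run * runs_per_year / 60
breakeven_years = automation_cost_hours / hours_saved_per_year
print(f"Break-even after {breakeven_years:.0f} years")  # 80 years: don't automate this one
```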
Toil is also an excellent way for newcomers, like interns or people new to
the team, to learn the system. There's no-- I mean,
there are better ways. But one of the best ways to
learn a production system is to explore it. And being able to poke
around and understand systems and understand how
they work with one another-- that's very toily. But it's also a great
learning opportunity for newcomers to the team. Another value for toil is that-- how many people have
ever had a bad day? Wow, that's like-- you are
all so positive people. How many people have ever had a
straight six-hour meeting day? I know that you've all had that. Yeah, right? And then at the
end of six hours, you have 475 unread emails. You basically feel like
you've accomplished nothing for the day. And then you get a page
that this server's out of disk space. And you can fix it. Then you can go
home for the day. And you feel like you've done
something with your life. Toil can actually satisfy that. Toil provides that
instant gratification, but in small doses. At a large, grand scale, we want
to eliminate toil as much as possible. We want SREs to be
focused on improving the reliability of the
system and the availability of the system, not
performing toily tasks unless absolutely necessary. So I've given you kind
of a brief overview. You may ask yourself,
where can I learn more? There's this happy link at
Google.com/SRE where you can find all of this information
and a bunch more, including two free books-- that's free e-books. The first is the "Site
Reliability Engineering" book. I like to call this the
theoretical calculus book. And then there's the "Site
Reliability Workbook," which is like the
problem set, if you will. They're both excellent books. The SRE book talks
a little bit more about the theory
and the history. The workbook is a little
bit more practical, a little bit more hands on. And there's an entire section
about how DevOps and SRE relate to one another. Both of these are free. Did I mention that they're free? And you can download
them for free. If you would like
a print version, you can purchase them from
your favorite book vendor. So to kind of conclude
here, is SRE DevOps 2.0? No, it's not trying to be. I also hate when people
put 2.0 on things. Is SRE trying to
overtake DevOps? No, lol. And I learned yesterday
that I'm the only person who pronounces lol. I thought people
pronounced it, lol. Can I adopt both DevOps and SRE? Yes. As I said before, if
you implement DevOps, if you follow the SRE
workbook and SRE book, you will be satisfying
the DevOps interface. On the flip side, if you
satisfy the DevOps interface, you might not be doing SRE. There are a number
of other disciplines, like VictorOps has a
thing called ProdOps. There are other disciplines that
satisfy the DevOps interface, but are not SRE. Is DevOps dead? I get this question a lot. No one really talks
about DevOps anymore. I think that's
because it's not dead. It's just assumed now. It's assumed that
you're following some or all of these principles. Is my talk over? Yes. We have about five
minutes for questions. There are two microphones
in the middle of the room. If anyone has questions,
please line up at the microphone
for the recording. While folks are
lining up, if you have a question that you
don't feel comfortable asking on the microphone
or on the recording, that is my Twitter handle. Feel free to tweet
at me at any time. My direct messages
are also open. So if you have a
question that you want to ask that you don't feel
comfortable asking in public, feel free to send
me a direct message. I will get back to you
as soon as possible. Thank you all so much. [MUSIC PLAYING] [APPLAUSE]