SETH VARGO: Hi there. And welcome to our second
video in the series on SRE and DevOps. My name is Seth, and
I'm a developer advocate at Google focused on
infrastructure and operations. LIZ FONG-JONES: And hi. I'm Liz, a site reliability
engineer, or SRE, at Google, where I teach Google
Cloud customers how to build and operate
reliable services. So Seth, have you ever
run into the problem where you're trying to
talk about reliability and availability, and you get
so many different definitions that it makes your head spin? SETH VARGO: It
happens all the time. LIZ FONG-JONES: And
even worse, have you ever run into the situation
where the developers are trying to push out new
features and they're breaking things more and more,
and they just won't listen? SETH VARGO: It's like
you're in my head. Have we worked together before? My team is constantly putting
out fires, many of which end up being bugs in
the developers' code. But when we try to push
back on the product teams to focus on reliability,
they don't agree that reliability is an issue. It's the classic DevOps problem. Do you have any recommendations? LIZ FONG-JONES: Yeah. This is a really common
problem in the relationship between product
developers and operators. And it's a problem that
Google was really worried about in the early
2000s when we were first building out Google web search. And this is when we started
defining the SRE discipline. And it's really been
a work in progress that we've been improving on ever since. SETH VARGO: Wow. Google had these same
problems in the early 2000s? I had no idea. But I still don't understand. How can SRE help solve
this apparently very common problem? LIZ FONG-JONES: So SRE tries
to solve this problem in three different ways. First of all, we try to
define what availability is. Secondly, we try to define
what an appropriate level of availability is. And third, we get
everyone on the same page about what we are
going to do if we fail to live up to those standards. And we try to communicate this
across the whole organization, from product developers to SREs, and from individual contributors all the way up to vice presidents. That way, we have a shared
sense of responsibility for the service and
what we're going to do if we need to slow down. And we do that by defining
service level objectives in collaboration with
the product owners. And by agreeing on these
metrics in advance, we make sure that there's
less of a chance of confusion and conflict in the future. SETH VARGO: OK, so an
SLO is just an agreement among stakeholders about how
reliable a service should be. But shouldn't services just
always be 100% reliable? LIZ FONG-JONES:
So the problem is that the cost and the technical
complexity of making services more reliable gets higher
and higher the closer to 100% you try to get. It winds up being the case that every application has a unique set of requirements that dictate how reliable it has to be before customers no longer notice the difference. And that means that
we can make sure that we have enough room
for error and enough room to roll out features reliably. SETH VARGO: I see. We should probably
do another video where we talk about why
100% availability isn't a real target. OK Liz, I'm ready. I've decided that I want my
service to be 99.9% reliable. So where do I get started? Do I use Vim, Emacs, Nano? What do I do? LIZ FONG-JONES: So
I'm a Nano user. But first, you
really have to define what availability is
in addition to defining how available you want to be. We need to make sure that we
understand what availability means for the service, and that we have clear numerical
indicators for defining that availability. And the way that we
go about doing that is by defining not just service
level objectives, but service level indicators, or SLIs. So SLIs are most often
metrics over time, such as request
latency, the throughput of requests per second in
the case of a batch system, or failures per total
number of requests. They're usually
aggregated over time, and we typically apply a
function like a percentile, such as the 99th percentile or the median. And that way, we can get to
a concrete threshold which we can define to say, is this
single number good or bad? So for instance, a good
service level indicator might be saying, is the 99th
percentile latency of requests received in the
past five minutes less than 300 milliseconds? Or alternatively, another
service level indicator might be, is the ratio of
errors to total requests received in the past five
minutes less than 1%?
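As a minimal sketch, here is how those two example SLIs might be computed over a five-minute window of request records. Only the 300 millisecond and 1% thresholds come from the discussion above; the record fields and helper names are illustrative assumptions, not from any particular monitoring library.

```python
# Sketch: the two example SLIs, computed over a five-minute window of
# request records. Record shape and function names are hypothetical.
import math

def p99_latency_ok(requests, threshold_ms=300):
    """Is the 99th percentile latency below the threshold?"""
    latencies = sorted(r["latency_ms"] for r in requests)
    idx = math.ceil(0.99 * len(latencies)) - 1  # nearest-rank p99
    return latencies[idx] < threshold_ms

def error_ratio_ok(requests, threshold=0.01):
    """Is the ratio of errors to total requests below 1%?"""
    errors = sum(1 for r in requests if r["is_error"])
    return errors / len(requests) < threshold

# A tiny, hypothetical five-minute window of request records.
window = [
    {"latency_ms": 120, "is_error": False},
    {"latency_ms": 95, "is_error": False},
    {"latency_ms": 450, "is_error": True},
]
print(p99_latency_ok(window))  # False: p99 here is 450 ms
print(error_ratio_ok(window))  # False: 1 of 3 requests errored
```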
SETH VARGO: OK. Thank you for explaining that. It's much clearer now. But how does that SLI become an SLO? LIZ FONG-JONES: So if you think
back to your calculus lesson, Seth-- I know this may have
been a while ago. When you have a service
level indicator, it says at any moment in
time whether the service was available or
whether it was down. So what we need to do is
we need to add all that up or integrate it over a
longer period of time-- like a year, in your example
of 99.9% over a year-- to see, is the total
amount of downtime that we've had more or
less than the nine hours that you were worried about?
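A quick sketch of the arithmetic behind that "nine hours," with the count of bad windows invented for illustration:

```python
# Sketch: the downtime a 99.9% SLO allows over a year, plus the
# "integration" step -- adding up the windows where the SLI was bad.
slo = 0.999
hours_per_year = 365 * 24                      # 8,760 hours
allowed_downtime = (1 - slo) * hours_per_year  # 8.76 hours: roughly
                                               # the "nine hours" above

bad_windows = 21                     # hypothetical five-minute windows
downtime = bad_windows * 5 / 60      # that missed the SLI: 1.75 hours
print(downtime <= allowed_downtime)  # True: still within the SLO
```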
SETH VARGO: But you should always beat your SLO, right? LIZ FONG-JONES: No. So the thing is that SLOs are
both upper and lower bounds. So this is for two reasons. First of all, the
fact is that if you try to run your service
much more reliably than it needs to be,
you're slowing down the release of features that
you might want to get out that would make your
customers happier than having an extra
femtosecond of uptime. And then secondly, it's about the expectations you're setting for your users-- that if you suddenly
start breaking a lot more often than they're
used to because you start running exactly at your
SLO rather than doing much better than your
SLO, then your users will be unhappily
surprised if they're trying to build other
services on top of yours. SETH VARGO: OK. So this is all starting to
make a little bit of sense now. But what is an SLA then? There are so many
SL letter things. And I remember at a previous
job, I signed an SLA something. What did I do? LIZ FONG-JONES: So to
spell it out first, an SLA is a service
level agreement. And what it does
is it says, here's what I am going to do if I don't
meet the level of reliability that is expected. It's more of a
commercial agreement that describes what
remediation you're going to take if your
service is out of spec according to the contract. SETH VARGO: I see. So the SLA is like
a business agreement associated with an SLO. So they're exactly
the same, right? LIZ FONG-JONES: Not
quite, because you really want to make your SLA more
lenient than your SLO. So you get early
warning before you have to do things like
field angry phone calls from customers or
have to pay them lots of money for failing to
deliver the services promised. As SREs, we rarely work with SLAs directly. Instead, we focus on meeting
our SLOs with the understanding that sales teams and business
teams will think more about the SLAs they
build on top of our SLOs.
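A sketch of that relationship, where the 99.9%/99.5% split and the measured value are invented for illustration:

```python
# Sketch: an internal SLO kept stricter than the contractual SLA, so
# that missing the SLO is an early warning rather than a payout.
# The 99.9%/99.5% split and the measured value are hypothetical.
SLO = 0.999  # internal objective, agreed with product owners
SLA = 0.995  # external, commercial commitment with penalties attached

measured = 0.9987  # hypothetical availability over the last month

if measured < SLA:
    print("SLA breach: remediation owed (refunds, credits, ...)")
elif measured < SLO:
    print("SLO miss: early warning -- prioritize reliability work")
else:
    print("Within SLO: room in the budget to ship features")
```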
SETH VARGO: I see. So SLAs describe the set of services and availability promises that a provider is
willing to make to a customer, and then the
ramifications associated with failing to deliver
on those promises. Those ramifications might
be things like money back or free credits for failing
to deliver the service availability. LIZ FONG-JONES: Yes,
that's exactly correct. SETH VARGO: So to
summarize, SLIs are service level
indicators: metrics over time that track the health of a service. SLOs are service level objectives, which are agreed-upon bounds
for how often those SLIs must be met. And finally, SLAs are
business level agreements, which define the
service availability for a customer and the
penalties for failing to deliver that availability. LIZ FONG-JONES: Exactly. So SLIs, SLOs, and
SLAs hearken back to the DevOps principle that
measurement is critical, and that the easiest
way to break down the organizational
barriers is to have a common language about what it means to be available. With SLIs, we give a well-defined numerical measurement of what that is. And with SLOs, we
collaborate between the product owners and the SREs
in order to make sure that the service is running
at an appropriate level of reliability for customers. It's a lot clearer
to me now why we say class SRE implements DevOps. SETH VARGO: Awesome. Thanks, Liz. And thank you for watching. Be sure to check out
the description below for more information. And don't forget to
subscribe to the channel and stay tuned for our
next video, where we discuss risk and error budgets.