What is Site Reliability Engineering (SRE)?

Captions
Thank you for joining us today! My name is Bradley Knapp, and I'm one of the product managers here at IBM Cloud, and we've come to answer the question: what is Site Reliability Engineering, or SRE?
SRE is really a new name for what is actually an old discipline. The name has only been around 15, 18 years, but the job itself has been around for a very long time. It's just evolved over time, and now we've given a formal name to the discipline and the job.
So, the question is: what is SRE, what is site reliability engineering? The way that I like to describe it is that it's really the collision of the traditional IT role and DevOps, right? Back in the day, in the traditional IT role, you would think about lots of people sitting in an operations center staring at very large screens, kind of arranged in a semicircle, like a mission center, or a watch center in the military. Well, that world doesn't so much exist anymore, and in the new world, in the DevOps cycle that everyone should be embracing for their software releases, you still have to have reliability. Your developers are still going to engineer the software to be reliable, but when it comes to actually operating it, actually delivering the service that goes out to the end customer, that's really kind of outside of the responsibility of those software developers. That's where SRE comes in.
An SRE is what I like to call a 50/50 role, right? SREs should spend about 50% of their time focusing on solving customer issues. That can be escalations, could be responding to incidents, dealing with an upset customer who needs help on a tactical problem. That's going to be 50, and then the other 50% is maybe the most important part, and that's that every SRE should be actively trying to automate themselves out of a job. They want to automate all of the things. The buzzword for this is reducing toil, right?
Reducing all of the manual work necessary to keep any kind of software environment up and running. This includes the hardware itself, it includes all of the middleware, it includes the software - all of the related services you need to keep these things live.
And so, the question then becomes: all right, well, we're going to automate these things, isn't that putting my job at risk if we get rid of these manual tasks? And the answer is: in reality, no, it's not. It's never going to put your job at risk, because every time you automate something, you gain some additional insight into the system. Every time you automate something, you learn something new, and you identify additional tasks that you'll be able to automate in the future.
And so, automation is core. It's approaching operations with a development mindset, because you want to programmatically solve problems so that you don't have to go in and make the same manual fix time after time after time. This is key to the SRE role, and it's key to your success in it.
And so, that other 50% of the time, I talked about that before, right? That's going to be escalations. It's going to be on-call work or, in some cases, for a large enough organization, SRE might be 24-7. It's going to include customer-facing work, right? You are going to have to interact with customers, and it's going to include being the source of knowledge for your group. Because SRE crosses all boundaries: it knows about hardware, it knows about software, it knows about monitoring, it knows about logging, it knows about automation. And so, they understand all of the different components. They have the institutional knowledge of how to keep the product up and running.
As a product manager, I like to make the joke that when I want to know how software's designed to run, I go and I ask the developers who wrote it.
When I want to know how it actually runs, I go and I ask SRE, because they're the ones who get to deal with the implementation every day. And so, bridging the gap between what actually happens and what we want to happen, that's so important to the SRE job, because they have day-to-day, hands-on interaction with how people actually use the product. So, SRE is constantly feeding data back into development so that development can make the software better, at the same time that they're automating in all of the resiliency.
SRE understands that failure will happen. Failure is just the nature of business. You cannot design a perfect system. And so, what SRE excels at is programmatically identifying potential failures and solving them ahead of time, and it's also good at identifying how we are going to solve immediate tactical problems.
And so, I talked a minute ago about monitoring, right? That traditional IT room with all of the screens. Well, monitoring and logging are just key to the SRE role. SREs, as they monitor, are keeping track of what's happening in real time. Logging is an archive of everything that's happened, so that you can go back and examine it later. So, your monitoring is going to give you the ability to anticipate failures and see them coming so that you can proactively solve them.
Logging is for when you get an unanticipated failure. It allows you to go back and see what happened. You can do an RCA, a Root Cause Analysis, on it and figure out how to solve it, not just for now, but for the future. That gets back into the automation again, right? If you know what happened, and you know why it happened, you can then adjust that monitoring that we were talking about, so that the monitoring itself will catch this edge case and you don't encounter that failure ever again.
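The loop described above (monitor in real time, log everything, do an RCA from the log, then adjust the monitoring) can be sketched in a few lines of Python. This is only an illustration, not anything shown in the video: the `Monitor` and `EventLog` classes, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Monitor:
    """Real-time checks: each rule looks at the latest metrics and may fire."""
    rules: dict[str, Callable[[dict[str, float]], bool]] = field(default_factory=dict)

    def add_rule(self, name: str, predicate: Callable[[dict[str, float]], bool]) -> None:
        self.rules[name] = predicate

    def check(self, metrics: dict[str, float]) -> list[str]:
        # Names of every rule that fires for this sample.
        return [name for name, fires in self.rules.items() if fires(metrics)]


@dataclass
class EventLog:
    """Archive of everything that happened, kept so it can be examined later."""
    entries: list[tuple[int, dict[str, float]]] = field(default_factory=list)

    def record(self, timestamp: int, metrics: dict[str, float]) -> None:
        self.entries.append((timestamp, dict(metrics)))

    def before(self, timestamp: int, count: int) -> list[tuple[int, dict[str, float]]]:
        # The last `count` samples at or before `timestamp`, for root cause analysis.
        older = [entry for entry in self.entries if entry[0] <= timestamp]
        return older[-count:]


# Hypothetical metrics; the only rule at first watches disk usage.
monitor = Monitor()
monitor.add_rule("disk_nearly_full", lambda m: m["disk_used_pct"] >= 90.0)
log = EventLog()

samples = [
    (1, {"disk_used_pct": 40.0, "open_connections": 120.0}),
    (2, {"disk_used_pct": 41.0, "open_connections": 900.0}),
    (3, {"disk_used_pct": 41.0, "open_connections": 1900.0}),
]

# Monitoring and logging run side by side on every sample.
alerts_before = []
for ts, metrics in samples:
    log.record(ts, metrics)
    alerts_before.append(monitor.check(metrics))

# Suppose the service failed at time 3 with no alert. The RCA reads the log...
history = log.before(3, 2)

# ...sees connections climbing, and the monitoring is adjusted to catch that case.
monitor.add_rule("connections_climbing", lambda m: m["open_connections"] >= 800.0)
alerts_after = [monitor.check(metrics) for _, metrics in samples]
```

Replaying the same samples after the new rule is added shows the alert firing one sample before the assumed failure, which is the "catch this edge case next time" step.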
So, SRE is just core to a successful business, and most companies will find they have a role pretty similar to SRE today. In the world of software, in the world of technology, it's something that we already have, even though we may not be calling it SRE. But if you're talking to a startup, a very young company, they're going to say, well, you know, we don't have the budget to go out and develop an SRE organization to start with, right? We only have 25 employees, we only have 30 employees, and that's okay.
The important part of SRE for a small company is not so much having someone with that job title, because your developers are your operators at that point. It's engineering everything they do with that SRE mindset: that failure is an option and, as a matter of fact, should be predicted, but is something that you can automate to solve. It's something that you can create enough redundancy for that, when failure does happen, it's not a big deal, because you're resilient enough that nothing goes down.
And so, as long as you develop with that SRE mindset, and you are being resilient, you're being redundant, you are constantly going back and automating problems so that you don't have to manually fix the same thing over and over and over again, and you're doing good root cause analysis on actual failures so that they don't happen again, and you're monitoring so that you will know when they're about to happen and you can head it off at the pass - that's really the key.
Large organizations, they can afford an entire SRE department. They can stand it up, or they can transition an existing operations group into it by empowering that operations group. Again, that 50/50 rule: spending half their time automating, half their time fixing problems, and automating all of the things. Automate everything, because the less manual work and manual intervention you have, the happier that SRE team is going to be.
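The redundancy point made earlier in this section (enough redundancy that a failure is "not a big deal") can be illustrated with a small failover sketch in Python. The function and replica names here are hypothetical and are not from the video; a real system would usually add timeouts, health checks, and alerting on every failed replica.

```python
from typing import Callable, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # deliberately broad for this sketch
            # Remember the failure so it can be logged and analyzed later.
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical replicas: the primary is down, the secondary is healthy.
def broken_primary() -> str:
    raise ConnectionError("primary is down")


def healthy_secondary() -> str:
    return "ok from secondary"


result = call_with_failover([broken_primary, healthy_secondary])
```

Because the caller only sees the secondary's response, the primary's failure does not take the service down; it just becomes an entry to follow up on.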
Thank you so much for your time today. If you have any questions, please drop us a line below. If you want to see more videos like this in the future, please do like and subscribe and let us know. And don't forget: you can grow your skills and earn a badge with IBM Cloud Labs, which are free, browser-based, interactive Kubernetes labs. You can find more information on them by looking below. Thanks again!
Info
Channel: IBM Technology
Views: 21,598
Rating: 4.9274755 out of 5
Keywords: Site Reliability Engineering, SRE, cloud computing, IBM, IBM Cloud, DevOps, Monitoring, IT
Id: ztIIcXNzMN4
Length: 8min 12sec (492 seconds)
Published: Fri May 07 2021