AWS re:Inforce 2021: Scaling security, one human at a time

My name is Eric Brandwine, and I'm a distinguished engineer with the AWS Security Team. One of the most common questions that I get from customers is, how do you do that? I'm an engineer, and most of the people that I talk to at our customers are engineers. Engineers like technology, which is good. That's why they're engineers. And usually when I get this question, what they're really asking is more about the tools that we have, the things that we've purchased or that we've built, the technical mechanisms that we use to run the AWS Security Team. But I've been at this for a while, and that means two things: One, I've made a bunch of mistakes, and I've learned a lot of things that don't work; and two, the scale is daunting. Exponential growth is implacable and impressive. I've had to completely change how I think about getting the security job done. I've realized that the single most important thing that we have is our organization, our humans, our people. Scaling as a leader, even scaling as an engineering leader, is a very people-intensive process. My job is still technical. I still dive deep and get involved in the details, but the most important thing that I work on, that I build, isn't built using computers. So as is the style these days on the Internet, we're gonna talk about scale using bananas. And so at first I built tools. When we started AWS Security, it was me and three managers. I was literally 100% of our engineering bandwidth. If it happened, I did it. This is the job that I thought I was gonna have when I was in school, and I was super happy with it. But then the banana turned into a bunch of bananas. As things got larger, I helped other people build tools. This was great. We were getting a lot more done, and I had to learn a bunch of new skills. It still kind of matched my expectations. We were a team, I was becoming a leader, and it was awesome. But remember exponential growth, the banana bunch is now an entire banana tree. 
The cloud got bigger. The team got bigger. It got to the point where no single person could even be aware of every single tool, every development effort that was underway. The goal now was to build the org such that the right tools were built at the right speed with the right quality bar, even more new skills to learn, a bunch of new challenges, and still tons of fun. We were a team of teams, and I figured out that the most important thing that I was building was the AWS Security Organization. But remember, exponential growth keeps on marching. The only analogy that I have for this kind of scale is an entire banana plantation. It got to the point that no single person could even be aware of every single hiring decision, every single headcount allocation, every single roadmap trade-off. I really had this problem at the banana tree stage, but it wasn't until the universe rubbed my nose in it that the dime dropped for me, that I really understood the challenge that I had to work on. I couldn't build the AWS Security Organization. It was too large. It changed too quickly. I couldn't keep up. My mechanisms for scaling myself kept breaking down. I was forced to personally confront a lesson that large-scale leaders throughout history have had to learn. No longer could I stretch myself such that I had a link, no matter how tenuous, to even the big things that were going on. Even if I ignored the actual security work, I couldn't keep up with the pace of building the organization. What now? We have to build an organization that not only built the right tools with the right speed and the right quality, it had to, on its own, build more of that organization. It had to be self perpetuating and mostly autonomous. Now what? How do we build this machine that builds itself using humans, which are notoriously non-standardized and difficult to predict? So just a caveat here, I used the word I an awful lot on this slide. 
This was my story about my growth alongside the growth of AWS Security. I'm one of many people who have helped grow and shape this organization. There may be a stage that comes after banana plantation, but this is as far as I've gotten in my career growth. I'll let you know if I get there. Anyway, the answer to building an organization that builds more of itself is large and complicated, and I won't claim to really know all of it. But an important part of the answer is culture. We are, all of us, members of a bunch of different cultures. I'm an American, I'm a Jew, I'm an Amazonian. Each of these groups has their own customs, their own norms, and they tend to be self-perpetuating. I can walk into a synagogue, even one I've never been in, and I know what to say, what to do, what not to do, even though I've never been there before, never met these people. Amazon has gotten large, and there are teams I've never heard of before. But when I meet a new team, we're working off the same playbook. We've talked a bunch in various fora about elements of our Amazon-wide culture, like our leadership principles, our use of bar raisers, working backwards documents, things like that. Those are incredibly important mechanisms for the Amazon-wide culture, but AWS Security isn't all of Amazon. We've got our own peculiarities, our own priorities. And we have to make sure that our team is working on the right stuff in the right direction and that our new hires are brought into this culture. How do you make a culture? And again, I don't claim to have the entire answer, but one of the mechanisms that we use for intentionally building and steering our culture is tenets. Effectively, tenets can be viewed as rules for a culture, what we want people to do, how we want them to act, what is important to us as a group. Good tenets are hard. Often, you've got a culture that has evolved. Hopefully, you mostly like it. 
It's really tough to think about your culture, to step outside of it and look at it objectively and to write down the rules that capture what you like about it. It can be harder to write down rules that address what you don't like about it. Often cultures feel instinctive and subconscious. Not only do you have to be able to think about these all but automatic behaviors, you have to worry about the unintended consequences of the changes that you're trying to make. You're not gonna get this right the first time, and that's okay. Good tenets are often in tension with each other. They're not just simple declarative statements that can be just followed. They're guideposts, ways of thinking that help people make good decisions in unforeseen situations. If you're gonna write down tenets, if you're gonna try to make them an element of your culture, you have to take them seriously. As leaders, you have to follow the tenets, to use them in conversation, or nobody in your organization is gonna take them seriously, either. So enough lead-up here, let's get to our tenets. And this is how tenets are always presented at Amazon: Our tenets, unless you know better ones. And it's an honest offer. I've taken feedback on our tenets. I've given other teams feedback on their tenets. Literally everyone is invited to speak up here, and it can be difficult for junior people, new people. It can be very uncomfortable for them to feel comfortable challenging tenets, and it's our job as leaders to give them the space and the comfort to do so. Our tenets are posted publicly, publicly within Amazon on the AWS Security Wiki page. Tenets can't be limited access. They can't be need-to-know. They can't be restricted. And so literally anyone in Amazon that wants to read our tenets can go to our Wiki page, and they can read our tenets and they can understand what we value and how the team is going to prioritize their work. 
So one, we lead in preventing unauthorized access to AWS resources, our customers' or ours. We continually assess our systems, identify exposures, evaluate risks, and relentlessly drive mitigations. Our first tenet seems pretty obvious for a security team, but there's some nuances here. Just writing this tenet down changes it from an implicit assumption into an explicit expectation of our team. At pretty much every re:Invent, Andy Jassy would say security is job zero at some point during his keynote, and he's serious about that. Every team owns the security of their services, which is great, because at this scale, we have to have everyone pulling with us, but we're out in front. We lead. This is our focus. But it doesn't say we lead Amazon or we lead AWS. We expect our team to be out in front, not just inside the company, but outside as well. If there's a security issue to be found, we're the ones that should find it. Our customers or ours, this scopes our responsibility. Of course, the AWS services, infrastructure, data centers, et cetera, and all of our internal usage of AWS is within our charter. But this tenet tells us that if our customers are not getting the right outcomes, that we need to engage. We have the shared responsibility model. Our customers are responsible for their own security, and they have deeper knowledge of what is and is not acceptable for their applications than we do. However, this tenet tells us that we need to care about the actual resulting customer experience. It's the bit of our culture resulting from this tenet that led to the launch of our customer-facing security services, like GuardDuty and Security Hub. And relentlessly, security can be exhausting. We're here for the long haul. We have to work with the service teams for years, and we have to have a good working relationship with them. 
This tenet tells everyone in the Security Organization and everyone else that reads our tenets that we expect our team to doggedly drive issues until they are done, done, done. The fact that one of our engineers won't drop something isn't them being annoying. It's them doing exactly what they should be doing. Two, we constantly provide visibility to senior leadership into the biggest potential risks backed up with data and carefully prioritized. I've talked in the past about the culture of rapid escalation that we have at AWS, and this tenet has elements of this. Constantly, we are expected, not only to report up to our leadership, we're expected to do so all the time. There are plenty of companies where you avoid escalation, where bothering your executives is a sign that you failed to do your job and that you need help. Here, we not only expect people to keep our leaders informed, we expect them to do it all the time. Backed up with data. Security is inherently dealing with the unexpected, unique unforeseen events, but even so, security at AWS is a data-driven discipline. We may not have all the data or even a lot of data, but anytime we engage, we bring the data we have. When we have the data, when it's available, we have it. We're familiar with it and we can speak to it. And carefully prioritized. What we're saying is that we're gonna call it like we see it, no matter how uncomfortable that may be. If we think that flagship launch at re:Invent isn't ready and won't be ready for six months, that's what we're going to say. If the security thing that Adam or Andy has been asking about literally every week for the past couple of months is what we believe to be the third or fourth or 12th priority, that's what we're going to say. Our most constrained resource, would anyone care to guess what our most constrained resource is across AWS and Amazon? It's our builders, our engineers. 
Every day that they spend working on a security effort, they're not working on new features or services. They're not even working on other security tasks. Security teams inevitably run into tough trade-offs, and we don't punt on that problem. We own it, we dig in, and we bring forward our best suggested prioritization. Three, we escalate appropriately yet aggressively to ensure that security issues are resolved promptly and with high judgment. If in doubt, we will escalate. Right here, clearly articulated, we've got our culture of escalation. I could highlight the entire tenet but will refrain from doing so. If you hang out at Amazon long enough, you hear people talk about making high-quality, high-velocity decisions. That's what the promptly and with high judgment is getting to. We're gonna do it fast, and we're gonna do it right. We reject the false choice between the two. And the way we do this is by escalation. If a group of us is not confident in the decision that we're making, or if we can't converge on a decision, we don't yet have the right people engaged, and we need to escalate. It's easy to escalate aggressively. I could just page Adam or Andy or Jeff for every issue, and that would qualify as aggressive escalation, but it wouldn't be appropriate. But again, we're going to do this well, and we're gonna do this quickly. We're gonna eat our cake and have it too. We do this by calibrating our team members, by encouraging escalation, and by giving them cheap, low-risk escalation paths. It can be really uncomfortable for a more junior engineer to escalate to a general manager or vice president that they've never met. Honestly, it's unfair to expect them to do so. Instead, we make it clear that escalation within the AWS Security Organization is free. Everyone has a manager. Everyone has teammates that they trust. Those are great first points of escalation, and those people can help calibrate the escalation. 
They have a broader network that can help bring the right people in. And everyone in our team knows that they can call on the leaders of AWS Security at any time, day or night, if they need our help. And one of the things that we help with is calibrating escalations. We have a bunch of on-call rotations. There's a dedicated pager carrier for these internal escalations and I'm on that rotation. But Steve, CJ, and I, as well as other leaders, are always available. I will underline this bit here. If in doubt, we will escalate. This is super clear, in plain, unambiguous English. This is an example of a tenet as declarative instruction. This tenet came out of a review of our tenets with Andy Jassy. He's the one that added this sentence. It captures a clear expectation of our most senior leadership from the CEO on down. One of the common concerns that I hear in response to talking about this culture of escalation is that our pagers must be going off all the time. Do we ever sleep? Isn't it exhausting being on-call all the time? In reality, no, it's not a problem. The number of inappropriate escalations that I've been involved in is stunningly low. I've been at the company for 13 years. Very few. Almost every time I've been pulled into an issue, it's been the right call. The times when someone else could have handled it or there was a better way to escalate just serve as feedback to me and to the other leaders on our training and our tooling to make sure that there are better escalation paths next time. When we've dug in and something turns out to be no issue, people often apologize to us. It's natural. You've got someone. This is their first security event. They're not sure what to do. They hesitantly push the Page Someone Right Now button, and it comes back and it says, there's no issue. There's nothing wrong. And the natural reaction there is to say, I'm so sorry for paging you in the middle of the night for something that was no issue. 
And people across AWS Security at every level always respond, "Nope, that was the right thing to do." I would rather have a mountain of no issues than a single missed issue. It's wonderful. It thrills me so much, and it's a sign of the culture in action. Four, we are guardians of customer privacy and trust. We advocate for our customers in all security engagements. This tenet is pretty straightforward. It tells us who we're working for. Privacy and trust, this bit clarifies our charter. Our relationship at AWS with our customers is deep and rich. They trust us not just with their data, but with the computations on that data. It doesn't matter how much vetting you do when selecting a partner. It doesn't matter how many compliance controls and audits they can produce evidence for. It doesn't matter how many security or encryption features they offer. At the end of the process, you have to make a decision to trust this partner. It is our job to make sure that AWS is worthy of the truly humbling amount of trust that our customers have placed in us. But this tenet doesn't just talk about trust. It also talks about privacy. We just had the Fireside Chat about privacy. Privacy is a foundational part of what we do. We ensure that the data that customers have trusted us with is used and retained in accordance with their expectations. And all, I like this word here. It doesn't matter if you're involved in an application security review, compliance audit, design review, high-severity security ticket, or literally any other activity, the answer to, "Is now the time to speak up for our customers?", is always yes, always. Our customers can't be present in these meetings, these engagements, these tickets, so we're there to speak for them. This can come across as corny and perhaps trite, but I actually find it really empowering. One of our leadership principles is customer obsession. It's a pillar of the Amazon-wide culture. 
In my role as a security engineer, I regularly take unpopular, uncomfortable, and even borderline heretical positions, but I've never had a problem as long as I could show that I was doing so from a position of customer obsession. This tenet is one of the ties between our team's culture and the broader Amazon culture. Five, we own security for all of AWS, including third-party and open-source software. We take nothing as a given and extensively test all of our components, even those built by other parts of the company. If something doesn't work for us, we will move off of it. At re:Invent in 2017, I gave a talk about normalization of deviance. In the talk, I go through the tragic story of a plane crash. Highly trained pilots failed to follow the approved procedures, leading to overrunning the end of the runway, a crash, and the deaths of all on board. I'm not gonna recount that entire story here. It's literally a different talk, but it's a topic that I worried about four years ago, and it's one that's still front of mind for me today. When you dig in, you realize that these highly trained pilots, they go through a tremendous amount of training. The airplane is incredibly expensive. You don't entrust it to just anyone, yet these highly trained pilots made these mistakes that led to their deaths and those of all of their passengers. And when you dig in, you learn that this was not a one-time failure. These pilots didn't come to work one morning and say, "Let's get sloppy today." Slowly, likely over years, their discipline slipped. There were no negative consequences for their actions, and so their discipline slipped some more. The local community of pilots, perhaps it was just these two that work together, perhaps there was a larger group that all worked together, all acted alike. Had an outsider come into that group, they would have been appalled. And this is called normalization of deviance. 
Our application security process is constantly evolving, improving as we get better at our jobs and as our tools get better. Yet, the services that we launched a year ago or three years ago or five years ago were pretty good. Customers liked them. And that older application security process was a lot easier. This team is under a lot of pressure to launch, and just a couple of years ago, we didn't have to do this step or that step. Can't we skip them just this once? And perhaps most frustrating, it's hard to know when your security efforts have made a difference. "We fuzzed that piece of code for weeks. We fixed a dozen bugs. It's been running flawlessly for a year." Is it running cleanly because we fuzzed it, or would it be doing just fine without the fuzzing? If you look back on some of the largest IT security flaws in the industry, things like Heartbleed, EternalBlue, Spectre, and Meltdown, one thing that they all have in common is that they were present in the code or in the hardware for years before they became public. Is my security work making a difference at all? The feedback loop between what we do and the ensuing results can often be very long and very lossy. In some cases, it's possible to make mistakes in security that have no negative consequences for years. You can see how it would be easy in security for your discipline to slip. You relax a bit, nothing bad happens. Teams move faster. You relax a bit. Nothing bad happens. 20 goto 10. At the end of this process lies a loss of customer privacy and trust. We cannot go down this path. So five, we own security for all of AWS, including third-party and open-source software. We take nothing as a given and extensively test all of our components, even those built by other parts of the company. If something doesn't work for us, we will move off of it. This tenet is one of our efforts to prevent normalization of deviance. All of AWS. Here's another bit that scopes our charter. 
We own security for AWS, full stop, from the cement slab in the data center through the power and cooling gear of the servers, the network, the services we build, everything. It's a daunting task, but it makes the ticket routing flow chart really simple. You've got a security problem. We're it. It doesn't matter who wrote it, where we got it, who runs it. If it affects the security of AWS, it's ours. Across the Internet, there are de facto standards. If you need a library for parsing XML, for terminating TLS, for any of a huge number of common tasks, there's a preferred choice. Everyone's using it. It's the most popular library for this task, pretty much everywhere. The obvious assumption is that it's good and that we should use it too. We're not allowed to make that assumption. It may be good. It may not be good. We have to go and actually find out, get actual facts, make an informed decision. We take nothing as a given. This tenet is telling us to make our assumptions explicit and then to question them. This is incredibly hard to do, but you get better at it with practice. We can't become complacent. We can't allow our discipline to slip. We will move off of it. This last clause tells us that there's yet another way, that not only is it okay to put ourselves in an uncomfortable situation, we're expected to do so. Amazon has a rich technical legacy. We've got decades of tools built by really smart people across the company. But many of these tools were built, for example, for a single-tenant e-commerce site, massive, scaled, secure for their intended purpose, but very different from the low-level multi-tenant infrastructure services that AWS started with. It may be that everyone else in the company is using some tool, but if it's not right for AWS, we're not going to use it. If we're already using it, we will migrate. 
These migrations are expensive, and it can be difficult to tell the service team, your service that you're perfectly happy with and really proud of needs rework. Instead of that cool feature, you need to do this migration. This tenet tells us that not only are we allowed to speak up here, we're obligated to do so. Six, we are the one-stop shop for all security questions within Amazon. In cases where we don't own the answer, we own getting the question answered. Amazon is a large distributed company. Teams are good at navigating Amazon within some radius, and commonly performed tasks converge to some level of reasonable efficiency. But outside that radius, it can be very difficult to navigate, to find the right owner. And so one pattern that I've seen here is that someone cuts a ticket to their best guess for the right owner. That on-call engages and says, "Nope, that's not us. Try team two." Team two engages and says, "Nope, that's not us. Try team three." This can continue for quite a while. And in the most frustrating cases, you wind up looping back to a team you've already talked to. I call this ticket ping pong. This is a waste of everyone's time. It doesn't move us any closer to resolving the issue, but it's rational behavior for each of those on-calls. Each of those on-calls is trying to limit how much time they spend on a problem that isn't their problem. They're being helpful, but they're only being locally helpful. For us, security is normal. The quotidian pager tickets, the constant risk management decisions, it's what we're trained for, and it's familiar to us. For our service teams, an urgent security issue is unusual, unfamiliar, unsettling. Getting lost in a twisty maze of security passages all alike is exactly the wrong outcome. If we've got an urgent security issue, we can't waste time on ticket ping pong. As large as our team is, it's way too small to secure AWS alone. We own security, but security is everyone's job. 
If we're gonna drive the right outcomes for customers, we need all the service teams, everyone pulling along with us. And to be clear, the right outcomes for customers don't mean Security always gets its way. The service teams have deeper business context than we do. And in order to get to those high-quality decisions, we need to have a productive relationship with them. Even if the issue isn't urgent, tickets that bounce around from queue to queue, email threads where nobody owns the issue, unproductive meetings with the wrong people are frustrating and disappointing and they erode that relationship. If you have a security problem and you get ahold of someone in AWS Security, it's sticky. You may have found the wrong person. It may not be their job to help you, but per this tenet, our answer has to be, "That's not us. You probably want Team X. I will reach out to them and find an owner for this." We're gonna spend more of our time right now, locally suboptimally, so that we get a more globally optimal result. It's inherent in security that you're gonna have plenty of uncomfortable, unexpected discussions that put a strain on your relationships. You've got to take any opportunity you can to invest in those relationships, to build strength that you can rely on when you need it. We found that this intake process, this first impression, the time between having a question or an issue and finding someone to help is surprisingly important. Seven, we drive our work to focus on the most critical security risks for the business. They will be prioritized first for the business and then for the service teams. We will ensure each expectation is well-understood, actionable, and supported by appropriate tooling. Security is the art and science of risk management. There's no organization on earth that has zero security risk. The only way to drive security risk to zero is to not offer any useful services. 
The old joke is that a pair of wire cutters is the best network security tool. Security engineers like to fix things. This is a natural human reaction. You get that hit of dopamine, the sense of accomplishment. It's great. And speaking as a security engineer, one of the hardest things to do is to walk past a problem that you know you can solve. A team is struggling, a customer isn't getting the right experience, and you can help, but you must not. Because if you spend your time helping out here, you're not gonna be spending your time working on the bigger, more ambiguous, more important security challenge that you should be working on. This tenet is telling us to stay focused. You can escalate, you can phone a friend. You can make the argument that this new issue is more critical to the business than that other one, but we've gotta keep working on the most important things. On the last slide, I talked about building relationships with the service teams. These can't be tiny little badge pictures next to correspondence and tickets. They need to be colleagues, real human three-dimensional beings. As these relationships grow, we're gonna empathize with our friends on the service teams. Another perfectly natural human reaction is to think, we've cut them a lot of tickets recently. That last on-call shift was really tough. Do I really need to page them for this one? This highlight here, first for the business and then for the service teams reminds us who we're working for. It is the goal of the business to drive long-term customer value, leading to a virtuous cycle of deeper relationships, more usage of our services, and greater customer value. So in this case, the business is a proxy for our customers. We have to do what's right for customers. That doesn't mean that we don't empathize with the service teams. 
In our hypothetical example, you have to cut the pager ticket, but then you can immediately pick up the phone and reach out to the on-call and make sure they're doing okay, that they have everything they need. You can reach out to the service's general manager, and you can express your concern about the ticket load and offer to help. But regardless, the prioritization and urgency of the asks coming from our team, our expectations of the service teams are going to be driven by customer risk, customer expectations, and customer needs. And actionable, in security, it's easy to tell people what to do. We all know what the right things are. Use least privilege, revoke old keys, a whole bunch more. Saying it doesn't help much. Anyone that's read more than three or four security blog posts, they know this stuff. If they care about security, and security is everyone's job, and if they're competent, then why aren't they doing it? It's because it's hard, and they don't even know how to get started. If we expect a team to do something, it has to be actionable. There has to be a clear next step, a path for them to follow. We've talked a bunch about escalation, and this applies to escalation. Eric, the grumpy security engineer that just wants to get stuff done is gonna send an email to a VP that says, "Your team hasn't finished deprecating that old TLS protocol yet, and you should feel bad." This isn't helpful. It erodes the relationship. It's not actionable. It's gonna get a head shake, a shrug. It's gonna get deleted. If instead, Eric, the AWS Security engineer, sends this VP an email that says, "Of 100 load balancers, your team has migrated 87. There are 13 remaining. Of those 13, four are blocked on this feature that's due to be released on this date. That leaves you with nine actionable load balancers that you should move right now. Here's the list of the nine, and here's a link to the instructions that you should follow." Then I'm helping them do the right thing. 
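As a rough illustration of the second email, here is a minimal Python sketch that turns a migration inventory into the same kind of actionable summary. The `LoadBalancer` record, the `build_report` helper, the `blocked_on` field, and the runbook URL are all hypothetical names invented for this example; the talk does not describe how such a report is actually produced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoadBalancer:
    """Hypothetical inventory record for one load balancer."""
    name: str
    migrated: bool
    blocked_on: Optional[str] = None  # assumed: name of a blocking feature, if any


def build_report(inventory: List[LoadBalancer], runbook_url: str) -> str:
    """Summarize progress and list only the items the team can act on now."""
    done = [lb for lb in inventory if lb.migrated]
    remaining = [lb for lb in inventory if not lb.migrated]
    blocked = [lb for lb in remaining if lb.blocked_on]
    actionable = [lb for lb in remaining if not lb.blocked_on]

    lines = [
        f"Of {len(inventory)} load balancers, your team has migrated {len(done)}.",
        f"There are {len(remaining)} remaining. "
        f"Of those, {len(blocked)} are blocked on a pending feature.",
        f"That leaves {len(actionable)} actionable load balancers to move right now:",
    ]
    lines += [f"  - {lb.name}" for lb in actionable]
    lines.append(f"Instructions: {runbook_url}")
    return "\n".join(lines)


# Example data mirroring the numbers in the talk: 87 done, 4 blocked, 9 actionable.
inventory = (
    [LoadBalancer(f"lb-{i:03d}", migrated=True) for i in range(87)]
    + [LoadBalancer(f"lb-{i:03d}", migrated=False, blocked_on="pending-feature")
       for i in range(87, 91)]
    + [LoadBalancer(f"lb-{i:03d}", migrated=False) for i in range(91, 100)]
)
report = build_report(inventory, "https://example.com/tls-migration-runbook")
print(report)
```

The point of the sketch is the shape of the output: counts backed by data, blocked items separated out, and a concrete list with a next step, rather than a bare instruction to "finish the migration."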
I'm guiding them down the path. That second email is actionable. Supported by appropriate tooling. If you have a narrow problem, then you want the owners of that problem to own the tooling. For example, our hypervisor team owns their own build tools, their own test tools, and their own patching tools because our standard tooling doesn't work for them, and they're the experts. But if you have a broad problem such as patching general purpose Linux boxes or managing IAM policies, you need to have centrally-owned tooling. If there are a hundred teams that need to patch, not only is it gonna be really expensive to have all 100 of them build their own patching tooling, you can have 100 different sets of bugs, 100 sets of subtly different behaviors. It's gonna be a disaster. It will be cheaper and better to invest in a single centrally-owned set of tools. And it's gonna be faster. My nephew was a U.S. Marine and at the firing range, they taught them that slow is smooth, and smooth is fast. And that really captures how I think about a lot of things in security. At our scale, you have to learn how to panic strategically. Slow is smooth, and smooth is fast. If you announce, "Okay, everyone upgrade TLS, go!", then it's gonna be satisfying. A few teams are gonna figure it out quickly. Your numbers are gonna start to move. There's gonna be a lot of activity. You're doing security! And it's gonna be a mess. You've successfully panicked, but it's not strategic. If instead you dive deep on the TLS upgrade problem, on how services are using TLS, which libraries or which services they're using to terminate TLS, how customers are connecting, what impact this migration is gonna have on customers, and then you plan and build tooling to support the common use cases, it's frustrating. Your numbers sit at zero. There's no progress. Most of the teams are doing nothing, but then the tools become available. They get validated by the early adopters. 
They get rolled out broadly, and all of a sudden, there's this tidal wave of progress. As hard as it is to wait for the tooling, this path is faster than the okay-go method. This is moving with urgency, but it's doing so strategically. It's panicking strategically. One consequence of this is that we're a builder organization. We have more software developers than we do security engineers. That's not to say that we build all of these tools. For example, our patching tools are owned by our builder tools organization because patching is a software change, just like any other software deployment, and the same safety testing and availability concerns apply. Sometimes these tooling efforts are small and internal, and no one ever hears of them. Sometimes they're major investments that we launch publicly. Delete old unused keys is one of the reasons that we built access-key-last-used in our IAM service. Use least-privilege is part of why we have IAM Access Analyzer and VPC Reachability Analyzer. Rotate all your passwords was a driver for our SSO service, our single sign-on service. Rather than making it easier to rotate passwords, just get rid of as many of them as you can. There are still urgent security issues where we page people in and figure out the path forward in real time. But this tenet tells us that we always dive deep, and we provide clear, actionable guidance and support any broad efforts with tools. So the goal of this set of tenets, and really of any set of tenets, is to equip a set of people to make good decisions, to allocate their time well, and to prioritize the things that are, to us, the most important. And to be clear, it's not, how do I train people to make decisions the way that I would, or even the way that Steve would? But how do we give a growing group of people a framework, a core set of shared values and beliefs? 
To lend some breadth to this talk, to show how tenets can be used by teams that aren't security teams, and other teams within AWS, I've chosen a couple of tenets from other teams to share. The AWS Cryptography team owns services like KMS, our Key Management Service; ACM, the Amazon Certificate Manager; and more. They're also our internal experts on cryptography. We chatted with Ken Beer earlier in the day. I love their tenets. And today, we're gonna look at two of them. Trust is hard to earn and easy to lose. To maintain trust, we prioritize security, durability, and availability, in that order, over building new features. This tenet is exemplary. It is a super clear expression of what the team values. A developer, a product manager, a general manager who faces a tough decision, a summer intern, a new hire can just read this and know how to make their decisions, can know what is important to this organization. Durability, we never lose a key, but we will delete a customer's key when the customer asks us to do so. And this tenet, in a single sentence, says, we're gonna tackle one of the hardest problems in computer science. Making data like keys durable is a heavily studied problem. Most of the solutions revolve around keeping multiple copies, ideally on multiple systems on multiple types of media in multiple locations. This works, but all those techniques also make it harder to delete data. Real deletion means not only is the key not accessible, it's no longer recoverable from any media anywhere. Doing either one of these things is hard. Doing both of them in a single system is a real challenge, and this tenet keeps the team focused on it. I think that this tenet is one of the reasons why the KMS team has been as successful as they have at solving this problem. It keeps the entire team focused on solving both of these challenges simultaneously. One of the reasons that I love these tenets is that the team uses them in conversation. 
"Our trust tenet says that we should do this first," or "that would be awesome, and I bet customers would love it, but I don't know how to square that with our durability tenet." They're a part of the daily conversation. They're influencing the members of the team, spread from team member to team member. As I expect pretty much everyone watching this already knows, S3, the Simple Storage Service, is our highly durable, scalable object store. It's one of our oldest services, and it's a foundational building block. And so most of our customers are using it. And as a result, it's one of the largest distributed systems on earth. At its core, S3 is really simple. Just put and get over HTTP, pay-as-you-go storage. But real customers with real applications have interesting requirements, and S3 has become so much more than that. Scalable, we scale availability, speed, throughput, capacity, and robustness to support an unlimited number or variety of web-scale applications. We design our systems to use scale as an advantage, so that system growth increases, not decreases, our availability, speed, throughput, capacity, and robustness. This tenet is non-obvious, but once you wrap your brain around it, it is clearly the right way to think about S3. If you're building for S3, it's gonna get large. At S3 scale, even seemingly trivial jobs are distributed systems, rather than tiny little Perl scripts. Most systems lose efficiency as they scale. Clearly N-squared scaling is out, but even N log N scaling can be an issue. You have challenges with multi-machine coordination, distributed knowledge, network throughput. It's a big, big problem. This tenet tells the S3 team that their designs not only need to scale, but they need to get better as they get larger. As a simple example, a web server is a single point of failure. If you lose that web server, you no longer have a website. So you run two web servers. It's great. 
Now your app can survive the loss of one of them, but you can't load them up past 50%, because if you lose one, you lose half of your capacity. As your fleet of web servers gets larger and larger, the portion of your capacity that any one server represents gets smaller and smaller. The cost of losing a single host goes down. And as you scale up, you can load the boxes closer and closer to 100%. It's easier and easier to take individual hosts out for maintenance to perform upgrades, software deployments, or even to handle failures. Now, this is a borderline trivial solution because we're dealing with an inherently very scalable share-nothing web server. It's a completely stateless service, but now apply that thinking to every layer of S3. Using this tenet, every engineer on S3 immediately thinks about any of their designs scaling to 10, 100, or 1,000 times as large as proposed and how they're gonna get better as they get larger. When you're working on S3, that's the right thing to do. First use matters. Our largest customer tomorrow may be using our services for the first time today. We balance our investments and aren't afraid to self-disrupt to ensure our services remain differentiated and compelling for these customers. S3 is the simple storage service, and it still is, but it also supports an ever-growing set of rich functionality, like cross-region replication, storage class tiering, information lifecycle management, encryption, permissions, retention, and more. This tenet is another rejection of a false choice. We aren't gonna choose between our large, experienced mature customers and our small customers who are just trying out S3 for the first time. We're gonna delight both of them. So here, for your screenshotting convenience, are all of our tenets on a single slide. These are not the tenets that we started out with, and they're not gonna be the tenets that we have in the future. Even now, we're talking about changes to them. 
It took a few iterations and some discussions with our senior leadership to get them to where they are now, and we're always open to changing them. These tenets work for us. They express the peculiar way that AWS Security thinks and engages with our service teams and with our customers. They may or may not work for you. They may or may not be a good starting point for your own tenets should you want to try this out. But I suggest that you do. I'm an engineer. I have a very technical job. Success in my role involves getting deep in the details, understanding the gritty reality of implementations, the internals of our services. Yet, this was an entirely non-technical talk because as my career has progressed, as the team has grown, I've realized, as have many, many before me, that the single longest lever that I have to pull is to make the team itself more effective, more consistent, and self-perpetuating. If we're gonna keep up with the innovation of our service teams and the innovation of our customers, we have to make AWS Security make more AWS Security. When customers ask us how we do what we do, the tools, the systems, the processes are all interesting. It's all fun to talk about, but the real underlying bedrock of our success at scale is our internal culture that keeps us working as a single team making high-quality, high-velocity distributed decisions. And our tenets are a visible mechanism defining that culture. Thank you so much for the opportunity to talk with you today. Have a great day.
Info
Channel: AWS Events
Views: 6,512
Keywords: AWS, Events, Webinars, Amazon Web Services, AWS Cloud, Amazon Cloud, AWS re:Invent, AWS Summit, AWS re:Inforce, AWSome Day Online, aws tutorial, aws demo, aws webinar, AWS re:Inforce 2021, security, identity, compliance, cloud security, AWS security, cloud security community, learning conference, security best practices, AWS re:Inforce 2021 Sessions, SEC200-L, Eric Brandwine, Culture of Security, 200 - Intermediate
Id: edWC5q-enX0
Length: 43min 56sec (2636 seconds)
Published: Thu Aug 26 2021