[ambient sounds] If there’s an upside to this past
year of working from home, it’s that I’ve been able
to spend it in my hometown. Amsterdam. I can’t remember
when I was able to spend such an unbroken stretch
of time in one place. And it’s my place. [ambient noise] Amsterdam is the story
of people harnessing technology. All of these canals
that were built centuries ago as a way to direct water away
from settlements became the centerpiece technology
that enabled the expansion of a city, the flow of commerce, and ultimately a connection
that reached out to the entire world. For me, this beautiful city is
a reminder of what becomes possible when you connect people
and ideas. It shows what can be done
by artists and technologists on behalf of people
in communities. [music playing] All we need is some creativity,
the right tools, and some room to build.
Technology always moves forwards but sometimes it's also good
to look back, to look at where we came from,
to understand our foundation, and the building blocks
that have propelled us at lightning speeds
into the world we live in now. Which is why I’m so excited
to have found a place that will help me tell the stories
of yesterday, today, and beyond. [music playing] [music playing] I’m speaking to you today from the CSM Suikerfabriek or Sugar City
as it is called today. A factory in Halfweg
just outside of Amsterdam and today
a traditional heritage site as it stands at the site of a castle
used by the Rijnland Water Board. [music playing] Sugar City or the Suikerfabriek
in Dutch was built in 1863 and was one of the many
sugar beet processing plants here in the Netherlands. The site has evolved
over the years and today it is used for music
events and retail shopping. And although
it has changed a lot, the stories this place tells about
technology, resilience and operations are still valuable
in the digital world we live in today. These are the evaporation boilers
that were used for sugar processing. Equipment like this was managed
from a central control room. If it were running today, it would
be operating much differently. Modern factories rely heavily
on IT devices, data gathered
directly from equipment, and real time computing to quickly
alert when there is a problem. A device like AWS Snowcone
could be one way for Sugar City, if it were operating today, to collect data as it comes
from the equipment. The Snowcone could process
the data, store it, and connect the boiler
to the factory’s alerting and automation systems. This way, the factory
isn’t bound by latency or transfer speeds
of the internet connection. This factory has stood here
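To make that concrete, here is a minimal sketch of the kind of edge logic a device like Snowcone could run next to the boiler. The sensor reading, the threshold, and the file name are hypothetical and purely for illustration:

```python
import json
import random
import time
from pathlib import Path

# Hypothetical values; a real deployment would read from the boiler's sensor
# bus and call the factory's actual alerting endpoint on the local network.
ALERT_THRESHOLD_C = 130.0
BUFFER = Path("boiler-readings.jsonl")   # local storage on the device

def read_boiler_temperature() -> float:
    # Placeholder: simulate a sensor reading.
    return random.gauss(120.0, 8.0)

def raise_local_alert(reading: dict) -> None:
    # Placeholder: notify the on-site alerting system, no internet required.
    print("ALERT:", reading)

while True:
    reading = {"ts": time.time(), "temp_c": read_boiler_temperature()}
    if reading["temp_c"] > ALERT_THRESHOLD_C:
        raise_local_alert(reading)            # react locally, immediately
    with BUFFER.open("a") as f:               # buffer every reading locally
        f.write(json.dumps(reading) + "\n")
    time.sleep(1.0)
```

The buffered readings could then be shipped to the cloud in batches whenever the connection allows, or by returning the device.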
for over a hundred and fifty years. It has a lot of stories to tell. We can learn a lot
from places like this. We just need to know where to look. [music playing] It has been a challenging year
for all of us. Everyone I know has been affected,
either professionally or personally. And I’ve talked to quite
a few of our customers who’ve been affected
in the most dramatic way, with either family members
or colleagues passing away. We have a long way to go
before we can leave 2020 behind. As builders,
we face unique challenges to keep doing what we were doing.
But we also have the opportunity to make a disproportionate
impact as digital has become the default way
to access services. With the success of these
essential digital services, we’ll continue to expand on that even if we can return
to some form of normal. 2020 has tested all of our systems.
Physically, digitally, and mentally. And with the changing needs
and behaviors of our customers, the needs of our applications
are changing as well. Just like this factory, many businesses have had to change
and adapt in order to survive. It’s never been more important
to stop, assess your needs, and ensure you’re focusing
on the right things. At Amazon and AWS, we’ve been
very fortunate in a lot of ways. Although many things are different, not a lot has changed
about how we work. Our focus on small
independent teams that are self-sustainable
and self-directed set us up for success
in these challenging times. And we’ve seen all around us
that many companies that rely on complex supply chains
to meet increased demand are having
serious challenges. And we are also very fortunate
at AWS to be part of a business
that relies on robust procurement and fulfillment processes.
It allowed us to keep up with demand even when other Cloud providers
were struggling. We’ve been doing remote development
for a long time so we already had the tools,
processes, and mechanisms in place to support
distributed collaboration for teams that are spread out
all over the world. We have a long history
of going where the talent is instead of forcing everyone
to come to Seattle. All our engineers are working
from home or, actually, anywhere they would like,
just not the office. And given the success we had with it, I don’t think that’s going
to change any time soon. Whilst Amazon and AWS have been
very fortunate to have been prepared for this, many of our customers
have not been so lucky. Think about the hospitality
and the airline industries. In talking to some of
our airline industry customers, they expect that even if we can leave
the pandemic behind in 2021, they do not expect
to fully recover until 2025. And they expect that
long-haul business travel may never return
to previous levels. Their cargo divisions, however,
do expect to survive. And I’ve seen how some of them
are pivoting to find new businesses or are doubling down
on the things that do work. And there’s an interesting analogy
with this sugar factory. With the diminishing need for sugar, they pivoted to focusing
on storage and packaging. And when that moved offsite, this factory was converted
into a retail and events space. But when this factory was focused
on only adapting to changes, the other sugar factories
in the Netherlands were branching out
into other industries. Today, there are two significant
Dutch sugar providers. But in addition to sugar, one processes other crops
like potatoes and onions, and the other one is a producer
of baking ingredients. They diversified to better
be able to withstand unexpected hardships like this. Now, we were very fortunate
that we had established a number of years ago an AWS
Disaster Response Team whose task it is
to immediately reach out to customers who may have been
negatively affected by a natural disaster,
like earthquakes or typhoons. And as such,
we hit the ground running when we saw how customers started
to get affected by the pandemic. Personally, I’ve been working
with quite a number of our customers who have been adversely affected to help them get fundamental control
over their cost by moving to, what I’ve called in past keynotes,
cost aware architectures so that the business has knobs
to turn with respect to scale, reliability, performance to meet
their needs now and in the future. A number of these customers
have also seen down times as an opportunity
to accelerate and innovate. One of the most important
ways that technology and the business
can work together is by using data to power
new experiences for your customers. I want to introduce you
to an AWS customer who really embraced this idea. Ava is a digital health company
with a mission to advance women’s
reproductive health by bringing together artificial
intelligence and clinical research. So, let’s hand over the camera
to Lea von Bidder, Ava’s co-founder and CEO, who’s recording in her home
in Switzerland. She has a great story
about the amazing insights that Ava is able to produce
for their customers by making new use
of the data they collect. Thanks, Werner.
I’m really happy to be here today and share with you insights
about how we at Ava harnessed the power of machine
learning, clinical research, Cloud technologies, and AWS services
to improve women’s health. Digital technology
and Cloud services have transformed almost every aspect
of our daily lives. With just one tap,
I can order a ride, get a room, share a photo. And, of course, digital technologies
have also started to revolutionize
modern medicine with AI being able to do
really interesting things, such as diagnose fractures
or detect atrial fibrillation. However, when it comes
to women’s health, progress has been very slow and this
is largely due to the lack of funding and due to the lack
of research in this space. We at Ava want to change that because we believe that so many women
all around the world face unique and often unaddressed
health challenges when it comes
to their reproductive health. Be that contraception, conception,
pregnancy or menopause. Our vision is to be
a long-term trusted companion for women across
their reproductive journey by giving them personalized,
data-driven and actionable insights about their health.
And how do we do that? We’re doing it by bringing together
artificial intelligence and clinical research. For our users, it all starts
with the Ava bracelet. A wearable device that collects
individual health data about her, including breathing rate, heart rate,
heart rate variability ratio, perfusion, and skin temperature. Every night, we collect more
than three million data points from each of our customers and integrate that data
into their health histories where our multiple algorithms
improve over time, learning each woman’s
menstrual cycle patterns and making individualized
health recommendations. In 2016, we launched
our first product. The Ava fertility tracker. And we have since then helped over
forty thousand couples get pregnant using the bracelet by empowering them
with actionable insights and by giving them a daily
real time fertility indication so they can better time
conceptive sex. But we are not simply
in the fertility business. What we’ve actually built
is a technology platform for collecting long-term personal
health data for women over five, ten, fifteen years and more. And in fact, it is managing
this huge dataset that is really at the core
of our business. So, when we started our journey
five years ago, we put a lot of thinking into
how to best set this up and which Cloud provider to choose
because we knew, if we were successful,
our data would grow exponentially and we would have
massive amounts of data to be processed and stored
in close to real time. We knew we needed to be able
to scale our computing power to manage the daily traffic peaks
that result from our users syncing their device each morning.
We needed security to protect the sensitive personal
health data we collect. And we needed
a flexible architecture to be able to add new services
and applications on a regular basis. Ultimately, we selected AWS
as our Cloud provider. Especially for the reliability
and efficiency of Managed Services, data solutions like
Amazon Relational Database Service, MongoDB on AWS,
and Amazon Simple Storage Service that enable us to operate the three hundred terabytes
of users’ health data in a reliable
and secure way, the ease with which
we can orchestrate services and deploy web applications,
and of course, a large community of developers
and guidance around best practices to bring it all together.
Thanks to our collaboration with AWS, we have the technology we need
to accelerate our ability to innovate and to provide new medical
grade services and applications. Following the success
of our fertility application and in line with our vision,
we will soon be launching a non-hormonal
contraceptive solution. We’ve also been able to move
very quickly to address the current
COVID-19 pandemic by developing an early detection
algorithm that works for everybody but is particularly
well suited for women. When COVID-19 hit, we found
ourselves in an unusual position. The Ava bracelet was one of
the only medical grade devices on the market with multiple sensors potentially related
to COVID-19 symptoms such as temperature
or breathing rate data. We quickly realized that we could
utilize our unique clinical knowledge as well as our personalized
understanding of our users’ data to detect anomalies
in body temperature, heart rate,
and breathing rate in a way that could aid in early detection
of COVID-19 infection, even before users realize they’re ill
or they experience symptoms. This represented a huge
step forward in point of care. Providing users
with actionable insights, it can lead them to seek testing or to self-isolate
to reduce transmission risks. In just a couple of weeks,
we developed and deployed a two-thousand-person pilot study,
in Liechtenstein, to collect data and train an algorithm
for early COVID-19 detection. Now, we are ready
to validate our algorithm on a much larger sample
of twenty thousand men and women. Soon, we will be opening recruitment
for this medical trial as part of the COVID-RED
Consortium funded by the Innovative
Medicine Initiative. This project brings together
leading academic and industry experts
in public health, epidemiology, wearable technology,
and machine learning. So, how does it work? Participants will wear
the Ava bracelet for nine months, during which time they will receive
real time daily indicators about their likelihood
of having a COVID-19 infection based on their personal health data
and self-reported symptoms. Our goal is to see
if we can leverage the power of AI
and personal health data to aid in early detection
as well as provide advice on when to seek testing
or other treatment so that medical resources can be
conserved and used more efficiently. The speed at which we’ve been able
to support the COVID-RED Consortium and begin clinical trials
is a great example of what’s possible when a small flexible company
like Ava is backed by AWS, with a scalable, flexible, and well-architected
infrastructure. And it’s exciting because
the clinical work we’re doing with the COVID-RED Consortium
has utility beyond COVID-19. It has the potential to aid
in early detection of other types
of infection as well. Because when we started with Ava,
our goal was to improve women’s lives for the better
by giving them more control and understanding of
their reproductive journey. I am really proud of the work
we’ve done at Ava to realize and expand
on that vision, and even more so to see
our data-driven and scientifically proven insights applied to the global good
of early COVID-19 detection. Thanks to our fantastic team,
our wonderful customers, the partners who have
supported us along the way, and everyone at AWS
who has been part of our journey. Thanks, Lea. That was great. The thing I really like
about Ava’s story is how they consider
themselves a data company, much more than
a device manufacturer. And this really shows
how business and technology can work together to create
something greater than what they were able
to do on their own. Customers like Brainly did a lot
to prepare for events like COVID but still had
quite a few surprises. I talked to them a few weeks ago
for a recent episode of Now Go Build. They’re an online learning system that helps students
with their homework. And even before the pandemic,
they were growing and built an infrastructure
to handle their growth. As schools moved out of the classroom
after the pandemic hit, Brainly saw a dramatic
increase in usage. In the span of four months,
they hit a growth target that they’d originally thought
would take twelve months to achieve. They grew from a hundred
and fifty million users at the end of 2019 to two hundred
and fifty million users in May 2020. I remember their CTO saying,
“I thought we knew our business.” The way students learnt
and worked together before the pandemic
was very different after they started
and attended classes online. Brainly had built
significant automation for scaling up and down based on well-known
traffic patterns they had. But not everything
was automated because their long-term scaling
had been extremely predictable and their systems weren’t built
to handle dramatic spikes in usage. And certain services
like caching and databases were alerting only a few hours before
they degraded instead of weeks. Cost efficiency unexpectedly
scaled better than the infrastructure and they were able
to exploit economies of scale. Their estimation is that
it cost forty percent less to scale to this level
than initially predicted. Another interesting
observation they made is that fewer distractions
and interruptions in the office let them much better
develop productively. Now, today’s developers need
the right tools to be able to work
outside the office. And it’s not just working from home,
it’s working from anywhere. Your tools should all be part
of the heavy lifting of IT. They should enable you
to move fast and get a job done
reliably and securely. But as our infrastructure is changing
and the way systems are being built is changing rapidly
because of the Cloud, we need to update
our tools as well. And many of us do not want to be
tied to our laptops for developing. And especially engineers have become
hooked on Amazon WorkSpaces. Actually, an interesting story is that one of my colleagues
had a laptop crash but without easy access
to a centralized IT helpdesk, he was forced to move to WorkSpaces
as a temporary solution. But he’s so happy now,
he will not go back. Now, it’s typically thought of
as a business productivity tool but it is also ideal
for development from anywhere. And you can make
multiple WorkSpaces configurations with everything set up ready
to go, like your IDE of choice, environment, and code,
and things like that. So, this is the equivalent
of your Cloud Developer Desktop. Makes it easy to work
wherever you are and from any type of device. But AWS Cloud9 is really the next
generation of this experience. Browser based, it limits spend to your needs
and improves responsiveness. If you work on multiple
different projects, you can have different developer
environments for each of those. And it is used
by many teams at Amazon as an alternative to Cloud Desktops.
Cloud9 has the concept of builders, a series of tasks
that built your system, runners, how and where you want
to run your system, and debuggers. What is extremely useful
in Cloud9 is that you can share
what you’re doing live with someone else which enables,
for example, remote pair programming. But because it’s a Cloud IDE
with live sharing features, we saw a significant increase
in usage of Cloud9. One particular area
where Cloud9 has become extremely popular
is in education. For example, Harvard’s online
CS50 course uses Cloud9
for teaching computer science. The AWS Console and other dashboards
are great for exploring services and seeing what they can do.
But if you are as old as me, I prefer to use the terminal
for common tasks. But I also have a bunch
of not-so-great scripts to then set up the environment that I need for
any particular task at hand. So, I’m excited to announce today another way where we’re meeting
developers where they are and giving them
the best of both worlds. Today, we’re launching
AWS CloudShell. It is a new service
built into the AWS Console that gives you access to a
Linux terminal inside your browser by just clicking on an icon. And when you start
a new CloudShell session, it is automatically preconfigured
to have the same API permissions as you do in your console. This means you don’t have
to manage multiple profiles or API credentials for different dev,
test and production environments like you would normally have
if you worked in a terminal. With these credentials
being automatically forwarded, it is simple to start
a new CloudShell session and use the preinstalled
AWS tools straight away. The AWS CLI,
Elastic Container Service CLI, and the Service Application Model CLI
are all standards inside CloudShell. And CloudShell is not just an AWS
command line interface though. It is a fully featured
Shell environment based on a Linux 2 and includes several other
common CLI tools preinstalled,
like Python and notes, Bash, PowerShell, FIM, Git, and so on.
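As a small example of what that looks like in practice, here is the kind of Python script you could run in a fresh CloudShell session, relying only on the forwarded console credentials (a hedged sketch; boto3 can be pip-installed if it is not already present):

```python
import boto3

# No profiles or access keys to configure: CloudShell forwards the
# credentials of the console user that opened the session.
sts = boto3.client("sts")
print("Running as:", sts.get_caller_identity()["Arn"])

# List the account's S3 buckets with the same permissions you have in the console.
s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```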
And just like any other regular Shell environment, you can install
whichever additional tool you need to get a job done.
One of my favorites would be the CDK. Unlike a regular
environment though, changes to the operating system
don’t persist between sessions so you don’t need to worry
about cleaning up anything if you accidentally
break the environment. However, you do have up to one
gig of persistent storage in your home directory
at no cost so you can keep your data
even if you forget to download it or copy it to S3
when you finish working. CloudShell continues that trend
in making the right tools available whenever you need them
from wherever you are. And we hope you find it
a valuable addition to your process of building software
and managing infrastructure. [music playing] Next to the major issues
this year of COVID-19, racial, social
and economic inequalities, battling climate change deserves
the highest priority of our attention. Building our infrastructure
systems and services in ways that minimize the impact
on future generations is our foremost concern. Sustainability is defined
as understanding the impact of materials
and energy we consume, measuring the lifecycle
of those materials, and applying best practices
to reduce the impact of their use. Running applications in a Cloud has been more sustainable
from day one. A typical enterprise data center will have fifteen to twenty
percent utilization at best. The multi-tenant nature
of the Cloud, which allows for efficient resource
sharing, combined with techniques such as the spot market,
enables much higher utilization. According to 451 Research, running
your applications in the Cloud reduces your carbon footprint
of your systems by eighty-eight percent. Sixty-one percent
of the carbon reduction can be attributed
to more efficient servers
and higher server utilization. Eleven percent to more efficient
data center facilities. And seventy percent
can be attributed to reduced electricity consumption
and renewable energy usage. This is why AWS is committed
to running our business in the most environmentally
friendly way possible and achieving hundred
percent renewable energy usage for our global infrastructure. In general, sustainability
in the Cloud is focused on the innovation that our data center
architects continue to deliver, for example, to reduce
energy usage and water usage. But I am frequently asked
by our customers, “But what can we do?”
I believe, therefore, that for the sustainability
of our digital systems, we need a shared
responsibility model as well. At AWS, we will continue
to innovate in building the most
sustainable infrastructure. But I think it’s time
we start thinking about what we can do
as application builders to get better control
over resource usage and start making sure that we
can meet the sustainability goals our businesses have set. And because AWS charges
by consumption, many times reducing
your cost also has the impact of reducing
your environmental impact. But I have really seen
several cases with our customers where they are willing to spend more
to meet their sustainability goals. The best solution architects
are now available to help our customers
with how to analyze software and hardware patterns
for sustainability and the sustainability of your
development and deployment processes. And not just on
the architecture side. They also provide guidance
with respect to business decisions
which could be made, such as
trading the quality of results to use fewer resources. Factors
such as availability, response time, or even accuracy
can all be adjusted
to use fewer resources, in yet another example
of how we as builders need to build controls
that enable the business to make those sorts of trade-offs. In addition to the climate pledge and broader sustainability
goals at Amazon, I’m excited that we’re
taking these steps to help our customers learn
how to be more sustainable. And one of the interesting
recent developments in server design that may help you meet
your sustainability goals is the introduction of the ARM
processor for data center compute. One of my colleagues, James Hamilton,
distinguished engineer at Amazon, once challenged me to count
the number of ARM processors running in my office. I believe I came to seventeen.
From printers to cameras. From home automation devices
to mobile phones. Everything was running ARM. Only my laptop
was running a non-ARM CPU. At AWS, we started developing ARM
based processors years ago, before running it for real compute
workloads was a mainstream idea. Initially, we focused
on non-customer facing technologies
such as our storage engines. But quickly we started
to realize that the tremendous performance
and price advantages could be had by our customers if we just would bring ARM
to the compute engines as well. We had already seen
that the popularity of ARM-based platforms
such as the Raspberry Pi ensured that many Linux distributions
and packages would support ARM. Now the recently released AWS
Graviton2-based instances make it easy to run ARM applications
on the cloud performantly. These processors
are ARMv8.2 compliant, have 64 physical cores,
and use 7 nanometer manufacturing with 30 billion transistors.
There are ARM variants of many instance families,
including our memory-optimized, compute-optimized, and general
purpose instances. We’ve found
that high level languages such as Node.js,
Python, Java, .NET Core, .NET 5, typically run without problems. Amazon ECS, Amazon EKS,
Amazon ECR, CodeBuild, Docker all have support
for Graviton2 instances. I see migrating to Graviton and ARM
as a long-term business investment. ARM processors have lower costs
and a better price-performance
ratio; for example, the Graviton2 has 20% lower cost
this significant of an improvement just by changing
to a different processor? Now, whether lifting
and shifting will work is in general dependent
on third party applications and related dependencies.
Even if your code will run on ARM, it doesn’t mean
that the Python library that you’re using
which may actually be a wrapper around some C code, will. To be successful, build pipelines
and container clusters need to be created
and managed separately for ARM-based
containers and binaries.
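If you want to test the waters, one low-friction option is simply to launch a Graviton2-based instance and run your build and test suite on it. Below is a minimal boto3 sketch; the AMI ID is a placeholder for any arm64 image, and networking defaults are assumed:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder AMI ID; substitute any arm64 image, for example Amazon Linux 2 for arm64.
ARM64_AMI = "ami-0123456789abcdef0"

# Launch a Graviton2-based, general purpose instance for an ARM trial run.
response = ec2.run_instances(
    ImageId=ARM64_AMI,
    InstanceType="m6g.large",
    MinCount=1,
    MaxCount=1,
)
print("Launched:", response["Instances"][0]["InstanceId"])
```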
ARM is definitely a platform with a significant innovative future ahead. And I advise every business
to investigate whether they are willing to make
the investment to migrate. Being a builder means
a lifetime of learning and it is one of the things
I truly love about being a developer. There’s always a new language,
a new library, an open-source project
or a new service that will make your head spin
and look at development differently. I believe we have the most
amazing jobs in the world. We have the opportunity
to create something new every day or to improve something small
that makes you smile – and proud. And to truly delight
your customers in ways that really puts
a smile on your face. That’s why a lot
of the hundred days of code in the early days of the pandemic
challenging you to learn something new.
I learned a new language – Rust. I had worked with teams
inside AWS that were using Rust
because of its safety properties, and it blew my mind
to be able to do system-level programming in a truly modern language. I will talk a bit more about
Rust later. Something that saw a tremendous
rise in activity was the Amazon Builders’ Library. It helps developers benefit
from 25 years of developing secure
and scalable systems at Amazon. Builders clearly took the opportunity
to dive deep on hard topics and educate themselves when otherwise
they may not have had time for it. AWS Academy provides
higher education institutions with a free, ready-to-teach cloud
computing curriculum. And one of the programs
I’m really proud of is AWS re/Start. It focuses on cloud skills
for unemployed and under-employed individuals
including military veterans, their families
and young people. We’re also working with other
organizations like the ReDi School who has helped refugees
learn coding skills that will either lead
to employment, more education or even creating
start-ups of their own. And by investing in programs
like this we are dedicated to educating
a diverse group of future builders who have lived
through this pandemic and will be building better
because of it. [music playing] Almost every year at re:Invent
I’ve talked about how over the years we’ve implemented
fault tolerance at Amazon and based
on those experiences, what best practice advice
we can give you to meet the fault tolerance
requirements of your application. But today I would like
to take a step up and talk about how fault tolerance
is only one of the means to achieve the more encompassing
concept of dependability. Dependability is when the delivery
of a service can be trusted with the avoidance of unacceptable
frequent or severe failures. Now what is unacceptable
is a business decision, not a technology one. If certain failures or rate
of failures are acceptable, then the system
is still dependable. Dependability has many properties –
availability, reliability, safety, confidentiality, integrity,
maintainability and observability. If you think about
how to achieve dependability, there are several means to it.
It’s not just fault tolerance,
are fault prevention, fault removal and fault forecasting.
We often talk about other properties related to dependability as well.
Concepts like robustness. Can your system be dependable
in the presence of external faults? Is the system survivable? Can it handle active
in progress faults? And resilience.
Can the system deal with changes? And resilience
is a unique property because it deals with three
different types of changes. There are foreseen changes like when you launch a new version
of your system or foreseeable changes like when you move
to a new hardware platform. And then there are
unforeseen changes like when the number
of service requests explodes beyond
what you had planned for. Now in years past,
I’ve talked quite a bit about how to build evolvable software
and how that deals with changes. I remember I talked to you
about how we built S3. We knew that the system
under the covers would need to change either with new orders
of magnitude of scale or with the introduction
of new features. Evolvability deals with all
the properties of dependability. A less commonly known approach
to building dependable systems is the concept of diversity which is basically using
two or more components that have the same functionality, but have been built
in isolation from each other. For example, here at Sugar City,
there are two generators. We’re going to take a look
at them in a minute. Diversity means that each generator
would have been sourced from different manufacturers,
but again this is a business decision based on whether failures would be
unacceptably frequent and/or severe. One of the interesting things
I’ve learned about this factory is that it only processed sugar
about three months out of the year. This was to coincide
with the sugar beet harvest. They prepared all year
for the three months that farmers would bring the sugar
beets here to be processed. It was so busy that workers
had to bring in mattresses and sleep in the corners to have enough time
to get everything done even though it was regularly
40 degrees Celsius which is well over
a hundred degrees Fahrenheit. Downtime during this three months
would be catastrophic. Whoa, there’s quite an echo here. Although you can also
still hear a hum. Even this derelict factory
still feels it’s alive. So electricity was crucial
for the factory. In this turbine hall, the two generators were used
to produce electricity that powered the entire factory.
Unfortunately in those days and today and in the future,
faults are a given. As are changes in the system
or impacting external factors. In today’s world
of digital technologies, much of the thinking
that built dependable systems and processes in those days
is still applicable today. [music playing] Just like a sugar factory,
every online retailer knows that just a few months
out of the year can be important to have
a thriving business. A profitable holiday season
and the launch of a new product are the times that are most critical
to the success of a company. That’s why I’m excited to have Nicole
here from Lego to talk to us today. Nicole is an engineering manager working on Lego’s
direct shopper technology team. She has a great story
about the resilience challenges they overcame after
the application crashed during the launch
of a really cool Lego set. Thanks Werner. I’m very excited
to represent the direct shopper technology team
at the Lego Group here today and take you on
our serverless journey so far. At the Lego Group our mission
is to inspire and develop
the builders of tomorrow by becoming a global force
for innovating and establishing
learning through play. As an almost 90-year old company, we have consistently innovated
and reimagined our approach to achieving our mission
and technology has been a key enabler
for getting our message out there. We use a variety of services
across the company each fit for the purpose
that they were selected for. I’m going to talk about the journey
of the team behind Lego.com, specifically the pages
where you’re browsing for products, redeeming VIP rewards or completing
your order through checkout. We are the direct shopper technology
team with engineers in the UK, United States and Denmark. Our main challenges typically
to the world of e-commerce, the traffic patterns
are extremely spikey and with regular product
launches and sales events driving customers to the site
all at the same time. This diagram shows
our typical traffic patterns on one of our busier sales days and every year the number of visitors
to the site keeps growing. Now imagine trying to tackle
all of that spikiness and year on year growth
with an on-premise monolith tied to backend systems
with limited scale. Back in 2017 we had
a highly anticipated sales event for the Millennium Falcon set,
the biggest Lego set at the time and an extremely popular
product line. On the launch day we experienced
a huge spike in traffic that resulted in our backend
services being overwhelmed and all our customers could see
was the maintenance page. The service that failed the hardest
was a small piece of functionality that calculated sales tags. It made a call back to our
on-premise tax calculation system which very quickly
reached its limits. At that point we knew that
we were on a trajectory for growth that could no longer be sustained
with an on-premise system. There were three key drivers
that took us to the cloud. Renting the commodities
meant that, instead of maintaining infrastructure that was not a differentiating factor
for the Lego Group, we could
instead focus that energy on building
awesome shopper experiences and you saw the profiles
that I showed before. Having that flexibility to scale
to support a very spiky demand profile and having the exact capacity we need
when we needed it was critical. And finally having
a composable architecture down to the most granular levels
enabled us to have speed to market and also flexibility to keep
innovating and pushing boundaries. Here is a high-level view
of where we are today. We made the conscious decision
to extract and focus
on our business logic and decompose that across several
layers of serverless services. We batched them by carefully
selected third party vendors who provide specialized services
like payments providers and content management systems. Each of our layers is designed
to scale automatically and independently to support
an ever-changing traffic profile and this design also allows us
to handle multiple squads operating across all parts
of the site concurrently and you’ll see why
that’s important in a moment. Our journey to the cloud started with migrating a single
user facing service, the one to calculate sales tax and three backend
processing services back in 2018 just to show that serverless
could work for us. Ten months later we then
matched our existing capabilities with a completely
serverless platform that immediately started handling
the same level of traffic and transactions
as our existing platform. We then immediately
started exceeding those rates of transactions
and traffic and setting new records
every few months. We started off this year
with an ambitious roadmap, a growing team and a platform
that was only a couple of months old and then the question came, could we deliver on
that ambitious roadmap with twice the number of engineers
all onboarded remotely and keep the platform stable all while handling high season
levels of traffic? The answer was yes. And not only have we doubled
the number of services in production, we’ve managed to do so while handling
increasingly busy sales periods. To put some numbers to the growth
we experienced in the past year and a half, we now have three times
the number of engineers in the team and we’ve since launched
another 36 serverless services, pushing the number of lambda
functions in production over 260. The growing team meant that
we had to distribute many tasks previously held centrally
by an infrastructure team, so automation has been key
to supporting the ever-growing number of squads
and application engineers to get their services
and features into production, not only at their own pace,
but safely. We’ve moved to a self-service model
where possible, for example, creating a script
that creates standard integration and deployment pipelines
for any new service. The ultimate goal is to develop
our application engineers into DevOps engineers who own and operate
their services and production. One of our first steps
towards that goal was to introduce a standard
where all serverless services were to implement canary deployments
using AWS CodeDeploy and this gives the automatic
rollback when necessary. And that leads me on
to serverless operations. We have focused on observability
and our fan-out operations model where our on-call team monitors
some key high-level metrics centrally and then each service
has been categorized based on how critical it is
to our key user journeys. We then have a default
set of alerts that are to be implemented
on each service and tuned to the profile of that
service by the squad that owns it. This is giving
our engineering team a starting point of how to monitor
their services in production and not only detect, but react to issues that are
happening in this space quickly. The growing team means we can’t
have tacit standards anymore. The team is made up of engineers
at all stages of their careers and with differing levels
of experience with the technologies
we’re using. I’ve mentioned already
a couple of standards that we’ve set and they define
what a good service means to us. They define the hallmarks
of a good service that not only uses the latest
coding practices and patterns
that we’ve developed over time, but also those guidelines
for safer deployment and monitoring of services, so that
they’re easier to own and operate. We have started comparing
our standards with the AWS Well-Architected Framework and
specifically the AWS Serverless Lens and this has shown
that they mainly fall in the operational
excellence pillar with a little bit in cost
optimization and a bit in security. And so looking to what
we’re focusing on next, we want to define the standards and the remaining reliability
and performance pillars and then we can start looking at
making the standard more visible. We want to show our engineers
the services that they own, what state they’re in,
all in one place. And then we can start adding on
things like maybe a leaderboard. This should add in that
competitive element, so that owning a service becomes
a point of pride within the team. And then we can start playing with the next level
of serverless operations. We have chaos engineering
on our roadmap and this should enable the team
to really break apart their services, test out the failure cases
for an entire third party and that means that we can craft
our shopper experiences to still be awesome even when part of
the platform disappears for a bit. It’s hard to imagine
that we only started our journey to the cloud back in 2017, but time flies when you’re having fun
and we’ve learnt a lot along the way. The name Lego is an abbreviation
of two Danish words – leg and godt, meaning ‘play well’ and when play is at the core
of everything you do, the learnings
and opportunities are endless. Thank you. Thanks Nicole. What is great
about the Lego story is that in addition to providing
a more stable platform, migrating from a monolithic design
to serverless also freed up engineer time
to work on new features. Like Lego, Zoom is
another great example of how a company used
smart application design to scale quickly as they were flooded
with traffic during the pandemic. Just like Brainly they had
well-understood traffic patterns and even relied
on manual scaling on AWS because their traffic
was so predictable. But after the pandemic hit, they quickly experienced
a 30x growth. One of the really smart decisions
they made before the pandemic was to split their application
into different services. They broke the part
of the applications that handled the management
of meetings like authentication and scheduling
and starting a meeting from the part of the application
that streams video. Meeting management
is a pretty light task and isn’t dependent on latency. On the other hand,
streaming video requires a significant amount
of bandwidth and compute power and it needs
to be close to the users. Before the pandemic,
the management component had sometimes more traffic
than the streaming part had. After the pandemic hit,
they relied on the scale of AWS and our ability to keep up with
capacity needs to add media servers to handle the video and streaming
for people around the world. Over the last several months
there were times when their engineers were adding 5,000-6,000
servers a night on AWS and just like Brainly, Zoom
also saw customer behavior change because of the pandemic. When they started out, Zoom primarily
provided services to business users. Today they see significant usage by
individuals in education institutions. And this affected the design
of their application and how they approached scaling. Take education, for example. Before the pandemic,
only a few teachers at a school might be using Zoom at the same time. Today, in a district
that is fully remote, every teacher is likely to be
running a meeting in parallel with other teachers. And these type of changes meant that
Zoom was regularly hitting new bottlenecks and challenges
that they hadn’t seen before. But because of their architecture,
they were able to quickly address them and even add many new features
while handling extreme scaling. By following best practices
and decomposing their system into separate components
that each had unique scaling, performance
and security requirements, Zoom designed a system that can adapt
to these unexpected changes. And they’ve been able to succeed
and provide a service that is essential to their customers. If you follow cloud best practices
around microservices, load balancing,
multiple availability zones, you’re likely in great shape from an
application architecture perspective. But what about the application code
that actually runs within the system. When it comes to dependability,
a step that is often missed is the logic of the application
code itself. Now almost all developers today
write unit tests and provide code reviews to look
for errors as they’re developing. But at AWS we take this a step
further and use tools based on mathematical logic to prove
that our code does only what we want it to do
and nothing more. This is hard.
It’s time consuming and something that most businesses
don’t have the ability to invest in. For example, one of our teams
has recently published a paper that outlines how we prove
the correctness of the KMS protocol. When we say that we have the most
secure, global infrastructure, this is rooted
in the continuous work we do to demonstrate
this claim with mathematical rigor. We do this to answer
the most important question – are we really
protecting our customers? I want to talk to you
about this more, but before I do, I want to tell you
how I think about reasoning. Reasoning is the actual thinking
about something in a logical, sensible way. As engineers we reason
about our problems all the time. We design systems to solve complex
problems with many variables. Again, look at this factory. To produce sugar
there’s a multi-stage process. Farmers bring beets to the factory
where they then are washed with high-pressure cannons.
Then they’re sliced into small strips before being combined with hot water
during a process called diffusion. The diffusers create a sort of juice
that’s put in an evaporator and spun in a centrifuge
to separate the sugar from the water. After all this happens the sugar
is packaged and shipped. Now this process
is not all that complicated. It has a distinct start and end and the steps in between have to be
followed exactly in that order. If you put it in a flowchart
it would be pretty simple. And when Jean Baptiste invested
this process in the early 1800s he had to reason out each stage
based on the input it had and the output it created and what
was required to get to the end stage. This is the same process you’re using
when you’re developing software. You create functions,
each with an input and an output and they pass information
to each other. And when your program is small
it’s pretty simple to see how data is going to move
from one function to another. You could easily create a flowchart
to describe all the possible paths your application logic will take. But in reality, most applications
aren’t that simple, or if they are,
they don’t stay that way for long. In most applications there are hundreds
to thousands of unique functions, each taking a different
set of inputs and creating an output
that is passed onto another function and instead of always being
followed in a specific order, functions can call other functions
depending on what they’re asked to do or the specifics of the data
that was passed to them. Now imagine the complexity
of the flowchart that you would need to build
to represent all the paths
that your application could take. There’s no way that a human
can possibly know all of this and that’s where the field
of automatic reasoning comes in. Automatic reasoning can also be
called computer-aided development. Automatic reasoning
comes in many forms and it is simply the process
of using computers to prove anything that can be described
with complex logic. In fact, this is one of the first
major use cases for early scientific computers. Some of the first programs
ever written were attempting to find proofs
in mathematical logic. One form of automatic reasoning
is formal verification. Formal verification involves
converting the logic of an application
into a specification and then mathematically proving
that the spec does exactly what it is
supposed to do – no matter what. It’s very expensive and difficult
and time-consuming. It only makes sense for applications where the cost of failure
is truly devastating. Organizations like NASA
use formal verification to ensure that space missions that
cost hundreds of millions of dollars and risk people’s lives,
don’t fail because of software bugs. Intel invested heavily
in formal verification after the floating point bug in early
processors caused a massive recall. Today, all the leading
chip manufacturers including AWS develop processors
using formal verification tools. Another area where we’re
using formal verification is in one of my favorite
services – Amazon S3. When we pioneered S3 in 2006
we built it because we knew customers
wanted storage for the internet. Our top design priorities in 2006
were elasticity, reliability, durability
and performance because that was what mattered
to our customers. Given that nobody had ever built
anything of this scale before, we relied on fundamental
distributed systems concepts to make sure
we could meet those guarantees. In fact I think S3 is the only
product we ever launched with a list of distributed systems
concepts in the press release. But with the knowledge that failures
of components big and small are a given,
remember everything fails, all the time, we need to balance
the trade-off between availability and consistency based on the now
infamous CAP theorem. Given the applications
we initially targeted, we decided that availability,
always being able to access your object, had a higher priority than getting
the latest update to your object. And there could be
a small window of time between when you updated
your object and when a read request would give you
the updated data. The data was stored
reliably and durably, but the metadata
was eventually consistent. As S3 has evolved into a more
generalized storage engine, it is now used for many real-time
processing applications like for example,
analytics on top of a data lake. The majority of S3 access now
is machine to machine instead of human to machine. Now it turns out that machines
have a much harder time reasoning about eventual consistency
than I ever anticipated. And there is the law
of preservation of complexity. If you need strong consistency on top
of an eventual consistent system, you need to do the work. Over time, we learnt that quite
a few of our customers were building that complexity
into their applications to achieve strong consistency. We knew that if we did
that work for them, life would be a lot easier
for our customers. To achieve that, we had to introduce
new protocols for cache coherence and distributed systems communication
in the request path for S3’s metadata storage. This allowed us to ensure
that every S3 index storage service always has the latest metadata and
was in sync with the cache servers. And we had to introduce all of this
while continuing to run S3 without impact to availability
or performance. S3 had to evolve
into the next generation without our customers
being impacted in any way. So while we did traditional
performance, integration, load, and unit testing,
we knew there were some edge cases that this would not
be sufficient for. For example, when read operations
race with write operations at very high concurrency
on a versioned item, the system can enter
edge states leading to eventual consistency again. And this is where
formal verification comes in. We relied heavily on it and explored
millions of combinations of states
through formal mathematical models. We would never have been able
to prove to ourselves that we covered all edge cases
without formal verification. And because of all that we are
now able to confidently announce that Amazon S3 delivers
strong read-after-write consistency for all storage requests, without changes to performance
or availability, without sacrificing
regional isolation for applications, and at no additional cost.
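In practice, that means a pattern like the one below, a minimal sketch with a hypothetical bucket name, now behaves the way you would intuitively expect, with no retry loop to wait for the metadata to catch up:

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"          # hypothetical bucket name
key = "orders/latest.json"

# Write a new version of the object.
s3.put_object(Bucket=bucket, Key=key, Body=b'{"status": "shipped"}')

# With strong read-after-write consistency, the very next GET (or LIST)
# reflects that write; there is no consistency window to program around.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
assert body == b'{"status": "shipped"}'
```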
The field of automated reasoning is having an impact on AWS services in many other ways. I’ve always talked about security and you’ve heard me say
it every year, “Encrypt everything.” Security is the most
important thing that we can be focused on
for our customers and I think it’s something
that to be on top of the mind of every customer themselves as well,
no matter what they are building. When you are developing
something new, it is possible to build a system
that unintentionally allows access you didn’t intend.
with this?” And we were making
significant investments into security-focused tools
that use automated reasoning to help detect security gaps. One of these is the recently
announced VPC Reachability Analyzer. In AWS it’s amazingly simple
to set up a network. It is truly something
that impresses me to this day. And if you need to isolate
different parts of your application
from other parts, or from the internet,
I can easily do that. I can create a network ACL,
a security group, or a subnet to isolate
these parts from the network. But just like my application logic, the logic of a network
increases in complexity the more work you do on it. Most workloads on AWS of any
complexity have many subnets, have many network interfaces,
have many security groups, and often all these things
reference each other. If you’re trying to trace the path
from one end to the other, it quickly looks like
the flowchart logic problem
application earlier. And it’s why we built the Amazon
VPC Reachability Analyzer. The Reachability Analyzer uses
the same automated reasoning processes to look
at your AWS configuration without sending a single packet. It can tell if your system
is doing what you want it to do. You can use it to troubleshoot why you can’t reach
one server from another. With the Reachability Analyzer
you only need to tell us what rules you want to validate and we automatically build
the logic to test and verify it.
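For example, a check like the following could be wired up with a few boto3 calls. This is only a sketch, with hypothetical instance IDs; the path could just as well start or end at an internet gateway or a network interface:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical endpoints: can we reach instance B from instance A on port 443?
path = ec2.create_network_insights_path(
    Source="i-0aaaaaaaaaaaaaaaa",
    Destination="i-0bbbbbbbbbbbbbbbb",
    Protocol="tcp",
    DestinationPort=443,
)

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
)
# The analysis reports whether the path is reachable and, if not, which
# security group, route table, or network ACL blocks it.
print(analysis["NetworkInsightsAnalysis"]["Status"])
```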
This is just one of the many tools we offer to customers that are powered
by automated reasoning. And the more complex
your applications grow, the more important tools
like this are becoming. What all these tools
have in common though is that we make it easy
to define rules and test them because they are based
on configuration. The S3 Block
Public Access feature allows you to configure
bucket permissions as long as they don’t provide
unrestricted access to the internet.
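Turning it on for a bucket is a single, declarative call; a minimal sketch with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; this enables all four block-public-access settings.
s3.put_public_access_block(
    Bucket="example-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```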
AWS IAM Access Analyzer validates
that your IAM policies only allow what you intended. With these automated reasoning
powered tools, you tell us both what the rules or policies are and
what you want to verify. All of this is powered
by technology developed by the AWS Automated Reasoning Group,
a technology called ‘Zelkova.’ Zelkova translates policy
into precise mathematical language and then uses automated
reasoning tools to check
their properties. These tools include
automated reasoners called ‘Satisfiability
Modulo Theories,’ SMT, which automatically prove
or disprove formulas over constant strings,
regular expressions, dates, and IP addresses. Zelkova makes broad statements
about all resource requests because it’s based on mathematics
and proofs instead of heuristics,
patterns, or simulation. Next to the services we already
discussed, it is used in AWS Config, AWS
Trusted Advisor, Amazon Macie, Amazon GuardDuty, AWS
IoT Device Defender, and more. One of the most important
benefits is that users can leverage these tools to help
identify gaps in the design phase before any data is exposed. Furthermore,
this technology is automated. We are removing manual
human intervention to help provide
greater scalability. And finally, it is provable.
By converting the problem into a logic-based
mathematical equation it can be proven
with mathematical certainty. Now I would like to change it up
for a minute. I will talk more about how and why we build the things
the way we do at AWS. At Amazon we have a long history
of allowing our engineers to pick the tools that work best
for the problem at hand. We have a number of centralized
supported languages and tools, but we do not force them on anyone. For example,
when we needed to do fast UI prototyping for Amazon Fresh, Ruby on Rails was the best tool
for that even when we knew it wouldn’t be the language
used to scale it up. Marc Brooker gave a great talk
at re:Invent called ‘Creating
Technology Standards at Amazon,’ where he talks
about the trade-offs between using what’s familiar
and safe, Java or C++, and what may be new and risky. He makes an important point here
that’s worth paying attention to. The time to develop software
is relatively small compared to the time
your software will be running and needs to be maintained. As such, when the team
decided to develop the next generation
of a load balancer in Rust, it was not about what was
easy and comfortable, which would have been Java or C++, but what would be better for
the long-term success of the system. Next we are having
great safety features and no garbage collector,
the main driver for choosing Rust was that it’s a great language
for using automated reasoning and formal verification
of the application logic. We continue to improve
our reasoning tools and because of that the team preferred
to take the approach of learning a new language
to build the foundation with the support
of automated reasoning. This allowed them the benefit
to prove that our code was doing
exactly what they intended. And because of the ability
to enable verification, Rust has become
extremely popular in AWS and more and more of our systems
are written in Rust, especially those systems required to build the dependable, secure services for our customers. Another area of dependability deals with testability. In essence, what if there are faults in your input data? How are you dealing with faults in general? The fault-handling parts of your code are probably
not frequently exercised. So, how can you be sure
that your application will continue to be dependable, regardless of what you throw at it? Similar to automated reasoning, fault injection allows you to discover these unknown unknowns. One way this is done with application
code is fuzz testing, or fuzzing. Fuzzing works by sending bad data
to your application in a way that tries
to make it crash. Imagine a form on your website, an API that accepts data, or even a mobile phone application:
what would happen if your application was sent malformed data
that causes errors? Or what if this happened maliciously? This is exactly what happened
in the case of Heartbleed a few years ago. The OpenSSL library was making the assumption that during a heartbeat request
the server should send
exactly what was requested. This bug was originally
discovered by fuzz testing because the fuzzer
made a malformed request and the offending function
happily returned it. And because of things like this, we are using fuzzing extensively at AWS. We use it to test both our APIs and our code itself.
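As a rough sketch of the idea (not the fuzzers we use internally), here is what a naive mutation-based fuzz loop can look like; the parse_request function is a made-up stand-in for the code under test.

```python
# A naive mutation-based fuzzing sketch; real fuzzers are coverage-guided,
# but the core loop is the same. parse_request() is a made-up stand-in for
# the code under test.
import random

def parse_request(data: bytes) -> None:
    header, _, body = data.partition(b"\n")
    length = int(header)           # malformed headers raise ValueError (handled)
    assert len(body) == length     # a lying length is the kind of bug fuzzing surfaces

def mutate(seed: bytes) -> bytes:
    data = bytearray(seed)
    for _ in range(random.randint(1, 8)):
        data[random.randrange(len(data))] = random.randrange(256)  # flip random bytes
    return bytes(data)

seed = b"5\nhello"
for i in range(10_000):
    fuzzed = mutate(seed)
    try:
        parse_request(fuzzed)
    except ValueError:
        pass                        # expected, handled error path
    except Exception as exc:        # anything else is a finding
        print(f"crash #{i}: {fuzzed!r} -> {exc}")
```

The fuzzer does not need to understand the protocol; it just needs to be relentless about feeding the code input it was never designed to expect.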
But even if your code is perfectly written, hardware fails, packets get dropped,
and unexpected traffic spikes happen. Even well-built
cloud applications today can depend on dozens of services and components, all working together. And failures can happen
at every level. To understand how your
application will behave when these events happen, the best form of testing
is chaos engineering. Now this isn’t a new idea and has
been around for quite a while. Netflix popularized
this concept many years ago and the tool they built
was called ‘Chaos Monkey.’ The goal of chaos engineering
is to understand how your application
responds to issues by injecting failures
into your infrastructure, usually running against production systems. Chaos experiments can include generating a baseline traffic load against the system, adding latency into all database calls, and then validating timeouts and retries.
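To make the shape of such an experiment concrete, here is a minimal sketch in Python; the query_database function, the injected 150 milliseconds, and the 200-millisecond budget are assumptions for illustration, not the FIS service itself.

```python
# A minimal chaos-experiment sketch: inject latency into database calls and
# check the hypothesis that callers still meet their timeout. query_database,
# the 150 ms injected delay, and the 200 ms budget are assumptions.
import random
import time

INJECTED_LATENCY = 0.150     # the fault: +150 ms on every database call
TIMEOUT_BUDGET = 0.200       # the hypothesis: calls still finish within 200 ms

def query_database() -> str:
    time.sleep(random.uniform(0.010, 0.040))    # normal backend latency
    return "row"

def chaotic_query() -> str:
    time.sleep(INJECTED_LATENCY)                # injected fault
    return query_database()

# Baseline load, fault injection, then validate the hypothesis.
slow_calls = 0
for _ in range(100):
    start = time.monotonic()
    chaotic_query()
    if time.monotonic() - start > TIMEOUT_BUDGET:
        slow_calls += 1

print(f"hypothesis {'held' if slow_calls == 0 else 'failed'}: "
      f"{slow_calls} calls over budget")
```

If the hypothesis fails, that is exactly the signal you want: it points at the timeouts, retries, or capacity you need to fix before a real event does it for you.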
And unlike automated reasoning, we believe that chaos engineering is for everyone, not just shops running
at Amazon or Netflix scale, and that’s why, today,
I am excited to preannounce a new service built to simplify
the process of running chaos
experiments in the cloud. AWS Fault Injection Simulator
is a fully managed chaos engineering service
that makes it easy to discover vulnerable parts
in your applications. AWS FIS helps developers easily
set up and run controlled chaos engineering experiments across
a range of AWS services. FIS gives you the ability
to inject faults such as latency, or failure of underlying compute, networking, databases, and more. That includes control-plane-level faults such as API throttling and server errors that weren’t previously possible for customers to inject in the cloud. And FIS makes it easy
to run safe experiments. We built it to follow the typical chaos experiment workflow, where you understand your steady state, set a hypothesis, inject faults, and monitor your application. When the experiment is over, FIS will tell you if your
hypothesis was confirmed and you can use the data
collected by CloudWatch to decide where you need
to make improvements. FIS removes the barriers
to adopting chaos engineering. I see a lot of benefits
for incorporating chaos engineering into your business. From running game days
or incorporating chaos experiments
into your CI/CD workflows, most people think about chaos
engineering for resilience, and that’s true, but it’s also about the performance improvements you can make. It is about the blind spots that you
can catch in your monitoring. And perhaps the most underrated
is the experience your teams will get learning how to respond
to infrequent but critical events. Mean time to resolution is not just about
your architecture and automation, but it is also about
the operational muscles that you build
and exercise over time. There is no better way to test
your system than chaos engineering. And with FIS you won’t need
to be an expert to incorporate it
into your organization. AWS Fault Injection Simulator
will be available in early 2021 and will include the ability
to run fault actions against services such as EC2,
RDS, ECS, EKS and more. Out of all the things
we’re announcing this year, this is the one
that I am most excited about. By offering this as a service, I believe that we are going to have
a massive positive impact on building more robust,
more durable, and more dependable systems
in the cloud. [music playing] When we talked about
the history of dependability, I mentioned systems theory. A related field of that
is systems control theory. It has been crucial in building
many dependable industrial and other systems. The most important pioneer
in this field was Professor Rudolf Kálmán. In 1960 he defined concepts
such as whether a system was controllable and observable, or actually unobservable. To be observable is something we know
as software engineers all too well. How can you infer the internal state
of a system from its outputs? These can be functional outputs
like voltage and amperage of the turbines,
or nonfunctional output like the turbine temperature sensors
or rotation speed measurements. And this is what we try to achieve
with the observability property of a dependable system. How can we infer
the internal state of our digital systems
from its outputs? This can include
both functional replies from API
calls for example, or requests to other parts
of the system, and nonfunctional information
that we collect through other means. We’ve been monitoring systems
for years. And I remember the earliest
monitoring tools came with Unix, VMStat, Syslog, Netstat,
et cetera. And it wasn’t until the late 90s
that tools like Nagios and RRDtool became popular. Monitoring is for operators.
Just like this factory, an operator would stand
in front of this dashboard to watch for a gauge getting close to the end of its range, or a light starting to flash
whenever there was a problem. Monitoring means that you
already know what is important. You think you have all
the data you need, and you are just watching it and get
an alert when it goes out of spec. And this was possible
for a factory like this because, relatively speaking, there wasn’t a lot
they could monitor. The generators were
the most important thing, which is why all
the dashboards were there. The equipment that was
processing the beets
who worked with it day in and day out and could tell when it wasn’t cutting
as well as it had before, or if it started to hum
in a different way. A factory like this is a living,
breathing thing, and when your workers
spend literally all day here, they know when something is not right and when it is going
to break before it does. It’s a bit like
the experienced car mechanic that just needs to listen
to your engine to tell you the crankshaft
is about to go out. They have a pair of magnificent sensors, of course, and a lot of experience in how to interpret the output and thus infer its internal state. However, if there is a car
he hasn’t seen before, or a problem sound
that he has never heard, his experience
doesn’t help that much. But in our world what happens
when everything is automated, and you’re not working with the same
equipment day in and day out, or when your ears and eyes
are not sufficient? Classical monitoring deals
with two questions. What is broken
and why is it broken? Monitoring uses a predefined
set of metrics and logs to determine
known failures. In general with monitoring
and alarming, you can’t predict
when things will fail. You can only take action
when they do. And it’s why we used
to call people who monitored these systems
‘operators,’ and not engineers. The generators here at Sugar City
are extremely complex. It is unlikely that an operator
of the dashboard knew how to repair it
when something went wrong. But they did know when an alarm sounded or a gauge moved to where it wasn’t supposed to go. And systems continue
to increase in complexity. It is impossible to put every
important metric for that system on a single dashboard
that the user watches. Think about everything that
goes into a modern application. There are metrics for the servers, containers, and functions
that you are managing. Your application has counters
and logs for all the work it’s doing. You may have anywhere from thousands
to millions of customers, all of whom have data
about what they are doing and how they are interacting
with your application. It is impossible to put all of this
on a dashboard that a human watches, or to define alerts
for each of these metrics to tell you
when they’re going out of spec. At Amazon we’ve been
on a 25-year journey to improve the processes
of managing our systems. And we’ve long left behind the notion
that just monitoring was sufficient to manage our systems. We’ve embarked on a holistic
approach to operations from collecting
massive amounts of data and logs, to how we analyze them
to how we solve and talk about problems
when they do happen, and this is what observability
is all about. How can we make sure
we have the data, the tools, and the mechanisms, to quickly resolve problems
in a fundamental way? How can we without
reaching into the system infer internal state
from the data that we have? At Amazon, our most important drivers
have always been customer centric. Find and resolve problems
before they impact customers. Understand the impact
on your customers where you couldn’t prevent it
and fix the problem so that it never happens again. [music playing] Observability centers on three
enabling technologies, metrics, logging, and tracing. All three are important
but serve different purposes, but I do think logging is the most
fundamental one to get right. Back at the first re:Invent
I made a bold statement. I told you to “Log everything”
and I meant it. Log everything.
Logs are the source of truth for what’s going on at any given
moment inside your infrastructure. At Amazon, every service
from the hypervisor to the network gear generates logs
that are indexed, compressed, and stored durably. We have gone through a few
different log collection systems at Amazon over the years. In the early days we were
mounting NFS servers and writing directly to logs
across the network. And we were digging through
those logs with bash, sed, cut, and awk. It was painful and time consuming.
It was painful and time consuming. But it was the best option
at the time. And as we grew and matured
we built a custom client that stored logs in Amazon S3. And this made it easy to get logs off our servers and store them durably. To search for errors and to generate ad hoc metrics, we queried those logs with
a distributed Hadoop cluster. And this was significantly
more scalable than a mounted file system
parsed with Linux commands. But every system
grows in complexity. Our largest applications are made up
of hundreds of microservices, each potentially generating
terabytes of logs a day. Using Amazon EMR to query the logs
for those services could take hours. So, we took everything
we had learned so far and we created CloudWatch logs. With it, you have
a single pane of glass where you can view all logs
from every service, every container, and every application
that you monitor. You can search your logs
for specific error codes, filter them based
on different fields, or create alarms
for different conditions. It’s a powerful service
but it’s still just the first step. When building an end-to-end
observability system you need logs,
metrics, and tracing. If your logs are the source of truth, the metrics that you monitor and graph should come from your logs. One of the most challenging things for software developers, however, especially if they are disconnected from operations, is to create metrics in a monitoring system. And this is why,
when you ask an organization what metrics they are monitoring,
they will start talking about system-level metrics
like CPU utilization or service level metrics
like number of requests. Those sorts of metrics
are usually collected automatically
for you by AWS, or they are modules that are built
into your monitoring application. At Amazon we heard
from our developers that it was too difficult
to publish metrics. And there are lots of ways
to get metrics into a system but they often require configuration
or additional dependencies on libraries
that you need to keep up to date. We needed a system that made
creating metrics dead easy so we stopped
and asked ourselves, “What’s the easiest way
for a developer to write data?” Well, anyone who has written code
without a debugger knows how easy writing
to STDOUT is. And you don’t need
any libraries. You don’t need any dependencies
and there’s no configuration. We let our developers define, graph, and track custom data by just building a string and writing it to STDOUT. And because we are already logging to STDOUT, we can generate JSON and have it forwarded to CloudWatch Logs.
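As a minimal sketch of that pattern, here is what it can look like; the field names are illustrative, not the exact CloudWatch Embedded Metric Format schema. The developer only has to build a JSON string and print it.

```python
# A minimal sketch of publishing a metric by writing structured JSON to
# STDOUT. The field names are illustrative, not the exact CloudWatch
# Embedded Metric Format schema; the point is that one log line can carry
# both the metric value and its high-cardinality context.
import json
import sys
import time

def emit_metric(name: str, value: float, **dimensions) -> None:
    record = {
        "timestamp": int(time.time() * 1000),
        "metric": name,
        "value": value,
        **dimensions,                     # e.g. host, customer_id, request_id
    }
    print(json.dumps(record), file=sys.stdout, flush=True)

# No libraries to manage, no dependencies, no configuration:
emit_metric("ResponseLatencyMs", 42.7, host="i-0123456789", customer_id="c-98765")
```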
We are doing this through logs instead of creating a new metric, because this typically ends up becoming high-cardinality data. Now, one of the challenges
with allowing developers to create arbitrary metrics
at Amazon scale is the level of cardinality
it creates. Now, cardinality sounds
like a complex topic but it's actually
relatively simple. The higher the total number of unique entries in your data, the higher the cardinality. Now let me show you an example. If you’re familiar with time series
databases this might sound familiar. Each event has one or more
dimensions associated with it. Suppose I have a CloudWatch metric
called ‘Disc Free Bytes,’ whose dimensions are Amazon
EC2 instance ID and the OS mount point
or a drive letter. This has a cardinality of two, because there are two unique metrics, one for each drive. If you have 100 of those instances
each with two drives then the cardinality of this is 200. Now suppose you have a CloudWatch
metric called ‘Response Latency,’ whose dimensions are the server’s
instance ID and the customer ID. And suppose that I have a hundred
servers and one million customers. The cardinality of that metric
will be 100 million. Creating 100 million metrics
in CloudWatch will be rather expensive, inefficient,
and difficult to analyze. Now suppose we have
thousands of hosts, millions of accounts
and billions of records, the level of cardinality in a data
set like this makes it impossible to use standard metrics
and tools to investigate problems and to understand
the impact of events. Graphing individual data points
also loses the context of a metric. Imagine a request log that measured
which server took the request, which account made it,
and how long it took. If you graphed requests, account,
and time individually, there’s no way
to cross reference them. All of the details and relationships
that are in the logs are lost. If there is a fault there is
no way of telling which server it came from
which users were affected. You won’t be able to tell
if the faults happened for all requests
or just specific ones. But now, since we have all your data in one place, you can start to understand
with each other. And this is where
the real cool stuff happens. Let’s say you’re monitoring
your API and you have a graph like this. There’s probably
a few questions that you’re going to have
to answer right away. Which API is failing?
How many customers are impacted? Which customers
are impacted the most? Is it one bad host or multiple? Which shards, partitions, or buckets are having issues? Finding these answers
in high cardinality data is exactly why we built
Contributor Insights. With Contributor Insights, you can
analyze logs to dynamically extract and report on contributor data. Our goal is to make it easy
to take a graph like this and to show you
what’s contributing to the change. You can see metrics
about top-N contributors, the total number of unique
contributors and their usage. For example,
you can find bad hosts, identify the heaviest
network users, or find URLs that generate
the most errors? And of course, since your metrics are related, we can graph and report on metrics as they relate to each other.
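Conceptually, a rule boils down to counting contributors in your structured logs. Here is a small Python sketch of that idea; the fields mirror the example that follows and are not the actual rule syntax.

```python
# A conceptual sketch of what a Contributor Insights rule computes: scan
# structured request logs and report the top contributors to failures.
# The fields (api, host, customer_id, fault) mirror the example that
# follows; this is not the actual rule syntax.
import json
from collections import Counter

log_lines = [
    '{"api": "PutItem", "host": "host-a", "customer_id": "c-1", "fault": true}',
    '{"api": "GetItem", "host": "host-a", "customer_id": "c-2", "fault": true}',
    '{"api": "GetItem", "host": "host-b", "customer_id": "c-1", "fault": false}',
]

def top_contributors(lines, key, limit=5):
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("fault"):
            counts[record[key]] += 1
    return counts.most_common(limit)

print(top_contributors(log_lines, "api"))           # is one API failing?
print(top_contributors(log_lines, "host"))          # or is it one bad host?
print(top_contributors(log_lines, "customer_id"))   # which customers are hit?
```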
Now let’s take our API faults example and show you how Contributor Insights lets you find out
what the problem is. Let’s start with a series of logs
that just include a few fields: API, host, customer ID, and whether the request failed or not. Using CloudWatch alarms,
we are alerted when the number of API failures exceeds a value that we defined. So we open Contributor Insights and look at a few rules
that we have already defined. First, we can look at the rule that shows top five failures
by request time. So it doesn’t look like the problem
is specific to any API. Now what if we look at the failures by host? Ah, so host A has the failure. Let’s say we fix it and want to know
which customers were most impacted so we can contact them. You can build whatever rules
you need to understand which metric is contributing to
the behavior of your application. If you needed to reference these regularly you could generate
persistent graphs from these rules and add them to your dashboards and alarms. We are providing built-in rules
that you can use to analyze metrics from other AWS services. For example,
with DynamoDB rules, you can quickly determine which items or partition keys are the cause of any throttling that is happening in your database. With Contributor Insights, you have
the power to understand cause, impact and blast radius of faults
quickly and easily. At this re:Invent we are also announcing Amazon DevOps Guru. DevOps Guru takes the expertise that we’ve built from operating
the infrastructure and detecting failures
over the last 20 years, into a service. It uses machine learning
to identify potential issues inside of your account and also provide links
to recommend fixes. Available through the console, you can search
and visualize operational data across Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS
CloudFormation, and AWS X-Ray. For example, DevOps Guru can predict
when the amount of data in an RDS database will exceed
your provisioned storage. This would have been great
for Brainly during the early stages
of the pandemic, since this was one of the services they were scaling manually. It also recommends changes
to instance sizes and configurations, before an application
runs out of resources. Now, the last requirement
of an observable system is tracing. Most application requests
involve many services working together
to fulfill the request. Even simple 3-tier apps
have to pass through a load balancer, to servers and a database.
When working on a distributed system, tracing the source of intermittent
and undetermined failures can be extremely difficult. And this can be mitigated
by the use of a request ID that’s passed through
every layer of your stack. A trace ID is a unique identifier
that is stamped on the request by the Front Door service. From there,
the trace ID is propagated to every other service
the request touches. If you’ve ever logged a support ticket for S3, you have been asked to provide this. For S3, the X-AMZ-Request-ID is an ID that is passed to every
one of the services that make up S3. This allows us to correlate logs
between different backend services. It helps us understand exactly which
hosts and services served the request, and find any errors
associated with it. Tracing provides you
with the ability to understand exactly what happens
throughout your entire system. But let me repeat this: you must pass the trace ID between your tiers and put it in your logs.
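A minimal sketch of what that propagation can look like follows; the downstream service URL and the header name are assumptions, generalized from the S3 example above to your own stack.

```python
# A minimal sketch of trace-ID propagation. INVENTORY_URL and the header name
# are assumptions for illustration; the front door stamps a unique ID, every
# hop forwards it, and every log line includes it so logs can be correlated.
import json
import logging
import urllib.request
import uuid

TRACE_HEADER = "X-Request-Id"
INVENTORY_URL = "http://inventory.internal/check"   # hypothetical downstream service

def handle_front_door_request(incoming_headers: dict) -> None:
    # Reuse the caller's trace ID if present, otherwise stamp a new one.
    trace_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))
    logging.info(json.dumps({"trace_id": trace_id, "event": "request_received"}))

    # Propagate the same ID to every service this request touches.
    downstream = urllib.request.Request(INVENTORY_URL,
                                        headers={TRACE_HEADER: trace_id})
    urllib.request.urlopen(downstream)

    logging.info(json.dumps({"trace_id": trace_id, "event": "request_completed"}))
```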
We make extensive use of canaries to monitor our systems. The term ‘canaries’
actually comes from coal miners. They placed canaries in the mines
to know when there was a problem. And because they are smaller
and more sensitive, the canary would get sick and stop
singing before any of the miners did. For application monitoring,
your canary has the same purpose. Canaries perform the same actions
as your customers. They continuously verify
your customer experience even when you don’t have
any customer traffic. Canaries should run continuously,
and especially during deployments, and alarm any time
there is something unexpected. Even if your application
hasn’t failed, a canary can give you
an early warning by alerting when something took,
for example, longer than expected. And to build your own canaries
we have CloudWatch Synthetics. Canaries built with CloudWatch
Synthetics are NodeJS scripts that run as Lambda functions
in your account. They work for both HTTP and HTTPS
and can test for UI components by providing programmatic access
to a headless Google Chrome. One of the great things
about building canaries with CloudWatch Synthetics is that it integrates
with CloudWatch Service Lens, and AWS X-Ray, to provide a graphical
end-to-end view of the services
in your application.
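CloudWatch Synthetics canaries are NodeJS scripts, but the underlying idea is simple enough to sketch in a few lines of Python; the endpoint and the latency budget here are assumptions, and this is not the Synthetics runtime itself.

```python
# A generic canary sketch. CloudWatch Synthetics canaries are NodeJS scripts
# running as Lambda functions; this just shows the idea. The endpoint and the
# 500 ms latency budget are assumptions. Note that it emits a data point on
# success too, so that a missing data point is itself a signal.
import json
import sys
import time
import urllib.request

ENDPOINT = "https://example.com/health"    # assumed customer-facing endpoint
LATENCY_BUDGET = 0.5                       # seconds

def run_canary() -> None:
    failed = 0
    start = time.monotonic()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
            if response.status != 200:
                failed = 1
    except Exception:
        failed = 1
    elapsed = time.monotonic() - start

    if elapsed > LATENCY_BUDGET:
        failed = 1    # early warning: slower than expected counts as a failure

    print(json.dumps({"canary": "checkout-health", "failed": failed,
                      "latency_seconds": round(elapsed, 3)}), file=sys.stdout)

run_canary()
```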
Now, I would like to introduce Becky Weiss. She is a Senior Principal Engineer
who worked on many of our systems, including AWS IAM,
Amazon VPC and Lambda. Becky is one of the many experts
we have on monitoring. She has some insights into how
Amazon engineers think about data and why it’s unique. Thank you, Werner, and welcome
to all you re:Invent attendees. My name is Becky Weiss,
I am an engineer at AWS, or, as we like to call it here,
a ‘service owner’, and it’s my honor to give you
the AWS service owner’s perspective on this topic of monitoring
and observability. Now if you ever read anything
about how Amazon does business, you know that a principle
sitting in the middle of everything we do
is customer obsession. And our principle doesn’t stop
at our product roadmap, rather it extends into every aspect
of how we operate our services. And yes, maybe even especially
how we think about the eventuality of failure. Everything fails eventually,
you know that, and you design
in a way that expects it. And for me and for the many
talented operators that I get to work with at AWS, the name of the game is whether
we can see these signals of impending failure
before our customers actually experience failure.
So how do we do that? Well, we’ve got our logging system,
we’ve got our graphs, we’ve got our dashboards,
we’ve got great tools, we continually invest heavily
in those tools, we are never done. And you might be expecting me
to say next that it’s all automated, and maybe it’ll surprise
you to hear this from an AWS engineer,
but automation is necessary but not completely sufficient
for operational excellence. Don’t get me wrong, automation
does play a large role, and we rely on it heavily,
so what’s the rest of it? Mindset and experience. The good old-fashioned practice
of a human brain doing what the human brain does,
looking for patterns, and being curious
about what it sees. We train our operators
to be optimistic pessimists, they are optimistic
about the business and ever-expanding universe
of possibilities it creates for our customers,
but we’re pessimistic and curious when it comes
to operational health. For a couple of minutes, we are going
to have a little bit of fun. I am going to take you
into the human side of all of these things
that we measure, plot and chart. I am going to show you graphs
like the ones we see, take you through what we think
about them and the questions we ask. And to do that,
I am going to take you through three contrived scenarios. These aren’t real graphs,
I actually drew them by hand, but they’re similar
to the ones we see, and I am going to take you inside
our brains as we look at them. Here’s our first fake graph.
Well, we’re privileged here at AWS to get to work with a lot of graphs
shaped like this one, and I hope you are too
in your business, because this graph shows
more volume over time. That’s great,
that’s business success. Alright. Now I’m going to show you a
different graph for the same service. This graph is measuring
something where more of it, higher value, is worse. It could be latency,
it could be other things that negatively correlate
with your customer experience, like maybe how long it takes
to complete a workflow or process a chunk of data.
So what do I see here? Well, I see that as my business
grows over time, this metric’s growing too,
and that’s bad. And not only that, but it’s getting
more variable, and that’s bad too. So now I can’t tell you exactly
what’s going wrong here, it’s fake, I literally
can’t tell you that. But at AWS, based on
our experience at scale, I could almost guarantee you
from a graph like this that we’re approaching
some kind of constraint, limit, contended resource, maybe a new pattern
that we didn’t know about before. And we might even be starting
to bump our heads on it, even if, from
a customer’s perspective, things are still mostly fine. So we are always actively
looking for shapes like this on our dashboards, because sometimes these things
do take quite a bit of work to find and fix
and we make those investments. Okay, here’s our second graph. So we followed best practices,
our service runs canaries that automate end-to-end scenarios
that ensure they are working. Yours should too, we talked
a little bit before about CloudWatch Synthetics
as a great tool to help you do this. So, we have a graph of failures,
and it looks like we’ve failed a run, so you review it,
maybe you even know why, like maybe you knew that
a direct dependency of yours was having a problem
during that minute. But again, let’s look at it
through that mindset of that curious
optimist/pessimist. There is a lot of empty space
in this picture. So I know what happened
during that one minute, my canary failed is what happened.
But what about all that other time? I don’t know. It could be good news,
it could be bad news. No news could be good news,
because it didn’t fail. It could also be bad news because maybe the canary
couldn’t even run for some reason. I don’t know.
And we don’t like not knowing. So if we see a graph like that, and it’s something we monitor
and care about, we want this graph instead. We want that canary posting a
zero value when it runs and succeeds, because then we know that no news
is bad news, so we can take action. Okay, final example, I’ll show you
a little bit of good news. This one looks good, right, I am measuring my latencies
and percentiles. I’ve got 99th percentile here, I’ve set an alarm threshold
that’s meaningful to me and my customers,
mm, pretty good, yeah, great job. And you know what we would do
with a graph like this? We’d lower the threshold.
Why did we lower that threshold? Because, typically,
the original threshold, we would have put
some thought into, we would have set it well
within the bounds of what we’ve determined to be
an acceptable customer experience, so our customers were fine with
anything under that original line. But once again,
a lot can happen under that line. The graph could go like this. And even if our customers
wouldn’t have been affected, it’s a signal for us curious
pessimists that something changed. We want to know what that is,
maybe do something about it if there’s a new constraint
being quietly encountered. Okay, now all that might have just
looked like obvious commonsense. You know what it is. But the reality is,
when you go around and you look
at the various approaches to operations out there in the world,
there is a wide range. Everybody has got
their metrics around latency and service faults and great tools,
but what about your mindset? Are you measuring the things
your customers care about? And if it were to change,
would you get that signal? Are you looking at that data
like a curious optimistic pessimist? The operationally trained brain
is primed to ask these questions. I personally find
operating AWS services with that mindset to be one of
the most interesting and rewarding things I have
ever had the opportunity to do. And if you approach
operating your own systems with that same sense
of curiosity, I bet you’ll find the same
wherever you are, doing whatever it is
that you do in the cloud. Thank you very much. And have a wonderful re:Invent.
Happy operating. Thanks, Becky.
I think you’re absolutely right. At the end of the day, all these systems
are being monitored by humans. They are all just guessing what metrics they think they are going to need and where they should alarm, for example, when your metrics fall outside particular values for a certain time period. We talked a lot
about CloudWatch today, but when it comes to collecting and
visualizing modern operations data, there are a few open source tools
that have become very popular. As part of the Cloud Computing
Native Foundation, or actually I say that wrong,
Cloud Native Computing Foundation, Prometheus is a tool
that makes it easy for customers to monitor container
environments at scale. Grafana is an open
source project for interactive data
visualization services used for monitoring
and alarming, that’s commonly used with
the Prometheus open source project. Grafana supports
multiple data sources, such as Prometheus,
Amazon CloudWatch, AWS X-Ray,
Elasticsearch and AWS Timestream, allowing for the creation
of dashboards and alerts
from multiple sources. Although it’s easy
to deploy a single Prometheus or Grafana
server in AWS, it can take weeks of work
to scale across multiple servers and configure the entire environment
for high availability. So I am excited to announce Amazon Managed Service
for Grafana and Prometheus. Using these services, we will manage the provisioning
and setup for Prometheus along with, of course, ongoing
maintenance and scaling operations. The Prometheus Query Language
is optimized for large volumes of data,
commonly in container monitoring. This makes it easy
to search and group metrics, such as CPU, memory, and latency, at a granular level so that container issues can be
isolated and alarmed on quickly. Engineering teams can use the same
familiar Prometheus Query Language to filter, aggregate
and alarm on metrics and quickly gain
performance visibility without any code changes.
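To give a feel for what this looks like from the application side, here is a small sketch using the open source Prometheus Python client; the endpoint label and the PromQL query in the final comment are illustrative assumptions.

```python
# A small sketch of instrumenting a service with the open source Prometheus
# Python client (pip install prometheus-client). The endpoint label and the
# PromQL query in the final comment are illustrative.
import random
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Latency of handled requests",
    ["endpoint"],
)

@REQUEST_LATENCY.labels(endpoint="/checkout").time()
def handle_checkout() -> None:
    time.sleep(random.uniform(0.01, 0.05))     # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                    # exposes /metrics for scraping
    while True:
        handle_checkout()

# Illustrative PromQL: 99th percentile latency per endpoint over 5 minutes
#   histogram_quantile(0.99,
#     sum(rate(request_latency_seconds_bucket[5m])) by (le, endpoint))
```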
The Amazon Managed Service for Grafana makes it simple
for engineering teams to query, visualize and alert
on data services such as metrics, logs and traces,
no matter where they are stored. These services are available
today as a preview on AWS. When it comes to observability, we talked a lot about CloudWatch
and other AWS services. But AWS isn’t the only company
building these services. And I’ve always said that AWS is
so much more than just AWS services. And I’d be remiss if I didn’t
recognize all the tools being developed for a complete
monitoring and alerting ecosystem. We have great partners who are
also operating in this space, like Sumo Logic and Splunk
and Datadog and New Relic and AppDynamics. But all these have a different
approach to collecting data. And it can be challenging
to combine the different approaches, which is something the Cloud
Native Computing Foundation is trying to create
a foundational approach for, with the OpenTelemetry Project. Open Telemetry provides open source
APIs, libraries and agents, to collect traces and metrics
for application monitoring. The AWS distro for OpenTelemetry
consists of collectors that are built
into the application and exporters that send data back
to backend analytics targets. In addition to supporting AWS
targets like CloudWatch and X-Ray, customers can also send traces
and application metrics to a number of AWS partners
and third-party providers. The distro for OpenTelemetry simplifies the process
of collecting data by allowing you to instrument
your applications just once, instead of using
multiple tools from different vendors to collect metrics and traces. We are excited about
the OpenTelemetry Project and, in addition
to providing this distribution, we are also contributing
to the upstream project for a number of components. Now, no matter
how you choose to log, monitor, trace and alert,
there is a tool that fits your needs. [music playing] We have covered
a lot of ground today. We have talked about
the importance of development, how to build dependable applications,
and how to effectively run them. If you paid close attention, you will notice that there has been
a trend with all these things. More and more AWS is taking tasks
that can be slow, difficult, or time-consuming,
and making them easier to use by using advanced technologies
to simplify them. These technologies can include
automated reasoning, or even machine learning. Take this AWS Panorama
appliance, for example. This device allows you
to deploy machine learning models to existing
industrial cameras, like the ones that will be here
in this factory. With this device a business
could run computer vision models for tasks such
as quality control, or part identification or security
or workplace safety. And this isn’t changing
any of the processes that were already being done, but it’s improving them
to work more efficiently. Or there’s Amazon Lookout
for Equipment, a machine learning service that detects abnormal equipment
behavior using IoT sensors. As technologies advance, you will continue to see
these technologies improve our work, so that we can all be more efficient. At AWS, with every
new service we build, we ask how we can make it better
by using machine learning. And you can see this
with many of the services we’ve released over the past
few weeks around databases, security, and operations. These aren’t machine
learning services, but they are services
enhanced by machine learning. Amazon isn’t alone
in doing this. Just like Ava, businesses around
the world are integrating machine learning in its existing applications and data to get more value out
of what they already have. As technology that powers
our work advances, we will continue to chip away
at the heavy lifting that we all do
on a daily basis. So, as you are building
or using a system, take a few minutes to think about
which parts can evolve
from simple automation, and can make use
of advanced technologies such as machine learning. You might be surprised by all
the places that ML can help. When we started today,
we talked a lot about how AWS is meeting
developers where they are. And what if we applied machine
learning to that? Services like Amazon CodeGuru
were built to solve problems exactly like this.
When you are writing software, there are a lot of things
that need to be checked. Problems like memory leaks,
or hard coded credentials and duplicate lines
after refactoring, which won’t prevent
your code from compiling, but can still cause problems
for your application. Typically, these problems
are identified during code reviews
before branches are merged. But these are difficult tasks,
and some vulnerabilities are easy to miss, especially if there are many changes
that are happening at once. And that’s why we built
Amazon CodeGuru. CodeGuru uses machine
learning to automate code reviews during
application development and to profile applications
after they have been deployed. As code is checked in,
CodeGuru Reviewer will automatically review your code, just like a senior engineer
in your team would do. It provides advice on what’s wrong,
and gives you links to documentation. What is great about how it works
is that it does these checks automatically
when you check in the code, just like you would do
with another code review. This way you can find
and fix problems early, and the code reviews performed
by members of your team can focus on more important aspects
of your business logic. CodeGuru Profiler, on the other hand,
attaches to running applications in your test
or production environment. Using machine learning,
it inspects your running applications in order to find
performance bottlenecks. It allows you
to troubleshoot latency and CPU utilization issues, and to learn where you can
reduce infrastructure costs. It identifies
application performance issues. By combining automated
reviews from CodeGuru with the learnings from AWS
Fault Injection Simulator and the recommendations
from DevOps Guru, we can improve the entire
development lifecycle using machine learning.
And this is just the beginning. There are so many parts
of application development that involve writing and rewriting
bits of our applications that are essentially just plumbing
and aren’t of any business value. As the tools we use advance,
machine learning is going to continue to remove undifferentiated heavy
lifting of building software. Like machine learning, another field
I am getting really excited about, and that I think is going
to change our assumptions about what is possible in the world,
is quantum computing. I know. We’ve been hearing
about Quantum for years now, and how it will be
the next big thing. I truly believe
that at some point it will become the next
game-changing technology. It’s going to happen slow at first, providing small optimizations
and enhancements, but eventually it will revolutionize
the areas it is well-suited for. Chemistry research,
drug discovery, material sciences, they are all going to be
some of the first industries that will benefit
from quantum computing. Just like GPUs have changed
the field of machine learning, I believe that quantum processors
will eventually do the same for many of these
scientific fields. AWS is investing
heavily in quantum. Our Quantum Solutions Lab
connects experts with organizations to build internal expertise
and strategies required to run
quantum workloads. The AWS Center for Quantum Computing
is a partnership with Caltech where we are researching quantum
computing algorithms and hardware. Just last week, our scientists
published a new research paper showing a theoretical quantum
computing system with groundbreaking improvements in error correction. We also launched Amazon Braket
which democratizes access to quantum resources
from a number of providers. And this is where the power
of the cloud really shines. A popular use case of Braket is
for developers to learn to evaluate if quantum could enhance
their workloads as it advances. Historically, you’d have to wait
for technology like this to leave the research phase, be turned into
a mass-produced product, and purchased at enormous cost. Only then would you
be able to determine if it would actually help you. By making quantum computers and expert assistance available to every developer today, we are helping AWS
customers stay one step ahead. As a software developer, now is the time to start thinking
about quantum computing. It is the way that we are
starting to see machine learning make an impact
on our daily lives, and the future that technologies
like quantum computing will enable that make me so excited about cloud
and technology as a whole. And I am excited to see
how developers will use these technologies to truly
improve the world around us. I want to thank you for spending
time with me today as we explore this amazing space
near my favorite city. This year has been challenging
in a lot of ways, but I also believe that challenges
are the best time to reflect and think about
whether you are building the right services
for your customers. We have talked about how AWS
is meeting developers where they are, and that’s because
you are our customers. As you develop your applications, think about what you can do to meet
your customers where they are. Many of us have experienced
severe anxiety and uncertainty
about the future in the past year. Uncertainty about our jobs,
health, financial future, family, and much more.
I strongly urge our customers, for example those
in financial services, to be conscious about this when they build new ways
to engage their customers. Address these important
issues upfront, and let them come back,
for example, in the way you design
your interfaces, or what services you could
build for your customers that help them in addressing
their uncertainty. For almost all of us, digital
services have become essential. But this means that these services
are not just for the digital natives with their fiber connections
and the latest smartphones. Consider the experiences
you are building for them, and how they access them. Not everyone has 5G or even a strong
Wi-Fi connection at home. If you build essential services, make sure they also work on low
bandwidth, high latency connections. There are enough news reports of kids having to go to the parking lot
of a grocery store or a fast-food restaurant to just get a decent
internet connection for school. The applications you build
are essential for your customers, whether you are
building a service that helps people budget
what they have, and helps them predict
their immediate financial future, or maybe you are
building a website that helps people stay
connected to each other where they wouldn’t
be able to otherwise. We, as developers, have
a responsibility to our customers to build the best applications
we can for them in ways that take the current
reality very seriously. It's never been a better time
to use your knowledge, skills and talents
to make a difference in the world. Now, go build. [music playing]