[ambient sounds] If there’s an upside to this past
year of working from home, it’s that I’ve been able
to spend it in my hometown. Amsterdam. I can’t remember
when I was able to spend such an unbroken stretch
of time in one place. And it’s my place. [ambient noise] Amsterdam is the story
of people harnessing technology. All of these canals
that were built centuries ago as a way to direct water away
from settlements became the centerpiece technology
that enabled the expansion of a city, the flow of commerce, and ultimately a connection
that reached out to the entire world. For me, this beautiful city is
a reminder of what becomes possible when you connect people
and ideas. It shows what can be done
by artists and technologists on behalf of people
in communities. [music playing] All we need is some creativity,
the right tools, and some room to build.
Technology always moves forwards but sometimes it's also good
to look back, to look at where we came from,
to understand our foundation, and the building blocks
that have propelled us at lightning speeds
into the world we live in now. Which is why I’m so excited
to have found a place that will help me tell the stories
of yesterday, today, and beyond. [music playing] [music playing] I’m speaking to you today from the CSM Suikerfabriek or Sugar City
as it is called today. A factory in Halfweg
just outside of Amsterdam and today
a traditional heritage site as it stands at the site of a castle
used by the Rijnland Water Board. [music playing] Sugar City or the Suikerfabriek
in Dutch was built in 1863 and was one of the many
sugar beet processing plants here in the Netherlands. The site has evolved
over the years and today it is used for music
events and retail shopping. And although
it has changed a lot, the stories this place tells about
technology, resilience and operations are still valuable
in the digital world we live in today. These are the evaporation boilers
that were used for sugar processing. Equipment like this was managed
from a central control room. If it were running today, it would
be operating much differently. Modern factories rely heavily
on IT devices, data gathered
directly from equipment, and real time computing to quickly
alert when there is a problem. A device like AWS Snowcone
could be one way for Sugar City, if it were operating today, to collect data as it comes
from the equipment. The Snowcone could process
the data, store it, and connect the boiler
to the factory’s alerting and automation systems. This way, the factory
isn’t bound by latency or transfer speeds
of the internet connection. This factory has stood here
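To make that concrete, here is a minimal sketch of the kind of edge logic a device like Snowcone could run next to the boiler. The sensor reading, the threshold, and the file name are hypothetical and purely for illustration:

```python
import json
import random
import time
from pathlib import Path

# Hypothetical values; a real deployment would read from the boiler's sensor
# bus and call the factory's actual alerting endpoint on the local network.
ALERT_THRESHOLD_C = 130.0
BUFFER = Path("boiler-readings.jsonl")   # local storage on the device

def read_boiler_temperature() -> float:
    # Placeholder: simulate a sensor reading.
    return random.gauss(120.0, 8.0)

def raise_local_alert(reading: dict) -> None:
    # Placeholder: notify the on-site alerting system, no internet required.
    print("ALERT:", reading)

while True:
    reading = {"ts": time.time(), "temp_c": read_boiler_temperature()}
    if reading["temp_c"] > ALERT_THRESHOLD_C:
        raise_local_alert(reading)            # react locally, immediately
    with BUFFER.open("a") as f:               # buffer every reading locally
        f.write(json.dumps(reading) + "\n")
    time.sleep(1.0)
```

The buffered readings could then be shipped to the cloud in batches whenever the connection allows, or by returning the device.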
for over a hundred and fifty years. It has a lot of stories to tell. We can learn a lot
from places like this. We just need to know where to look. [music playing] It has been a challenging year
for all of us. Everyone I know has been affected,
either professionally or personally. And I’ve talked to quite
a few of our customers who’ve been affected
in the most dramatic way, with either family members
or colleagues passing away. We have a long way to go
before we can leave 2020 behind. As builders,
we face unique challenges to keep doing what we were doing.
But we also have the opportunity to make a disproportionate
impact as digital has become the default way
to access services. With the success of these
essential digital services, we’ll continue to expand on that even if we can return
to some form of normal. 2020 has tested all of our systems.
Physically, digitally, and mentally. And with the changing needs
and behaviors of our customers, the needs of our applications
are changing as well. Just like this factory, many businesses have had to change
and adapt in order to survive. It’s never been more important
to stop, assess your needs, and ensure you’re focusing
on the right things. At Amazon and AWS, we’ve been
very fortunate in a lot of ways. Although many things are different, not a lot has changed
about how we work. Our focus on small
independent teams that are self-sustainable
and self-directed set us up for success
in these challenging times. And we’ve seen all around us
that many companies that rely on complex supply chains
to meet increased demand are having
serious challenges. And we are also very fortunate
at AWS to be part of a business
that relies on robust procurement and fulfillment processes.
It allowed us to keep up with demand even when other Cloud providers
were struggling. We’ve been doing remote development
for a long time so we already had the tools,
processes, and mechanisms in place to support
distributed collaboration for teams that are spread out
all over the world. We have a long history
of going where the talent is instead of forcing everyone
to come to Seattle. All our engineers are working
from home or, actually, anywhere they would like,
just not the office. And given the success we had with it, I don’t think that’s going
to change any time soon. Whilst Amazon and AWS have been
very fortunate to have been prepared for this, many of our customers
have not been so lucky. Think about the hospitality
and the airline industries. In talking to some of
our airline industry customers, they expect that even if we can leave
the pandemic behind in 2021, they do not expect
to fully recover until 2025. And they expect that
long-haul business travel may never return
to previous levels. Their cargo divisions, however,
do expect to survive. And I’ve seen how some of them
are pivoting to find new businesses or are doubling down
on the things that do work. And there’s an interesting analogy
with this sugar factory. With the diminishing need for sugar, they pivoted to focusing
on storage and packaging. And when that moved offsite, this factory was converted
into a retail and events space. But when this factory was focused
on only adapting to changes, the other sugar factories
in the Netherlands were branching out
into other industries. Today, there are two significant
Dutch sugar providers. But in addition to sugar, one processes other crops
like potatoes and onions, and the other one is a producer
of baking ingredients. They diversified to better
be able to withstand unexpected hardships like this. Now, we were very fortunate
that we had established a number of years ago an AWS
Disaster Response Team whose task it is
to immediately reach out to customers who may have been
negatively affected by a natural disaster,
like earthquakes or typhoons. And as such,
we hit the ground running when we saw how customers started
to get affected by the pandemic. Personally, I’ve been working
with quite a number of our customers who have been adversely affected to help them get fundamental control
over their cost by moving to, what I’ve called in past keynotes,
cost aware architectures so that the business has knobs
to turn with respect to scale, reliability, performance to meet
their needs now and in the future. A number of these customers
have also seen down times as an opportunity
to accelerate and innovate. One of the most important
ways that technology and the business
can work together is by using data to power
new experiences for your customers. I want to introduce you
to an AWS customer who really embraced this idea. Ava is a digital health company
with a mission to advance women’s
reproductive health by bringing together artificial
intelligence and clinical research. So, let’s hand over the camera
to Lea von Bidder, Ava’s co-founder and CEO, who’s recording in her home
in Switzerland. She has a great story
about the amazing insights that Ava is able to produce
for their customers by making new use
of the data they collect. Thanks, Werner.
I’m really happy to be here today and share with you insights
about how we at Ava harnessed the power of machine
learning, clinical research, Cloud technologies, and AWS services
to improve women’s health. Digital technology
and Cloud services have transformed almost every aspect
of our daily lives. With just one tap,
I can order a ride, get a room, share a photo. And, of course, digital technologies
have also started to revolutionize
modern medicine with AI being able to do
really interesting things, such as diagnose fractures
or detect atrial fibrillation. However, when it comes
to women’s health, progress has been very slow and this
is largely due to the lack of funding and due to the lack
of research in this space. We at Ava want to change that because we believe that so many women
all around the world face unique and often unaddressed
health challenges when it comes
to their reproductive health. Be that contraception, conception,
pregnancy or menopause. Our vision is to be
a long-term trusted companion for women across
their reproductive journey by giving them personalized,
data-driven and actionable insights about their health.
And how do we do that? We’re doing it by bringing together
artificial intelligence and clinical research. For our users, it all starts
with the Ava bracelet. A wearable device that collects
individual health data about her, including breathing rate, heart rate,
heart rate variability ratio, perfusion, and skin temperature. Every night, we collect more
than three million data points from each of our customers and integrate that data
into their health histories where our multiple algorithms
improve over time, learning each woman’s
menstrual cycle patterns and making individualized
health recommendations. In 2016, we launched
our first product. The Ava fertility tracker. And we have since then helped over
forty thousand couples get pregnant using the bracelet by empowering them
with actionable insights and by giving them a daily
real time fertility indication so they can better time
conceptive sex. But we are not simply
in the fertility business. What we’ve actually built
is a technology platform for collecting long-term personal
health data for women over five, ten, fifteen years and more. And in fact, it is managing
this huge dataset that is really at the core
of our business. So, when we started our journey
five years ago, we put a lot of thinking into
how to best set this up and which Cloud provider to choose
because we knew, if we were successful,
our data would grow exponentially and we would have
massive amounts of data to be processed and stored
in close to real time. We knew we needed to be able
to scale our computing power to manage the daily traffic peaks
that result from our users syncing their device each morning.
We needed security to protect the sensitive personal
health data we collect. And we needed
a flexible architecture to be able to add new services
and applications on a regular basis. Ultimately, we selected AWS
as our Cloud provider. Especially for the reliability
and efficiency of Managed Services, data solutions like
Amazon Relational Database Service, MongoDB on AWS,
and Amazon Simple Storage Service that enable us to operate the three hundred terabytes
of users’ health data in a reliable
and secure way, the ease with which
we can orchestrate services and deploy web applications,
and of course, a large community of developers
and guidance around best practices to bring it all together.
Thanks to our collaboration with AWS, we have the technology we need
to accelerate our ability to innovate and to provide new medical
grade services and applications. Following the success
of our fertility application and in line with our vision,
we will soon be launching a non-hormonal
contraceptive solution. We’ve also been able to move
very quickly to address the current
COVID-19 pandemic by developing an early detection
algorithm that works for everybody but is particularly
well suited for women. When COVID-19 hit, we found
ourselves in an unusual position. The Ava bracelet was one of
the only medical grade devices on the market with multiple sensors potentially related
to COVID-19 symptoms such as temperature
or breathing rate data. We quickly realized that we could
utilize our unique clinical knowledge as well as our personalized
understanding of our users’ data to detect anomalies
in body temperature, heart rate,
and breathing rate in a way that could aid in early detection
of COVID-19 infection, even before users realize they’re ill
or they experience symptoms. This represented a huge
step forward in point of care. Providing users
with actionable insights, it can lead them to seek testing or to self-isolate
to reduce transmission risks. In just a couple of weeks,
we developed and deployed a two-thousand-person pilot study,
in Liechtenstein, to collect data and train an algorithm
for early COVID-19 detection. Now, we are ready
to validate our algorithm on a much larger sample
of twenty thousand men and women. Soon, we will be opening recruitment
for this medical trial as part of the COVID-RED
Consortium funded by the Innovative
Medicine Initiative. This project brings together
leading academic and industry experts
in public health, epidemiology, wearable technology,
and machine learning. So, how does it work? Participants will wear
the Ava bracelet for nine months, during which time they will receive
real time daily indicators about their likelihood
of having a COVID-19 infection based on their personal health data
and self-reported symptoms. Our goal is to see
if we can leverage the power of AI
and personal health data to aid in early detection
as well as provide advice on when to seek testing
or other treatment so that medical resources can be
conserved and used more efficiently. The speed at which we’ve been able
to support the COVID-RED Consortium and begin clinical trials
is a great example of what’s possible when a small flexible company
like Ava is backed by AWS, with a scalable, flexible, and well-architected
infrastructure. And it’s exciting because
the clinical work we’re doing with the COVID-RED Consortium
has utility beyond COVID-19. It has the potential to aid
in early detection of other types
of infection as well. Because when we started with Ava,
our goal was to improve women’s lives for the better
by giving them more control and understanding of
their reproductive journey. I am really proud of the work
we’ve done at Ava to realize and expand
on that vision, and even more so to see
our data-driven and scientifically proven insights applied to the global good
of early COVID-19 detection. Thanks to our fantastic team,
our wonderful customers, the partners who have
supported us along the way, and everyone at AWS
who has been part of our journey. Thanks, Lea. That was great. The thing I really like
about Ava’s story is how they consider
themselves a data company, much more than
a device manufacturer. And this really shows
how business and technology can work together to create
something greater than what they were able
to do on their own. Customers like Brainly did a lot
to prepare for events like COVID but still had
quite a few surprises. I talked to them a few weeks ago
for a recent episode of Now Go Build. They’re an online learning system that helps students
with their homework. And even before the pandemic,
they were growing and built an infrastructure
to handle their growth. As schools moved out of the classroom
after the pandemic hit, Brainly saw a dramatic
increase in usage. In the span of four months,
they hit a growth target that they’d originally thought
would take twelve months to achieve. They grew from a hundred
and fifty million users at the end of 2019 to two hundred
and fifty million users in May 2020. I remember their CTO saying,
“I thought we knew our business.” The way students learnt
and worked together before the pandemic
was very different after they started
and attended classes online. Brainly had built
significant automation for scaling up and down based on well-known
traffic patterns they had. But not everything
was automated because their long-term scaling
had been extremely predictable and their systems weren’t built
to handle dramatic spikes in usage. And certain services
like caching and databases were alerting only a few hours before
they degraded instead of weeks. Cost efficiency unexpectedly
scaled better than the infrastructure and they were able
to exploit economies of scale. Their estimation is that
it cost forty percent less to scale to this level
than initially predicted. Another interesting
observation they made is that fewer distractions
and interruptions in the office let them much better
develop productively. Now, today’s developers need
the right tools to be able to work
outside the office. And it’s not just working from home,
it’s working from anywhere. Your tools should all be part
of the heavy lifting of IT. They should enable you
to move fast and get a job done
reliably and securely. But as our infrastructure is changing
and the way systems are being built is changing rapidly
because of the Cloud, we need to update
our tools as well. And many of us do not want to be
tied to our laptops for developing. And especially engineers have become
hooked on Amazon WorkSpaces. Actually, an interesting story is that one of my colleagues
had a laptop crash but without easy access
to a centralized IT helpdesk, he was forced to move to WorkSpaces
as a temporary solution. But he’s so happy now,
he will not go back. Now, it’s typically thought of
as a business productivity tool but it is also ideal
for development from anywhere. And you can make
multiple WorkSpaces configurations with everything set up ready
to go, like your IDE of choice, environment, and code,
and things like that. So, this is the equivalent
of your Cloud Developer Desktop. Makes it easy to work
wherever you are and from any type of device. But AWS Cloud9 is really the next
generation of this experience. Browser based, it limits spend to your needs
and improves responsiveness. If you work on multiple
different projects, you can have different developer
environments for each of those. And it is used
by many teams at Amazon as an alternative to Cloud Desktops.
Cloud9 has the concept of builders, a series of tasks
that built your system, runners, how and where you want
to run your system, and debuggers. What is extremely useful
in Cloud9 is that you can share
what you’re doing live with someone else which enables,
for example, remote pair programming. But because it’s a Cloud IDE
with live sharing features, we saw a significant increase
in usage of Cloud9. One particular area
where Cloud9 has become extremely popular
is in education. For example, Harvard’s online
CS50 course uses Cloud9
for teaching computer science. The AWS Console and other dashboards
are great for exploring services and seeing what they can do.
But if you are as old as me, I prefer to use the terminal
for common tasks. But I also have a bunch
of not-so-great scripts to then set up the environment that I need for
any particular task at hand. So, I’m excited to announce today another way where we’re meeting
developers where they are and giving them
the best of both worlds. Today, we’re launching
AWS CloudShell. It is a new service
built into the AWS Console that gives you access to a
Linux terminal inside your browser by just clicking on an icon. And when you start
a new CloudShell session, it is automatically preconfigured
to have the same API permissions as you do in your console. This means you don’t have
to manage multiple profiles or API credentials for different dev,
test and production environments like you would normally have
if you worked in a terminal. With these credentials
being automatically forwarded, it is simple to start
a new CloudShell session and use the preinstalled
AWS tools straight away. The AWS CLI,
Elastic Container Service CLI, and the Service Application Model CLI
are all standards inside CloudShell. And CloudShell is not just an AWS
command line interface though. It is a fully featured
Shell environment based on a Linux 2 and includes several other
common CLI tools preinstalled,
like Python and notes, Bash, PowerShell, FIM, Git, and so on.
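As a small example of what that looks like in practice, here is the kind of Python script you could run in a fresh CloudShell session, relying only on the forwarded console credentials (a hedged sketch; boto3 can be pip-installed if it is not already present):

```python
import boto3

# No profiles or access keys to configure: CloudShell forwards the
# credentials of the console user that opened the session.
sts = boto3.client("sts")
print("Running as:", sts.get_caller_identity()["Arn"])

# List the account's S3 buckets with the same permissions you have in the console.
s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```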
And just like any other regular Shell environment, you can install
whichever additional tool you need to get a job done.
One of my favorites would be the CDK. Unlike a regular
environment though, changes to the operating system
don’t persist between sessions so you don’t need to worry
about cleaning up anything if you accidentally
break the environment. However, you do have up to one
gig of persistent storage in your home directory
at no cost so you can keep your data
even if you forget to download it or copy it to S3
when you finish working. CloudShell continues that trend
in making the right tools available whenever you need them
from wherever you are. And we hope you find it
a valuable addition to your process of building software
and managing infrastructure. [music playing] Next to the major issues
this year of COVID-19, racial, social
and economic inequalities, battling climate change deserves
the highest priority of our attention. Building our infrastructure
systems and services in ways that minimize the impact
on future generations is our foremost concern. Sustainability is defined
as understanding the impact of materials
and energy we consume, measuring the lifecycle
of those materials, and applying best practices
to reduce the impact of their use. Running applications in a Cloud has been more sustainable
from day one. A typical enterprise data center will have fifteen to twenty
percent utilization at best. The multi-tenant nature
of the Cloud, which allows for efficient resource
sharing, combined with techniques such as the spot market,
enables much higher utilization. According to 451 Research, running
your applications in the Cloud reduces your carbon footprint
of your systems by eighty-eight percent. Sixty-one percent
of the carbon reduction can be attributed
to more efficient servers
and higher server utilization. Eleven percent to more efficient
data center facilities. And seventy percent
can be attributed to reduced electricity consumption
and renewable energy usage. This is why AWS is committed
to running our business in the most environmentally
friendly way possible and achieving hundred
percent renewable energy usage for our global infrastructure. In general, sustainability
in the Cloud is focused on the innovation that our data center
architects continue to deliver, for example, to reduce
energy usage and water usage. But I am frequently asked
by our customers, “But what can we do?”
I believe, therefore, that for the sustainability
of our digital systems, we need a shared
responsibility model as well. At AWS, we will continue
to innovate in building the most
sustainable infrastructure. But I think it’s time
we start thinking about what we can do
as application builders to get better control
over resource usage and start making sure that we
can meet the sustainability goals our businesses have set. And because AWS charges
by consumption, many times reducing
your cost also has the impact of reducing
your environmental impact. But I have really seen
several cases with our customers where they are willing to spend more
to meet their sustainability goals. The best solution architects
are now available to help our customers
with how to analyze software and hardware patterns
for sustainability and the sustainability of your
development and deployment processes. And not just on
the architecture side. They also provide guidance
with respect to business decisions
which could be made, such as
trading the quality of results to use fewer resources. Factors
such as availability, response time, or even accuracy
can all be adjusted
to use fewer resources, in yet another example
of how we as builders need to build controls
that enable the business to make those sorts of trade-offs. In addition to the climate pledge and broader sustainability
goals at Amazon, I’m excited that we’re
taking these steps to help our customers learn
how to be more sustainable. And one of the interesting
recent developments in server design that may help you meet
your sustainability goals is the introduction of the ARM
processor for data center compute. One of my colleagues, James Hamilton,
distinguished engineer at Amazon, once challenged me to count
the number of ARM processors running in my office. I believe I came to seventeen.
From printers to cameras. From home automation devices
to mobile phones. Everything was running ARM. Only my laptop
was running a non-ARM CPU. At AWS, we started developing ARM
based processors years ago, before running it for real compute
workloads was a mainstream idea. Initially, we focused
on non-customer facing technologies
such as our storage engines. But quickly we started
to realize that the tremendous performance
and price advantages could be had by our customers if we just would bring ARM
to the compute engines as well. We had already seen
that the popularity of ARM-based platforms
such as the Raspberry Pi ensured that many Linux distributions
and packages would support ARM. Now the recently released AWS
Graviton2-based instances make it easy to run ARM applications
on the cloud performantly. These processors
are ARMv8.2 compliant, have 64 physical cores,
and use 7 nanometer manufacturing with 30 billion transistors.
There are ARM variants of many instance families,
including our memory-optimized, compute-optimized, and general
purpose instances. We’ve found
that high level languages such as Node.js,
Python, Java, .NET Core, .NET 5, typically run without problems. Amazon ECS, Amazon EKS,
Amazon ECR, CodeBuild, Docker all have support
for Graviton2 instances. I see migrating to Graviton and ARM
as a long-term business investment. ARM processors have lower costs
and a better price-performance
ratio; for example, the Graviton2 has 20% lower cost
this significant of an improvement just by changing
to a different processor? Now, whether lifting
and shifting will work is in general dependent
on third party applications and related dependencies.
Even if your code will run on ARM, it doesn’t mean
that the Python library that you’re using
which may actually be a wrapper around some C code, will. To be successful, build pipelines
and container clusters need to be created
and managed separately for ARM-based
containers and binaries.
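If you want to test the waters, one low-friction option is simply to launch a Graviton2-based instance and run your build and test suite on it. Below is a minimal boto3 sketch; the AMI ID is a placeholder for any arm64 image, and networking defaults are assumed:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder AMI ID; substitute any arm64 image, for example Amazon Linux 2 for arm64.
ARM64_AMI = "ami-0123456789abcdef0"

# Launch a Graviton2-based, general purpose instance for an ARM trial run.
response = ec2.run_instances(
    ImageId=ARM64_AMI,
    InstanceType="m6g.large",
    MinCount=1,
    MaxCount=1,
)
print("Launched:", response["Instances"][0]["InstanceId"])
```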
ARM is definitely a platform with a significant innovative future ahead. And I advise every business
to investigate whether they are willing to make
the investment to migrate. Being a builder means
a lifetime of learning and it is one of the things
I truly love about being a developer. There’s always a new language,
a new library, an open-source project
or a new service that will make your head spin
and look at development differently. I believe we have the most
amazing jobs in the world. We have the opportunity
to create something new every day or to improve something small
that makes you smile – and proud. And to truly delight
your customers in ways that really puts
a smile on your face. That’s why a lot
of the hundred days of code in the early days of the pandemic
challenging you to learn something new.
I learned a new language – Rust. I had worked with teams
inside AWS that were using Rust
because of its safety properties, and it blew my mind
to be able to do system-level programming in a truly modern language. I will talk a bit more about
Rust later. Something that saw a tremendous
rise in activity was the Amazon Builders’ Library. It helps developers benefit
from 25 years of developing secure
and scalable systems at Amazon. Builders clearly took the opportunity
to dive deep on hard topics and educate themselves when otherwise
they may not have had time for it. AWS Academy provides
higher education institutions with a free, ready-to-teach cloud
computing curriculum. And one of the programs
I’m really proud of is AWS re/Start. It focuses on cloud skills
for unemployed and under-employed individuals
including military veterans, their families
and young people. We’re also working with other
organizations like the ReDi School who has helped refugees
learn coding skills that will either lead
to employment, more education or even creating
start-ups of their own. And by investing in programs
like this we are dedicated to educating
a diverse group of future builders who have lived
through this pandemic and will be building better
because of it. [music playing] Almost every year at re:Invent
I’ve talked about how over the years we’ve implemented
fault tolerance at Amazon and based
on those experiences, what best practice advice
we can give you to meet the fault tolerance
requirements of your application. But today I would like
to take a step up and talk about how fault tolerance
is only one of the means to achieve the more encompassing
concept of dependability. Dependability is when the delivery
of a service can be trusted with the avoidance of unacceptable
frequent or severe failures. Now what is unacceptable
is a business decision, not a technology one. If certain failures or rate
of failures are acceptable, then the system
is still dependable. Dependability has many properties –
availability, reliability, safety, confidentiality, integrity,
maintainability and observability. If you think about
how to achieve dependability, there are several means to it.
It’s not just fault tolerance,
are fault prevention, fault removal and fault forecasting.
We often talk about other properties related to dependability as well.
Concepts like robustness. Can your system be dependable
in the presence of external faults? Is the system survivable? Can it handle active
in progress faults? And resilience.
Can the system deal with changes? And resilience
is a unique property because it deals with three
different types of changes. There are foreseen changes like when you launch a new version
of your system or foreseeable changes like when you move
to a new hardware platform. And then there are
unforeseen changes like when the number
of service requests explodes beyond
what you had planned for. Now in years past,
I’ve talked quite a bit about how to build evolvable software
and how that deals with changes. I remember I talked to you
about how we built S3. We knew that the system
under the covers would need to change either with new orders
of magnitude of scale or with the introduction
of new features. Evolvability deals with all
the properties of dependability. A less commonly known approach
to building dependable systems is the concept of diversity which is basically using
two or more components that have the same functionality, but have been built
in isolation from each other. For example, here at Sugar City,
there are two generators. We’re going to take a look
at them in a minute. Diversity means that each generator
would have been sourced from different manufacturers,
but again this is a business decision based on whether failures would be
unacceptably frequent and/or severe. One of the interesting things
I’ve learned about this factory is that it only processed sugar
about three months out of the year. This was to coincide
with the sugar beet harvest. They prepared all year
for the three months that farmers would bring the sugar
beets here to be processed. It was so busy that workers
had to bring in mattresses and sleep in the corners to have enough time
to get everything done even though it was regularly
40 degrees Celsius which is well over
a hundred degrees Fahrenheit. Downtime during this three months
would be catastrophic. Whoa, there’s quite an echo here. Although you can also
still hear a hum. Even this derelict factory
still feels it’s alive. So electricity was crucial
for the factory. In this turbine hall, the two generators were used
to produce electricity that powered the entire factory.
Unfortunately in those days and today and in the future,
faults are a given. As are changes in the system
or impacting external factors. In today’s world
of digital technologies, much of the thinking
that built dependable systems and processes in those days
is still applicable today. [music playing] Just like a sugar factory,
every online retailer knows that just a few months
out of the year can be important to have
a thriving business. A profitable holiday season
and the launch of a new product are the times that are most critical
to the success of a company. That’s why I’m excited to have Nicole
here from Lego to talk to us today. Nicole is an engineering manager working on Lego’s
direct shopper technology team. She has a great story
about the resilience challenges they overcame after
the application crashed during the launch
of a really cool Lego set. Thanks Werner. I’m very excited
to represent the direct shopper technology team
at the Lego Group here today and take you on
our serverless journey so far. At the Lego Group our mission
is to inspire and develop
the builders of tomorrow by becoming a global force
for innovating and establishing
learning through play. As an almost 90-year old company, we have consistently innovated
and reimagined our approach to achieving our mission
and technology has been a key enabler
for getting our message out there. We use a variety of services
across the company each fit for the purpose
that they were selected for. I’m going to talk about the journey
of the team behind Lego.com, specifically the pages
where you’re browsing for products, redeeming VIP rewards or completing
your order through checkout. We are the direct shopper technology
team with engineers in the UK, United States and Denmark. Our main challenges typically
to the world of e-commerce, the traffic patterns
are extremely spikey and with regular product
launches and sales events driving customers to the site
all at the same time. This diagram shows
our typical traffic patterns on one of our busier sales days and every year the number of visitors
to the site keeps growing. Now imagine trying to tackle
all of that spikiness and year on year growth
with an on-premise monolith tied to backend systems
with limited scale. Back in 2017 we had
a highly anticipated sales event for the Millennium Falcon set,
the biggest Lego set at the time and an extremely popular
product line. On the launch day we experienced
a huge spike in traffic that resulted in our backend
services being overwhelmed and all our customers could see
was the maintenance page. The service that failed the hardest
was a small piece of functionality that calculated sales tags. It made a call back to our
on-premise tax calculation system which very quickly
reached its limits. At that point we knew that
we were on a trajectory for growth that could no longer be sustained
with an on-premise system. There were three key drivers
that took us to the cloud. Renting the commodities
meant that, instead of maintaining infrastructure that was not a differentiating factor
for the Lego Group, we could
instead focus that energy on building
awesome shopper experiences and you saw the profiles
that I showed before. Having that flexibility to scale
to support a very spiky demand profile and having the exact capacity we need
when we needed it was critical. And finally having
a composable architecture down to the most granular levels
enabled us to have speed to market and also flexibility to keep
innovating and pushing boundaries. Here is a high-level view
of where we are today. We made the conscious decision
to extract and focus
on our business logic and decompose that across several
layers of serverless services. We batched them by carefully
selected third party vendors who provide specialized services
like payments providers and content management systems. Each of our layers is designed
to scale automatically and independently to support
an ever-changing traffic profile and this design also allows us
to handle multiple squads operating across all parts
of the site concurrently and you’ll see why
that’s important in a moment. Our journey to the cloud started with migrating a single
user facing service, the one to calculate sales tax and three backend
processing services back in 2018 just to show that serverless
could work for us. Ten months later we then
matched our existing capabilities with a completely
serverless platform that immediately started handling
the same level of traffic and transactions
as our existing platform. We then immediately
started exceeding those rates of transactions
and traffic and setting new records
every few months. We started off this year
with an ambitious roadmap, a growing team and a platform
that was only a couple of months old and then the question came, could we deliver on
that ambitious roadmap with twice the number of engineers
all onboarded remotely and keep the platform stable all while handling high season
levels of traffic? The answer was yes. And not only have we doubled
the number of services in production, we’ve managed to do so while handling
increasingly busy sales periods. To put some numbers to the growth
we experienced in the past year and a half, we now have three times
the number of engineers in the team and we’ve since launched
another 36 serverless services, pushing the number of lambda
functions in production over 260. The growing team meant that
we had to distribute many tasks previously held centrally
by an infrastructure team, so automation has been key
to supporting the ever-growing number of squads
and application engineers to get their services
and features into production, not only at their own pace,
but safely. We’ve moved to a self-service model
where possible, for example, creating a script
that creates standard integration and deployment pipelines
for any new service. The ultimate goal is to develop
our application engineers into DevOps engineers who own and operate
their services and production. One of our first steps
towards that goal was to introduce a standard
where all serverless services were to implement canary deployments
using AWS CodeDeploy and this gives the automatic
rollback when necessary. And that leads me on
to serverless operations. We have focused on observability
and our fan-out operations model where our on-call team monitors
some key high-level metrics centrally and then each service
has been categorized based on how critical it is
to our key user journeys. We then have a default
set of alerts that are to be implemented
on each service and tuned to the profile of that
service by the squad that owns it. This is giving
our engineering team a starting point of how to monitor
their services in production and not only detect, but react to issues that are
happening in this space quickly. The growing team means we can’t
have tacit standards anymore. The team is made up of engineers
at all stages of their careers and with differing levels
of experience with the technologies
we’re using. I’ve mentioned already
a couple of standards that we’ve set and they define
what a good service means to us. They define the hallmarks
of a good service that not only uses the latest
coding practices and patterns
that we’ve developed over time, but also those guidelines
for safer deployment and monitoring of services, so that
they’re easier to own and operate. We have started comparing
our standards with the AWS Well-Architected Framework and
specifically the AWS Serverless Lens and this has shown
that they mainly fall in the operational
excellence pillar with a little bit in cost
optimization and a bit in security. And so looking to what
we’re focusing on next, we want to define the standards and the remaining reliability
and performance pillars and then we can start looking at
making the standard more visible. We want to show our engineers
the services that they own, what state they’re in,
all in one place. And then we can start adding on
things like maybe a leaderboard. This should add in that
competitive element, so that owning a service becomes
a point of pride within the team. And then we can start playing with the next level
of serverless operations. We have chaos engineering
on our roadmap and this should enable the team
to really break apart their services, test out the failure cases
for an entire third party and that means that we can craft
our shopper experiences to still be awesome even when part of
the platform disappears for a bit. It’s hard to imagine
that we only started our journey to the cloud back in 2017, but time flies when you’re having fun
and we’ve learnt a lot along the way. The name Lego is an abbreviation
of two Danish words – leg and godt, meaning ‘play well’ and when play is at the core
of everything you do, the learnings
and opportunities are endless. Thank you. Thanks Nicole. What is great
about the Lego story is that in addition to providing
a more stable platform, migrating from a monolithic design
to serverless also freed up engineer time
to work on new features. Like Lego, Zoom is
another great example of how a company used
smart application design to scale quickly as they were flooded
with traffic during the pandemic. Just like Brainly they had
well-understood traffic patterns and even relied
on manual scaling on AWS because their traffic
was so predictable. But after the pandemic hit, they quickly experienced
a 30x growth. One of the really smart decisions
they made before the pandemic was to split their application
into different services. They broke the part
of the applications that handled the management
of meetings like authentication and scheduling
and starting a meeting from the part of the application
that streams video. Meeting management
is a pretty light task and isn’t dependent on latency. On the other hand,
streaming video requires a significant amount
of bandwidth and compute power and it needs
to be close to the users. Before the pandemic,
the management component had sometimes more traffic
than the streaming part had. After the pandemic hit,
they relied on the scale of AWS and our ability to keep up with
capacity needs to add media servers to handle the video and streaming
for people around the world. Over the last several months
there were times when their engineers were adding 5,000-6,000
servers a night on AWS and just like Brainly, Zoom
also saw customer behavior change because of the pandemic. When they started out, Zoom primarily
provided services to business users. Today they see significant usage by
individuals in education institutions. And this affected the design
of their application and how they approached scaling. Take education, for example. Before the pandemic,
only a few teachers at a school might be using Zoom at the same time. Today, in a district
that is fully remote, every teacher is likely to be
running a meeting in parallel with other teachers. And these type of changes meant that
Zoom was regularly hitting new bottlenecks and challenges
that they hadn’t seen before. But because of their architecture,
they were able to quickly address them and even add many new features
while handling extreme scaling. By following best practices
and decomposing their system into separate components
that each had unique scaling, performance
and security requirements, Zoom designed a system that can adapt
to these unexpected changes. And they’ve been able to succeed
and provide a service that is essential to their customers. If you follow cloud best practices
around microservices, load balancing,
multiple availability zones, you’re likely in great shape from an
application architecture perspective. But what about the application code
that actually runs within the system. When it comes to dependability,
a step that is often missed is the logic of the application
code itself. Now almost all developers today
write unit tests and provide code reviews to look
for errors as they’re developing. But at AWS we take this a step
further and use tools based on mathematical logic to prove
that our code does only what we want it to do
and nothing more. This is hard.
It’s time consuming and something that most businesses
don’t have the ability to invest in. For example, one of our teams
has recently published a paper that outlines how we prove
the correctness of the KMS protocol. When we say that we have the most
secure, global infrastructure, this is rooted
in the continuous work we do to demonstrate
this claim with mathematical rigor. We do this to answer
the most important question – are we really
protecting our customers? I want to talk to you
about this more, but before I do, I want to tell you
how I think about reasoning. Reasoning is the actual thinking
about something in a logical, sensible way. As engineers we reason
about our problems all the time. We design systems to solve complex
problems with many variables. Again, look at this factory. To produce sugar
there’s a multi-stage process. Farmers bring beets to the factory
where they then are washed with high-pressure cannons.
Then they’re sliced into small strips before being combined with hot water
during a process called diffusion. The diffusers create a sort of juice
that’s put in an evaporator and spun in a centrifuge
to separate the sugar from the water. After all this happens the sugar
is packaged and shipped. Now this process
is not all that complicated. It has a distinct start and end and the steps in between have to be
followed exactly in that order. If you put it in a flowchart
it would be pretty simple. And when Jean Baptiste invested
this process in the early 1800s he had to reason out each stage
based on the input it had and the output it created and what
was required to get to the end stage. This is the same process you’re using
when you’re developing software. You create functions,
each with an input and an output and they pass information
to each other. And when your program is small
it’s pretty simple to see how data is going to move
from one function to another. You could easily create a flowchart
to describe all the possible paths your application logic will take. But in reality, most applications
aren’t that simple, or if they are,
they don’t stay that way for long. In most applications there are hundreds
to thousands of unique functions, each taking a different
set of inputs and creating an output
that is passed onto another function and instead of always being
followed in a specific order, functions can call other functions
depending on what they’re asked to do or the specifics of the data
that was passed to them. Now imagine the complexity
of the flowchart that you would need to build
to represent all the paths
that your application could take. There’s no way that a human
can possibly know all of this and that’s where the field
of automatic reasoning comes in. Automatic reasoning can also be
called computer-aided development. Automatic reasoning
comes in many forms and it is simply the process
of using computers to prove anything that can be described
with complex logic. In fact, this is one of the first
major use cases for early scientific computers. Some of the first programs
ever written were attempting to find proofs
in mathematical logic. One form of automatic reasoning
is formal verification. Formal verification involves
converting the logic of an application
into a specification and then mathematically proving
that the spec does exactly what it is
supposed to do – no matter what. It’s very expensive and difficult
and time-consuming. It only makes sense for applications where the cost of failure
is truly devastating. Organizations like NASA
use formal verification to ensure that space missions that
cost hundreds of millions of dollars and risk people’s lives,
don’t fail because of software bugs. Intel invested heavily
in formal verification after the floating point bug in early
processors caused a massive recall. Today, all the leading
chip manufacturers including AWS develop processors
using formal verification tools. Another area where we’re
using formal verification is in one of my favorite
services – Amazon S3. When we pioneered S3 in 2006
we built it because we knew customers
wanted storage for the internet. Our top design priorities in 2006
were elasticity, reliability, durability
and performance because that was what mattered
to our customers. Given that nobody had ever built
anything of this scale before, we relied on fundamental
distributed systems concepts to make sure
we could meet those guarantees. In fact I think S3 is the only
product we ever launched with a list of distributed systems
concepts in the press release. But with the knowledge that failures
of components big and small are a given,
remember everything fails, all the time, we need to balance
the trade-off between availability and consistency based on the now
infamous CAP theorem. Given the applications
we initially targeted, we decided that availability,
always being able to access your object, had a higher priority than getting
the latest update to your object. And there could be
a small window of time between when you updated
your object and when a read request would give you
the updated data. The data was stored
reliably and durably, but the metadata
was eventually consistent. As S3 has evolved into a more
generalized storage engine, it is now used for many real-time
processing applications like for example,
analytics on top of a data lake. The majority of S3 access now
is machine to machine instead of human to machine. Now it turns out that machines
have a much harder time reasoning about eventual consistency
than I ever anticipated. And there is the law
of preservation of complexity. If you need strong consistency on top
of an eventual consistent system, you need to do the work. Over time, we learnt that quite
a few of our customers were building that complexity
into their applications to achieve strong consistency. We knew that if we did
that work for them, life would be a lot easier
for our customers. To achieve that, we had to introduce
new protocols for cache coherence and distributed systems communication
in the request path for S3’s metadata storage. This allowed us to ensure
that every S3 index storage service always has the latest metadata and
was in sync with the cache servers. And we had to introduce all of this
while continuing to run S3 without impact to availability
or performance. S3 had to evolve
into the next generation without our customers
being impacted in any way. So while we did traditional
performance, integration, load, and unit testing,
we knew there were some edge cases that this would not
be sufficient for. For example, when read operations
race with write operations at very high concurrency
on a versioned item, the system can enter
edge states leading to eventual consistency again. And this is where
formal verification comes in. We relied heavily on it and explored
millions of combinations of states
through formal mathematical models. We would never have been able
to prove to ourselves that we covered all edge cases
without formal verification. And because of all that we are
now able to confidently announce that Amazon S3 delivers
strong read-after-write consistency for all storage requests, without changes to performance
or availability, without sacrificing
regional isolation for applications, and at no additional cost.
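In practice, that means a pattern like the one below, a minimal sketch with a hypothetical bucket name, now behaves the way you would intuitively expect, with no retry loop to wait for the metadata to catch up:

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"          # hypothetical bucket name
key = "orders/latest.json"

# Write a new version of the object.
s3.put_object(Bucket=bucket, Key=key, Body=b'{"status": "shipped"}')

# With strong read-after-write consistency, the very next GET (or LIST)
# reflects that write; there is no consistency window to program around.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
assert body == b'{"status": "shipped"}'
```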
The field of automated reasoning is having an impact on AWS services in many other ways. I’ve always talked about security and you’ve heard me say
it every year, “Encrypt everything.” Security is the most
important thing that we can be focused on
for our customers and I think it’s something
that to be on top of the mind of every customer themselves as well,
no matter what they are building. When you are developing
something new, it is possible to build a system
that unintentionally allows access you didn’t intend.
with this?” And we were making
significant investments into security-focused tools
that use automated reasoning to help detect security gaps. One of these is the recently
announced VPC Reachability Analyzer. In AWS it’s amazingly simple
to set up a network. It is truly something
that impresses me to this day. And if you need to isolate
different parts of your application
from other parts, or from the internet,
I can easily do that. I can create a network ACL,
a security group, or a subnet to isolate
these parts from the network. But just like my application logic, the logic of a network
increases in complexity the more work you do on it. Most workloads on AWS of any
complexity have many subnets, have many network interfaces,
have many security groups, and often all these things
reference each other. If you’re trying to trace the path
from one end to the other, it quickly looks like
the flowchart logic problem
application earlier. And it’s why we built the Amazon
VPC Reachability Analyzer. The Reachability Analyzer uses
the same automated reasoning processes to look
at your AWS configuration without sending a single packet. It can tell if your system
is doing what you want it to do. You can use it to troubleshoot why you can’t reach
one server from another. With the Reachability Analyzer
you only need to tell us what rules you want to validate and we automatically build
the logic to test and verify it.
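For example, a check like the following could be wired up with a few boto3 calls. This is only a sketch, with hypothetical instance IDs; the path could just as well start or end at an internet gateway or a network interface:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical endpoints: can we reach instance B from instance A on port 443?
path = ec2.create_network_insights_path(
    Source="i-0aaaaaaaaaaaaaaaa",
    Destination="i-0bbbbbbbbbbbbbbbb",
    Protocol="tcp",
    DestinationPort=443,
)

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
)
# The analysis reports whether the path is reachable and, if not, which
# security group, route table, or network ACL blocks it.
print(analysis["NetworkInsightsAnalysis"]["Status"])
```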
This is just one of the many tools we offer to customers that are powered
by automated reasoning. And the more complex
your applications grow, the more important tools
like this are becoming. What all these tools
have in common though is that we make it easy
to define rules and test them because they are based
on configuration. The S3 Block
Public Access feature allows you to configure
bucket permissions as long as they don’t provide
unrestricted access to the internet.
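Turning it on for a bucket is a single, declarative call; a minimal sketch with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; this enables all four block-public-access settings.
s3.put_public_access_block(
    Bucket="example-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```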
AWS IAM Access Analyzer validates
that your IAM policies only allow what you intended. With these automated reasoning
powered tools, you tell us both what the rules or policies are and
what you want to verify. All of this is powered
by technology developed by the AWS Automated Reasoning Group,
a technology called ‘Zelkova.’ Zelkova translates policy
into precise mathematical language and then uses automated
reasoning tools to check
their properties. These tools include
automated reasoners called ‘Satisfiability
Modulo Theories,’ SMT, which automatically prove
or disprove formulas over constant strings,
regular expressions, dates, and IP addresses. Zelkova makes broad statements
about all resource requests because it’s based on mathematics
and proofs instead of heuristics,
patterns, or simulation. Next to the services we already
discussed, it is used in AWS Config, AWS
Trusted Advisor, Amazon Macie, Amazon GuardDuty, AWS
IoT Device Defender, and more. One of the most important
benefits is that users can leverage these tools to help
identify gaps in the design phase before any data is exposed. Furthermore,
this technology is automated. We are removing manual
human intervention to help provide
greater scalability. And finally, it is provable.
By converting the problem into a logic-based
mathematical equation it can be proven
with mathematical certainty. Now I would like to change it up
for a minute. I will talk more about how and why we build the things
the way we do at AWS. At Amazon we have a long history
of allowing our engineers to pick the tools that work best
for the problem at hand. We have a number of centralized
supported languages and tools, but we do not force them on anyone. For example,
when we needed to do fast UI prototyping for Amazon Fresh, Ruby on Rails was the best tool
for that even when we knew it wouldn’t be the language
used to scale it up. Marc Brooker gave a great talk
at re:Invent called ‘Creating
Technology Standards at Amazon,’ where he talks
about the trade-offs between using what’s familiar
and safe, Java or C++, and what may be new and risky. He makes an important point here
that’s worth paying attention to. The time to develop software
is relatively small compared to the time
your software will be running and needs to be maintained. As such, when the team
decided to develop the next generation
of a load balancer in Rust, it was not about what was
easy and comfortable, which would have been Java or C++, but what would be better for
the long-term success of the system. Next we are having
great safety features and no garbage collector,
the main driver for choosing Rust was that it’s a great language
for using automated reasoning and formal verification
of the application logic. We continue to improve
our reasoning tools and because of that the team preferred
to take the approach of learning a new language
to build the foundation with the support
of automated reasoning. This allowed them the benefit
to prove that our code was doing
exactly what they intended. And because of the ability
to enable verification, Rust has become
extremely popular in AWS and more and more of our systems
are written in Rust, especially those systems required to build the dependable, secure services for our customers. Another area of dependability deals with testability. In essence, what if there are faults in your input data? How are you dealing with faults in general? The fault-handling parts of your code are probably
not frequently exercised. So, how can you be sure
that your application will continue to be dependable, regardless of what you throw at it? Similar to automated reasoning, fault injection allows you to discover these unknown unknowns. One way this is done with application
code is fuzz testing, or fuzzing. Fuzzing works by sending bad data
to your application in a way that tries
to make it crash. Imagine a form on your website, an API that accepts data, or even a mobile phone application:
what would happen if your application was sent malformed data
that causes errors? Or what if this happened maliciously? This is exactly what happened
in the case of Heartbleed a few years ago. The OpenSSL library was making the assumption that during a heartbeat request
the server should send
exactly what was requested. This bug was originally
discovered by fuzz testing because the fuzzer
made a malformed request and the offending function
happily returned it. And because of things like this, we are using fuzzing extensively at AWS. We use it to test both our APIs and our code itself.
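As a rough sketch of the idea (not the fuzzers we use internally), here is what a naive mutation-based fuzz loop can look like; the parse_request function is a made-up stand-in for the code under test.

```python
# A naive mutation-based fuzzing sketch; real fuzzers are coverage-guided,
# but the core loop is the same. parse_request() is a made-up stand-in for
# the code under test.
import random

def parse_request(data: bytes) -> None:
    header, _, body = data.partition(b"\n")
    length = int(header)           # malformed headers raise ValueError (handled)
    assert len(body) == length     # a lying length is the kind of bug fuzzing surfaces

def mutate(seed: bytes) -> bytes:
    data = bytearray(seed)
    for _ in range(random.randint(1, 8)):
        data[random.randrange(len(data))] = random.randrange(256)  # flip random bytes
    return bytes(data)

seed = b"5\nhello"
for i in range(10_000):
    fuzzed = mutate(seed)
    try:
        parse_request(fuzzed)
    except ValueError:
        pass                        # expected, handled error path
    except Exception as exc:        # anything else is a finding
        print(f"crash #{i}: {fuzzed!r} -> {exc}")
```

The fuzzer does not need to understand the protocol; it just needs to be relentless about feeding the code input it was never designed to expect.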
But even if your code is perfectly written, hardware fails, packets get dropped,
and unexpected traffic spikes happen. Even well-built
cloud applications today can depend on dozens of services and components, all working together. And failures can happen
at every level. To understand how your
application will behave when these events happen, the best form of testing
is chaos engineering. Now this isn’t a new idea and has
been around for quite a while. Netflix popularized
this concept many years ago and the tool they built
was called ‘Chaos Monkey.’ The goal of chaos engineering
is to understand how your application
responds to issues by injecting failures
into your infrastructure, usually running against production systems. Chaos experiments can include generating a baseline traffic load against the system, adding latency into all database calls, and then validating timeouts and retries.
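To make the shape of such an experiment concrete, here is a minimal sketch in Python; the query_database function, the injected 150 milliseconds, and the 200-millisecond budget are assumptions for illustration, not the FIS service itself.

```python
# A minimal chaos-experiment sketch: inject latency into database calls and
# check the hypothesis that callers still meet their timeout. query_database,
# the 150 ms injected delay, and the 200 ms budget are assumptions.
import random
import time

INJECTED_LATENCY = 0.150     # the fault: +150 ms on every database call
TIMEOUT_BUDGET = 0.200       # the hypothesis: calls still finish within 200 ms

def query_database() -> str:
    time.sleep(random.uniform(0.010, 0.040))    # normal backend latency
    return "row"

def chaotic_query() -> str:
    time.sleep(INJECTED_LATENCY)                # injected fault
    return query_database()

# Baseline load, fault injection, then validate the hypothesis.
slow_calls = 0
for _ in range(100):
    start = time.monotonic()
    chaotic_query()
    if time.monotonic() - start > TIMEOUT_BUDGET:
        slow_calls += 1

print(f"hypothesis {'held' if slow_calls == 0 else 'failed'}: "
      f"{slow_calls} calls over budget")
```

If the hypothesis fails, that is exactly the signal you want: it points at the timeouts, retries, or capacity you need to fix before a real event does it for you.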
And unlike automated reasoning, we believe that chaos engineering is for everyone, not just shops running
at Amazon or Netflix scale, and that’s why, today,
I am excited to preannounce a new service built to simplify
the process of running chaos
experiments in the cloud. AWS Fault Injection Simulator
is a fully managed chaos engineering service
that makes it easy to discover vulnerable parts
in your applications. AWS FIS helps developers easily
set up and run controlled chaos engineering experiments across
a range of AWS services. FIS gives you the ability
to inject faults such as latency, or failure of underlying compute, networking, databases, and more. That includes control-plane-level faults such as API throttling and server errors that weren’t previously possible for customers to inject in the cloud. And FIS makes it easy
to run safe experiments. We built it to follow the typical chaos experiment workflow, where you understand your steady state, set a hypothesis, inject faults, and monitor your application. When the experiment is over, FIS will tell you if your
hypothesis was confirmed and you can use the data
collected by CloudWatch to decide where you need
to make improvements. FIS removes the barriers
to adopting chaos engineering. I see a lot of benefits
for incorporating chaos engineering into your business. From running game days
or incorporating chaos experiments
into your CI/CD workflows, most people think about chaos
engineering for resilience, and that’s true, but it’s also about the performance improvements you can make. It is about the blind spots that you
can catch in your monitoring. And perhaps the most underrated
is the experience your teams will get learning how to respond
to infrequent but critical events. Mean time to resolution is not just about
your architecture and automation, but it is also about
the operational muscles that you build
and exercise over time. There is no better way to test
your system than chaos engineering. And with FIS you won’t need
to be an expert to incorporate it
into your organization. AWS Fault Injection Simulator
will be available in early 2021 and will include the ability
to run fault actions against services such as EC2,
RDS, ECS, EKS and more. Out of all the things
we’re announcing this year, this is the one
that I am most excited about. By offering this as a service, I believe that we are going to have
a massive positive impact on building more robust,
more durable, and more dependable systems
in the cloud. [music playing] When we talked about
the history of dependability, I mentioned systems theory. A related field of that
is systems control theory. It has been crucial in building
many dependable industrial and other systems. The most important pioneer
in this field was Professor Rudolf Kálmán. In 1960 he defined concepts
such as whether a system was controllable and observable, or actually unobservable. To be observable is something we know
as software engineers all too well. How can you infer the internal state
of a system from its outputs? These can be functional outputs
like voltage and amperage of the turbines,
or nonfunctional output like the turbine temperature sensors
or rotation speed measurements. And this is what we try to achieve
with the observability property of a dependable system. How can we infer
the internal state of our digital systems
from its outputs? This can include
both functional replies from API
calls for example, or requests to other parts
of the system, and nonfunctional information
that we collect through other means. We’ve been monitoring systems
for years. And I remember the earliest
monitoring tools came with Unix, VMStat, Syslog, Netstat,
et cetera. And it wasn’t until the late 90s
that tools like Nagios and RRDtool became popular. Monitoring is for operators.
Just like this factory, an operator would stand
in front of this dashboard to watch for a gauge getting close to the end of its range, or a light starting to flash
whenever there was a problem. Monitoring means that you
already know what is important. You think you have all
the data you need, and you are just watching it and get
an alert when it goes out of spec. And this was possible
for a factory like this because, relatively speaking, there wasn’t a lot
they could monitor. The generators were
the most important thing, which is why all
the dashboards were there. The equipment that was
processing the beets
who worked with it day in and day out and could tell when it wasn’t cutting
as well as it had before, or if it started to hum
in a different way. A factory like this is a living,
breathing thing, and when your workers
spend literally all day here, they know when something is not right and when it is going
to break before it does. It’s a bit like
the experienced car mechanic that just needs to listen
to your engine to tell you the crankshaft
is about to go out. They have a pair of magnificent sensors, of course, and a lot of experience in how to interpret the output and thus infer its internal state. However, if there is a car
he hasn’t seen before, or a problem sound
that he has never heard, his experience
doesn’t help that much. But in our world what happens
when everything is automated, and you’re not working with the same
equipment day in and day out, or when your ears and eyes
are not sufficient? Classical monitoring deals
with two questions. What is broken
and why is it broken? Monitoring uses a predefined
set of metrics and logs to determine
known failures. In general with monitoring
and alarming, you can’t predict
when things will fail. You can only take action
when they do. And it’s why we used
to call people who monitored these systems
‘operators,’ and not engineers. The generators here at Sugar City
are extremely complex. It is unlikely that an operator
of the dashboard knew how to repair it
when something went wrong. But they did know when an alarm sounded or a gauge moved to where it wasn’t supposed to go. And systems continue
to increase in complexity. It is impossible to put every
important metric for that system on a single dashboard
that the user watches. Think about everything that
goes into a modern application. There are metrics for the servers, containers, and functions
that you are managing. Your application has counters
and logs for all the work it’s doing. You may have anywhere from thousands
to millions of customers, all of whom have data
about what they are doing and how they are interacting
with your application. It is impossible to put all of this
on a dashboard that a human watches, or to define alerts
for each of these metrics to tell you
when they’re going out of spec. At Amazon we’ve been
on a 25-year journey to improve the processes
of managing our systems. And we’ve long left behind the notion
that just monitoring was sufficient to manage our systems. We’ve embarked on a holistic
approach to operations from collecting
massive amounts of data and logs, to how we analyze them
to how we solve and talk about problems
when they do happen, and this is what observability
is all about. How can we make sure
we have the data, the tools, and the mechanisms, to quickly resolve problems
in a fundamental way? How can we without
reaching into the system infer internal state
from the data that we have? At Amazon, our most important drivers
have always been customer centric. Find and resolve problems
before they impact customers. Understand the impact
on your customers where you couldn’t prevent it
and fix the problem so that it never happens again. [music playing] Observability centers on three
enabling technologies, metrics, logging, and tracing. All three are important
but serve different purposes, but I do think logging is the most
fundamental one to get right. Back at the first re:Invent
I made a bold statement. I told you to “Log everything”
and I meant it. Log everything.
Logs are the source of truth for what’s going on at any given
moment inside your infrastructure. At Amazon, every service
from the hypervisor to the network gear generates logs
that are indexed, compressed, and stored durably. We have gone through a few
different log collection systems at Amazon over the years. In the early days we were
mounting NFS servers and writing directly to logs
across the network. And we were digging through
those logs with bash, sed, cut, and awk. It was painful and time consuming.
It was painful and time consuming. But it was the best option
at the time. And as we grew and matured
we built a custom client that stored logs in Amazon S3. And this made it easy to get logs off our servers and store them durably. To search for errors and to generate ad hoc metrics, we queried those logs with
a distributed Hadoop cluster. And this was significantly
more scalable than a mounted file system
parsed with Linux commands. But every system
grows in complexity. Our largest applications are made up
of hundreds of microservices, each potentially generating
terabytes of logs a day. Using Amazon EMR to query the logs
for those services could take hours. So, we took everything
we had learned so far and we created CloudWatch logs. With it, you have
a single pane of glass where you can view all logs
from every service, every container, and every application
that you monitor. You can search your logs
for specific error codes, filter them based
on different fields, or create alarms
for different conditions. It’s a powerful service
but it’s still just the first step. When building an end-to-end
observability system you need logs,
metrics, and tracing. If your logs are the source of truth, the metrics that you monitor and graph should come from your logs. One of the most challenging things for software developers, however, especially if they are disconnected from operations, is to create metrics in a monitoring system. And this is why,
when you ask an organization what metrics they are monitoring,
they will start talking about system-level metrics
like CPU utilization or service level metrics
like number of requests. Those sorts of metrics
are usually collected automatically
for you by AWS, or they are modules that are built
into your monitoring application. At Amazon we heard
from our developers that it was too difficult
to publish metrics. And there are lots of ways
to get metrics into a system but they often require configuration
or additional dependencies on libraries
that you need to keep up to date. We needed a system that made
creating metrics dead easy so we stopped
and asked ourselves, “What’s the easiest way
for a developer to write data?” Well, anyone who has written code
without a debugger knows how easy writing
to STDOUT is. And you don’t need
any libraries. You don’t need any dependencies
and there’s no configuration. We let our developers define, graph, and track custom data by just building a string and writing it to STDOUT. And because we are already logging to STDOUT, we can generate JSON and have it forwarded to CloudWatch Logs.
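As a minimal sketch of that pattern, here is what it can look like; the field names are illustrative, not the exact CloudWatch Embedded Metric Format schema. The developer only has to build a JSON string and print it.

```python
# A minimal sketch of publishing a metric by writing structured JSON to
# STDOUT. The field names are illustrative, not the exact CloudWatch
# Embedded Metric Format schema; the point is that one log line can carry
# both the metric value and its high-cardinality context.
import json
import sys
import time

def emit_metric(name: str, value: float, **dimensions) -> None:
    record = {
        "timestamp": int(time.time() * 1000),
        "metric": name,
        "value": value,
        **dimensions,                     # e.g. host, customer_id, request_id
    }
    print(json.dumps(record), file=sys.stdout, flush=True)

# No libraries to manage, no dependencies, no configuration:
emit_metric("ResponseLatencyMs", 42.7, host="i-0123456789", customer_id="c-98765")
```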
We are doing this through logs instead of creating a new metric, because this typically ends up becoming high-cardinality data. Now, one of the challenges
with allowing developers to create arbitrary metrics
at Amazon scale is the level of cardinality
it creates. Now, cardinality sounds
like a complex topic but it's actually
relatively simple. The higher the total number of unique entries in your data, the higher the cardinality. Now let me show you an example. If you’re familiar with time series
databases this might sound familiar. Each event has one or more
dimensions associated with it. Suppose I have a CloudWatch metric
called ‘Disc Free Bytes,’ whose dimensions are Amazon
EC2 instance ID and the OS mount point
or a drive letter. This has a cardinality of two, because there are two unique metrics, one for each drive. If you have 100 of those instances
each with two drives then the cardinality of this is 200. Now suppose you have a CloudWatch
metric called ‘Response Latency,’ whose dimensions are the server’s
instance ID and the customer ID. And suppose that I have a hundred
servers and one million customers. The cardinality of that metric
will be 100 million. Creating 100 million metrics
in CloudWatch will be rather expensive, inefficient,
and difficult to analyze. Now suppose we have
thousands of hosts, millions of accounts
and billions of records, the level of cardinality in a data
set like this makes it impossible to use standard metrics
and tools to investigate problems and to understand
the impact of events. Graphing individual data points
also loses the context of a metric. Imagine a request log that measured
which server took the request, which account made it,
and how long it took. If you graphed requests, account,
and time individually, there’s no way
to cross reference them. All of the details and relationships
that are in the logs are lost. If there is a fault there is
no way of telling which server it came from
which users were affected. You won’t be able to tell
if the faults happened for all requests
or just specific ones. But now, since we have all your data in one place, you can start to understand
with each other. And this is where
the real cool stuff happens. Let’s say you’re monitoring
your API and you have a graph like this. There’s probably
a few questions that you’re going to have
to answer right away. Which API is failing?
How many customers are impacted? Which customers
are impacted the most? Is it one bad host or multiple? Which shards, partitions, or buckets are having issues? Finding these answers
in high cardinality data is exactly why we built
Contributor Insights. With Contributor Insights, you can
analyze logs to dynamically extract and report on contributor data. Our goal is to make it easy
to take a graph like this and to show you
what’s contributing to the change. You can see metrics
about top-N contributors, the total number of unique
contributors and their usage. For example,
you can find bad hosts, identify the heaviest
network users, or find URLs that generate
the most errors? And of course, since your metrics are related, we can graph and report on metrics as they relate to each other.
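Conceptually, a rule boils down to counting contributors in your structured logs. Here is a small Python sketch of that idea; the fields mirror the example that follows and are not the actual rule syntax.

```python
# A conceptual sketch of what a Contributor Insights rule computes: scan
# structured request logs and report the top contributors to failures.
# The fields (api, host, customer_id, fault) mirror the example that
# follows; this is not the actual rule syntax.
import json
from collections import Counter

log_lines = [
    '{"api": "PutItem", "host": "host-a", "customer_id": "c-1", "fault": true}',
    '{"api": "GetItem", "host": "host-a", "customer_id": "c-2", "fault": true}',
    '{"api": "GetItem", "host": "host-b", "customer_id": "c-1", "fault": false}',
]

def top_contributors(lines, key, limit=5):
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("fault"):
            counts[record[key]] += 1
    return counts.most_common(limit)

print(top_contributors(log_lines, "api"))           # is one API failing?
print(top_contributors(log_lines, "host"))          # or is it one bad host?
print(top_contributors(log_lines, "customer_id"))   # which customers are hit?
```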
Now let’s take our API faults example and show you how Contributor Insights lets you find out
what the problem is. Let’s start with a series of logs
that just include a few fields: API, host, customer ID, and whether the request failed or not. Using CloudWatch alarms,
we are alerted when the number of API failures exceeds a value that we defined. So we open Contributor Insights and look at a few rules
that we have already defined. First, we can look at the rule that shows top five failures
by request time. So it doesn’t look like the problem
is specific to any API. Now what if we look at the failures by host? Ah, so host A has the failure. Let’s say we fix it and want to know
which customers were most impacted so we can contact them. You can build whatever rules
you need to understand which metric is contributing to
the behavior of your application. If you needed to reference these regularly you could generate
persistent graphs from these rules and add them to your dashboards and alarms. We are providing built-in rules
that you can use to analyze metrics from other AWS services. For example,
with DynamoDB rules, you can quickly determine which items or partition keys are the cause of any throttling that is happening in your database. With Contributor Insights, you have
the power to understand cause, impact and blast radius of faults
quickly and easily. At this re:Invent we are also announcing Amazon DevOps Guru. DevOps Guru takes the expertise that we’ve built from operating
the infrastructure and detecting failures
over the last 20 years, into a service. It uses machine learning
to identify potential issues inside of your account and also provide links
to recommend fixes. Available through the console, you can search
and visualize operational data across Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS
CloudFormation, and AWS X-Ray. For example, DevOps Guru can predict
when the amount of data in an RDS database will exceed
your provisioned storage. This would have been great
for Brainly during the early stages
of the pandemic, since this was one of the services they were scaling manually. It also recommends changes
to instance sizes and configurations, before an application
runs out of resources. Now, the last requirement
of an observable system is tracing. Most application requests
involve many services working together
to fulfill the request. Even simple 3-tier apps
have to pass through a load balancer, to servers and a database.
When working on a distributed system, tracing the source of intermittent
and undetermined failures can be extremely difficult. And this can be mitigated
by the use of a request ID that’s passed through
every layer of your stack. A trace ID is a unique identifier
that is stamped on the request by the Front Door service. From there,
the trace ID is propagated to every other service
the request touches. If you’ve ever logged a support ticket for S3, you have been asked to provide this. For S3, the X-AMZ-Request-ID is an ID that is passed to every
one of the services that make up S3. This allows us to correlate logs
between different backend services. It helps us understand exactly which
hosts and services served the request, and find any errors
associated with it. Tracing provides you
with the ability to understand exactly what happens
throughout your entire system. But let me repeat this: you must pass the trace ID between your tiers and put it in your logs.
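A minimal sketch of what that propagation can look like follows; the downstream service URL and the header name are assumptions, generalized from the S3 example above to your own stack.

```python
# A minimal sketch of trace-ID propagation. INVENTORY_URL and the header name
# are assumptions for illustration; the front door stamps a unique ID, every
# hop forwards it, and every log line includes it so logs can be correlated.
import json
import logging
import urllib.request
import uuid

TRACE_HEADER = "X-Request-Id"
INVENTORY_URL = "http://inventory.internal/check"   # hypothetical downstream service

def handle_front_door_request(incoming_headers: dict) -> None:
    # Reuse the caller's trace ID if present, otherwise stamp a new one.
    trace_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))
    logging.info(json.dumps({"trace_id": trace_id, "event": "request_received"}))

    # Propagate the same ID to every service this request touches.
    downstream = urllib.request.Request(INVENTORY_URL,
                                        headers={TRACE_HEADER: trace_id})
    urllib.request.urlopen(downstream)

    logging.info(json.dumps({"trace_id": trace_id, "event": "request_completed"}))
```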
We make extensive use of canaries to monitor our systems. The term ‘canaries’
actually comes from coal miners. They placed canaries in the mines
to know when there was a problem. And because they are smaller
and more sensitive, the canary would get sick and stop
singing before any of the miners did. For application monitoring,
your canary has the same purpose. Canaries perform the same actions
as your customers. They continuously verify
your customer experience even when you don’t have
any customer traffic. Canaries should run continuously,
and especially during deployments, and alarm any time
there is something unexpected. Even if your application
hasn’t failed, a canary can give you
an early warning by alerting when something took,
for example, longer than expected. And to build your own canaries
we have CloudWatch Synthetics. Canaries built with CloudWatch
Synthetics are NodeJS scripts that run as Lambda functions
in your account. They work for both HTTP and HTTPS
and can test for UI components by providing programmatic access
to a headless Google Chrome. One of the great things
about building canaries with CloudWatch Synthetics is that it integrates
with CloudWatch Service Lens, and AWS X-Ray, to provide a graphical
end-to-end view of the services
in your application.
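CloudWatch Synthetics canaries are NodeJS scripts, but the underlying idea is simple enough to sketch in a few lines of Python; the endpoint and the latency budget here are assumptions, and this is not the Synthetics runtime itself.

```python
# A generic canary sketch. CloudWatch Synthetics canaries are NodeJS scripts
# running as Lambda functions; this just shows the idea. The endpoint and the
# 500 ms latency budget are assumptions. Note that it emits a data point on
# success too, so that a missing data point is itself a signal.
import json
import sys
import time
import urllib.request

ENDPOINT = "https://example.com/health"    # assumed customer-facing endpoint
LATENCY_BUDGET = 0.5                       # seconds

def run_canary() -> None:
    failed = 0
    start = time.monotonic()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
            if response.status != 200:
                failed = 1
    except Exception:
        failed = 1
    elapsed = time.monotonic() - start

    if elapsed > LATENCY_BUDGET:
        failed = 1    # early warning: slower than expected counts as a failure

    print(json.dumps({"canary": "checkout-health", "failed": failed,
                      "latency_seconds": round(elapsed, 3)}), file=sys.stdout)

run_canary()
```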
Now, I would like to introduce Becky Weiss. She is a Senior Principal Engineer
who worked on many of our systems, including AWS IAM,
Amazon VPC and Lambda. Becky is one of the many experts
we have on monitoring. She has some insights into how
Amazon engineers think about data and why it’s unique. Thank you, Werner, and welcome
to all you re:Invent attendees. My name is Becky Weiss,
I am an engineer at AWS, or, as we like to call it here,
a ‘service owner’, and it’s my honor to give you
the AWS service owner’s perspective on this topic of monitoring
and observability. Now if you ever read anything
about how Amazon does business, you know that a principle
sitting in the middle of everything we do
is customer obsession. And our principle doesn’t stop
at our product roadmap, rather it extends into every aspect
of how we operate our services. And yes, maybe even especially
how we think about the eventuality of failure. Everything fails eventually,
you know that, and you design
in a way that expects it. And for me and for the many
talented operators that I get to work with at AWS, the name of the game is whether
we can see these signals of impending failure
before our customers actually experience failure.
So how do we do that? Well, we’ve got our logging system,
we’ve got our graphs, we’ve got our dashboards,
we’ve got great tools, we continually invest heavily
in those tools, we are never done. And you might be expecting me
to say next that it’s all automated, and maybe it’ll surprise
you to hear this from an AWS engineer,
but automation is necessary but not completely sufficient
for operational excellence. Don’t get me wrong, automation
does play a large role, and we rely on it heavily,
so what’s the rest of it? Mindset and experience. The good old-fashioned practice
of a human brain doing what the human brain does,
looking for patterns, and being curious
about what it sees. We train our operators
to be optimistic pessimists, they are optimistic
about the business and ever-expanding universe
of possibilities it creates for our customers,
but we’re pessimistic and curious when it comes
to operational health. For a couple of minutes, we are going
to have a little bit of fun. I am going to take you
into the human side of all of these things
that we measure, plot and chart. I am going to show you graphs
like the ones we see, take you through what we think
about them and the questions we ask. And to do that,
I am going to take you through three contrived scenarios. These aren’t real graphs,
I actually drew them by hand, but they’re similar
to the ones we see, and I am going to take you inside
our brains as we look at them. Here’s our first fake graph.
Well, we’re privileged here at AWS to get to work with a lot of graphs
shaped like this one, and I hope you are too
in your business, because this graph shows
more volume over time. That’s great,
that’s business success. Alright. Now I’m going to show you a
different graph for the same service. This graph is measuring
something where more of it, higher value, is worse. It could be latency,
it could be other things that negatively correlate
with your customer experience, like maybe how long it takes
to complete a workflow or process a chunk of data.
So what do I see here? Well, I see that as my business
grows over time, this metric’s growing too,
and that’s bad. And not only that, but it’s getting
more variable, and that’s bad too. So now I can’t tell you exactly
what’s going wrong here, it’s fake, I literally
can’t tell you that. But at AWS, based on
our experience at scale, I could almost guarantee you
from a graph like this that we’re approaching
some kind of constraint, limit, contended resource, maybe a new pattern
that we didn’t know about before. And we might even be starting
to bump our heads on it, even if, from
a customer’s perspective, things are still mostly fine. So we are always actively
looking for shapes like this on our dashboards, because sometimes these things
do take quite a bit of work to find and fix
and we make those investments. Okay, here’s our second graph. So we followed best practices,
our service runs canaries that automate end-to-end scenarios
that ensure they are working. Yours should too, we talked
a little bit before about CloudWatch Synthetics
as a great tool to help you do this. So, we have a graph of failures,
and it looks like we’ve failed a run, so you review it,
maybe you even know why, like maybe you knew that
a direct dependency of yours was having a problem
during that minute. But again, let’s look at it
through that mindset of that curious
optimist/pessimist. There is a lot of empty space
in this picture. So I know what happened
during that one minute, my canary failed is what happened.
But what about all that other time? I don’t know. It could be good news,
it could be bad news. No news could be good news,
because it didn’t fail. It could also be bad news because maybe the canary
couldn’t even run for some reason. I don’t know.
And we don’t like not knowing. So if we see a graph like that, and it’s something we monitor
and care about, we want this graph instead. We want that canary posting a
zero value when it runs and succeeds, because then we know that no news
is bad news, so we can take action. Okay, final example, I’ll show you
a little bit of good news. This one looks good, right, I am measuring my latencies
and percentiles. I’ve got 99th percentile here, I’ve set an alarm threshold
that’s meaningful to me and my customers,
mm, pretty good, yeah, great job. And you know what we would do
with a graph like this? We’d lower the threshold.
Why did we lower that threshold? Because, typically,
the original threshold, we would have put
some thought into, we would have set it well
within the bounds of what we’ve determined to be
an acceptable customer experience, so our customers were fine with
anything under that original line. But once again,
a lot can happen under that line. The graph could go like this. And even if our customers
wouldn’t have been affected, it’s a signal for us curious
pessimists that something changed. We want to know what that is,
maybe do something about it if there’s a new constraint
being quietly encountered. Okay, now all that might have just
looked like obvious commonsense. You know what it is. But the reality is,
when you go around and you look
at the various approaches to operations out there in the world,
there is a wide range. Everybody has got
their metrics around latency and service faults and great tools,
but what about your mindset? Are you measuring the things
your customers care about? And if it were to change,
would you get that signal? Are you looking at that data
like a curious optimistic pessimist? The operationally trained brain
is primed to ask these questions. I personally find
operating AWS services with that mindset to be one of
the most interesting and rewarding things I have
ever had the opportunity to do. And if you approach
operating your own systems with that same sense
of curiosity, I bet you’ll find the same
wherever you are, doing whatever it is
that you do in the cloud. Thank you very much. And have a wonderful re:Invent.
Happy operating. Thanks, Becky.
I think you’re absolutely right. At the end of the day, all these systems
are being monitored by humans. They are all just guessing what metrics they think they are going to need and where they should alarm, for example, when your metrics fall outside particular values for a certain time period. We talked a lot
about CloudWatch today, but when it comes to collecting and
visualizing modern operations data, there are a few open source tools
that have become very popular. As part of the Cloud Computing
Native Foundation, or actually I say that wrong,
Cloud Native Computing Foundation, Prometheus is a tool
that makes it easy for customers to monitor container
environments at scale. Grafana is an open
source project for interactive data
visualization services used for monitoring
and alarming, that’s commonly used with
the Prometheus open source project. Grafana supports
multiple data sources, such as Prometheus,
Amazon CloudWatch, AWS X-Ray,
Elasticsearch and AWS Timestream, allowing for the creation
of dashboards and alerts
from multiple sources. Although it’s easy
to deploy a single Prometheus or Grafana
server in AWS, it can take weeks of work
to scale across multiple servers and configure the entire environment
for high availability. So I am excited to announce Amazon Managed Service
for Grafana and Prometheus. Using these services, we will manage the provisioning
and setup for Prometheus along with, of course, ongoing
maintenance and scaling operations. The Prometheus Query Language
is optimized for large volumes of data,
commonly in container monitoring. This makes it easy
to search and group metrics, such as CPU, memory, and latency, at a granular level so that container issues can be
isolated and alarmed on quickly. Engineering teams can use the same
familiar Prometheus Query Language to filter, aggregate
and alarm on metrics and quickly gain
performance visibility without any code changes.
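To give a feel for what this looks like from the application side, here is a small sketch using the open source Prometheus Python client; the endpoint label and the PromQL query in the final comment are illustrative assumptions.

```python
# A small sketch of instrumenting a service with the open source Prometheus
# Python client (pip install prometheus-client). The endpoint label and the
# PromQL query in the final comment are illustrative.
import random
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Latency of handled requests",
    ["endpoint"],
)

@REQUEST_LATENCY.labels(endpoint="/checkout").time()
def handle_checkout() -> None:
    time.sleep(random.uniform(0.01, 0.05))     # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                    # exposes /metrics for scraping
    while True:
        handle_checkout()

# Illustrative PromQL: 99th percentile latency per endpoint over 5 minutes
#   histogram_quantile(0.99,
#     sum(rate(request_latency_seconds_bucket[5m])) by (le, endpoint))
```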
The Amazon Managed Service for Grafana makes it simple
for engineering teams to query, visualize and alert
on data services such as metrics, logs and traces,
no matter where they are stored. These services are available
today as a preview on AWS. When it comes to observability, we talked a lot about CloudWatch
and other AWS services. But AWS isn’t the only company
building these services. And I’ve always said that AWS is
so much more than just AWS services. And I’d be remiss if I didn’t
recognize all the tools being developed for a complete
monitoring and alerting ecosystem. We have great partners who are
also operating in this space, like Sumo Logic and Splunk
and Datadog and New Relic and AppDynamics. But all these have a different
approach to collecting data. And it can be challenging
to combine the different approaches, which is something the Cloud
Native Computing Foundation is trying to create
a foundational approach for, with the OpenTelemetry Project. Open Telemetry provides open source
APIs, libraries and agents, to collect traces and metrics
for application monitoring. The AWS distro for OpenTelemetry
consists of collectors that are built
into the application and exporters that send data back
to backend analytics targets. In addition to supporting AWS
targets like CloudWatch and X-Ray, customers can also send traces
and application metrics to a number of AWS partners
and third-party providers. The distro for OpenTelemetry simplifies the process
of collecting data by allowing you to instrument
your applications just once, instead of using
multiple tools from different vendors to collect metrics and traces. We are excited about
the OpenTelemetry Project and, in addition
to providing this distribution, we are also contributing
to the upstream project for a number of components. Now, no matter
how you choose to log, monitor, trace and alert,
there is a tool that fits your needs. [music playing] We have covered
a lot of ground today. We have talked about
the importance of development, how to build dependable applications,
and how to effectively run them. If you paid close attention, you will notice that there has been
a trend with all these things. More and more AWS is taking tasks
that can be slow, difficult, or time-consuming,
and making them easier to use by using advanced technologies
to simplify them. These technologies can include
automated reasoning, or even machine learning. Take this AWS Panorama
appliance, for example. This device allows you
to deploy machine learning models to existing
industrial cameras, like the ones that will be here
in this factory. With this device a business
could run computer vision models for tasks such
as quality control, or part identification or security
or workplace safety. And this isn’t changing
any of the processes that were already being done, but it’s improving them
to work more efficiently. Or there’s Amazon Lookout
for Equipment, a machine learning service that detects abnormal equipment
behavior using IoT sensors. As technologies advance, you will continue to see
these technologies improve our work, so that we can all be more efficient. At AWS, with every
new service we build, we ask how we can make it better
by using machine learning. And you can see this
with many of the services we’ve released over the past
few weeks around databases, security, and operations. These aren’t machine
learning services, but they are services
enhanced by machine learning. Amazon isn’t alone
in doing this. Just like Ava, businesses around
the world are integrating machine learning in its existing applications and data to get more value out
of what they already have. As technology that powers
our work advances, we will continue to chip away
at the heavy lifting that we all do
on a daily basis. So, as you are building
or using a system, take a few minutes to think about
which parts can evolve
from simple automation, and can make use
of advanced technologies such as machine learning. You might be surprised by all
the places that ML can help. When we started today,
we talked a lot about how AWS is meeting
developers where they are. And what if we applied machine
learning to that? Services like Amazon CodeGuru
were built to solve problems exactly like this.
When you are writing software, there are a lot of things
that need to be checked. Problems like memory leaks,
or hard coded credentials and duplicate lines
after refactoring, which won’t prevent
your code from compiling, but can still cause problems
for your application. Typically, these problems
are identified during code reviews
before branches are merged. But these are difficult tasks,
and some vulnerabilities are easy to miss, especially if there are many changes
that are happening at once. And that’s why we built
Amazon CodeGuru. CodeGuru uses machine
learning to automate code reviews during
application development and to profile applications
after they have been deployed. As code is checked in,
CodeGuru Reviewer will automatically review your code, just like a senior engineer
in your team would do. It provides advice on what’s wrong,
and gives you links to documentation. What is great about how it works
is that it does these checks automatically
when you check in the code, just like you would do
with another code review. This way you can find
and fix problems early, and the code reviews performed
by members of your team can focus on more important aspects
of your business logic. CodeGuru Profiler, on the other hand,
attaches to running applications in your test
or production environment. Using machine learning,
it inspects your running applications in order to find
performance bottlenecks. It allows you
to troubleshoot latency and CPU utilization issues, and to learn where you can
reduce infrastructure costs. It identifies
application performance issues. By combining automated
reviews from CodeGuru with the learnings from AWS
Fault Injection Simulator and the recommendations
from DevOps Guru, we can improve the entire
development lifecycle using machine learning.
And this is just the beginning. There are so many parts
of application development that involve writing and rewriting
bits of our applications that are essentially just plumbing
and aren’t of any business value. As the tools we use advance,
machine learning is going to continue to remove undifferentiated heavy
lifting of building software. Like machine learning, another field
I am getting really excited about, and that I think is going
to change our assumptions about what is possible in the world,
is quantum computing. I know. We’ve been hearing
about Quantum for years now, and how it will be
the next big thing. I truly believe
that at some point it will become the next
game-changing technology. It’s going to happen slow at first, providing small optimizations
and enhancements, but eventually it will revolutionize
the areas it is well-suited for. Chemistry research,
drug discovery, material sciences, they are all going to be
some of the first industries that will benefit
from quantum computing. Just like GPUs have changed
the field of machine learning, I believe that quantum processors
will eventually do the same for many of these
scientific fields. AWS is investing
heavily in quantum. Our Quantum Solutions Lab
connects experts with organizations to build internal expertise
and strategies required to run
quantum workloads. The AWS Center for Quantum Computing
is a partnership with Caltech where we are researching quantum
computing algorithms and hardware. Just last week, our scientists
published a new research paper showing a theoretical quantum
computing system with groundbreaking improvements in error correction. We also launched Amazon Braket
which democratizes access to quantum resources
from a number of providers. And this is where the power
of the cloud really shines. A popular use case of Braket is
for developers to learn to evaluate if quantum could enhance
their workloads as it advances. Historically, you’d have to wait
for technology like this to leave the research phase, be turned into
a mass-produced product, and purchased at enormous cost. Only then would you
be able to determine if it would actually help you. By making quantum computers and expert assistance available to every developer today, we are helping AWS
customers stay one step ahead. As a software developer, now is the time to start thinking
about quantum computing. It is the way that we are
starting to see machine learning make an impact
on our daily lives, and the future that technologies
like quantum computing will enable that make me so excited about cloud
and technology as a whole. And I am excited to see
how developers will use these technologies to truly
improve the world around us. I want to thank you for spending
time with me today as we explore this amazing space
near my favorite city. This year has been challenging
in a lot of ways, but I also believe that challenges
are the best time to reflect and think about
whether you are building the right services
for your customers. We have talked about how AWS
is meeting developers where they are, and that’s because
you are our customers. As you develop your applications, think about what you can do to meet
your customers where they are. Many of us have experienced
severe anxiety and uncertainty
about the future in the past year. Uncertainty about our jobs,
health, financial future, family, and much more.
I strongly urge our customers, for example those
in financial services, to be conscious about this when they build new ways
to engage their customers. Address these important
issues upfront, and let them come back,
for example, in the way you design
your interfaces, or what services you could
build for your customers that help them in addressing
their uncertainty. For almost all of us, digital
services have become essential. But this means that these services
are not just for the digital natives with their fiber connections
and the latest smartphones. Consider the experiences
you are building for them, and how they access them. Not everyone has 5G or even a strong
Wi-Fi connection at home. If you build essential services, make sure they also work on low
bandwidth, high latency connections. There are enough news reports of kids having to go to the parking lot
of a grocery store or a fast-food restaurant to just get a decent
internet connection for school. The applications you build
are essential for your customers, whether you are
building a service that helps people budget
what they have, and helps them predict
their immediate financial future, or maybe you are
building a website that helps people stay
connected to each other where they wouldn’t
be able to otherwise. We, as developers, have
a responsibility to our customers to build the best applications
we can for them in ways that take the current
reality very seriously. It's never been a better time
to use your knowledge, skills and talents
to make a difference in the world. Now, go build. [music playing]