- Well, good morning everybody. Welcome to SVS 404, a
closer look at AWS Lambda. Hello everyone. Welcome to the room. If you're here today,
super cool to see you. I also know there's an
overflow room somewhere in the region of Las Vegas,
so you're on the camera. Hello to everybody there. And also this is gonna be recorded, so if you are watching this
on YouTube later, I suppose, hello into the future. My name's Julian Wood, I'm a senior developer advocate
here at AWS Serverless, and I work on a cool team within the serverless product organization and I really love helping developers and builders understand how best to build serverless applications, as well as being
your voices internally to make sure we're creating the best products and features. And I'm also gonna be joined shortly by the awesome Chris Greenwood. He's one of our principal
engineers who works in the Lambda team. Now, this is a talk today
that's gonna follow on from two previous Lambda
under the hood talks. The first was at re:Invent 2018, where Holly Mesrobian and Marc Brooker first showed how some of Lambda worked. They covered the
synchronous invoke path and talked about the Firecracker MicroVM technology. Then in 2019 they covered
how async invoke works, as well as how you could then use Lambda to poll from Kinesis, and also talked about scaling up with provisioned concurrency. Now today I'm gonna start
with a bit of a recap of how the different Lambda
invocation models work, and then both Chris and
I are gonna dive deep into some of the interesting
ways we've evolved the service since then and solved some interesting challenges to bring new functionality to Lambda. And you can use the QR codes, which use s12d.com, a link shortener for Serverless Land, to watch the previous talks. Now as a 400-level talk, we're not gonna go into the
basics of what Lambda is, but it's worth highlighting how Lambda is the fastest way to
build modern applications with the lowest total cost of ownership. And we strive to help
teams build more and faster by taking on everything that makes cloud development hard. We make your distributed system
challenges Lambda's problem to solve. Teams can focus on just their code, which avoids all the
maintenance costs and drudgery of making cloud software work. And as a cool bonus,
costs align with usage, which avoids a lot of
waste from idle capacity. Lambda really is the compression algorithm for experience. You can turn code into a
well-architected service with negligible operational efforts. Now this is why serverless
adoption is growing so fast. A serverless strategy enables you to focus on things that benefit your customers rather than infrastructure management. And we launched Lambda eight
years ago at re:Invent, and today more than a million customers have built applications that already drive more than 10 trillion Lambda invocations per month. And Lambda still continues
to grow at a phenomenal rate. We see customers building a wide variety of
applications with Lambda, but there are a few common areas where they tend to see huge benefits. Many people start out
maybe with IT automation, using Lambda functions
to validate some config each time they launch an EC2 instance. Then teams may then start using
Lambda integrations with S3 or Kinesis Streams or Kafka to build data processing pipelines. And then multiple teams may then start building whole
microservices-based applications using Lambda, often for web applications, and use Lambda with some other managed services as a critical part of their
event-driven applications. And of course, we've got many customers also using Lambda in machine learning applications. And we do do a ton of innovation in all areas of the Lambda service to make the complex simple. Give you more, while crucially
keeping it simple to use so you can take advantage of
all that Lambda has to offer to solve many challenges at any scale. And we think Lambda's been the pioneer in helping developers build
serverless applications: from GA in 2015, through support for 15-minute functions in 2018, things like provisioned concurrency and EventBridge as a source in 2019, container images and 10 gig functions in 2020. And in 2021 we brought Lambda extensions and Arm-based Graviton2. This year, it keeps on going. We've got bigger 10 gig /tmp space and function URLs, all while adding more languages, shrinking latency and improving integration with other services. For example, Coca-Cola. That's a great customer
who had to react quickly to the changes brought
on by the COVID pandemic. And they wanted to provide
a touchless experience for their Freestyle drink dispensers. And so they built a new smartphone app that allows customers to
order and pay for drinks without coming into contact
with their vending machines. Now because they're built with Lambda, their team was able to
focus on the application rather than spending
huge amounts of extra time on security, latency, or scalability, since with Lambda, that's all built in. And as a result, they built this new application
in just a hundred days and now over 30,000 machines
have that touchless capability. Now one of the important
benefits of Lambda is that it responds to
data in near-real time. And as customers create more data streams, a serverless approach
to processing their data is really super appealing. An example is Nielsen Marketing Cloud, which is doing this at an incredibly high scale. They are using Lambda to process
250 billion events per day. Yes, that's with a B, all
while maintaining quality, performance and cost using
a fully serverless pipeline. And the scale of the system
is really why they chose to use Lambda. On a peak day, Nielsen
received 55 terabytes of data with 30 million Lambda invocations and the system managed it with no problem, and in fact, they only had up
to 3000 concurrent invokes. On normal days, they see between one and five terabytes of data and they were able to
scale up and meet this need without additional work. I'm really sure the on-call engineers could breathe a sigh of relief. While we're working on Lambda, we focus on things like availability, utilization, scale,
security and performance. And today we're gonna give you a taste of some of the engineering
efforts in these areas, but we also do have a priority list. The top priorities for Lambda
are security, durability, availability and features
in that specific order. Now you may hear that security is always job zero at AWS because that comes before everything else. But for Lambda, after that we then focus on operational excellence for
durability and availability, like running your functions
in multiple availability zones to ensure that it's
available to process events in case of a service
interruption in a single zone and even in the unlikely
event of an AZ outage. And we ensure that
these basics are covered before we even work on new features. With security being job zero, we have the shared responsibility model, which has a separating line that defines how AWS is responsible for security of the cloud, that's the infrastructure and software that runs all our services, and you're responsible for security in the cloud, to ensure your applications
and data are secure. But when building serverless
applications, as with Lambda, AWS takes on even more of
the security in the cloud. So you can worry less about many things like operating system patching, things like firewall configuration and maybe distributing certificates. You get to focus even more
on your application and data and let us manage the underlying platform. And a really clear example of this was on December the 9th, 2021, when a seriously bad
remote code execution vulnerability in Apache's Log4j was announced to the world. Many customers were scrambling to update Java across their environments last December, yikes, ruining it for so many engineering teams, and probably for people in this room as well. But if you were using Lambda, we handled most of it for you. Now, Lambda doesn't actually include Log4j in its managed runtimes
or base container images. So you were, in effect, already protected, but if it had been included, you would be patched automatically. Separately, we did also work with the Java Corretto team to build a mitigation. If you did have Log4j as part of your Lambda function code, we could block the functionality and log that it had happened. Lambda protected you without
you needing to do anything. And one of the major technological
foundations of Lambda is the open source Firecracker. It's a cool name, I really like it. And Firecracker provides improved resource utilization to run functions with enhanced security and workload isolation over traditional VMs, with super fast startup times. There's plenty more coming about Firecracker and how we're innovating in that space.
Now Lambda has two types of invocation models. Synchronous, where the caller calls Lambda: this can be either directly via the CLI or SDK, or with the newish function URLs, or through a service like API Gateway, which is then mapped to a Lambda function. The caller sends the request to the Lambda function, which does its processing, and the caller waits for the response before it's returned to the client. Now when you invoke a function asynchronously, again via a client call, or maybe an S3 change notification, or a service like EventBridge sending an event when a rule is matched, you don't wait for a response from the function code. You hand off the event to Lambda and Lambda just handles the rest: it places the event on an internal queue and returns a success response to the caller, saying, "Yep, I got your event." A separate process then reads events from the internal queue and sends them to your function. There's no actual direct return path to the calling service.
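To make the two invocation models concrete from the caller's side, here's a minimal, hedged boto3 sketch; the function name and payload are placeholders, not anything from the talk.

```python
# Minimal sketch: invoking the same function synchronously and asynchronously.
# "my-example-function" and the payload are hypothetical.
import json
import boto3

lambda_client = boto3.client("lambda")

# Synchronous invoke: the caller waits for the function's response.
sync_response = lambda_client.invoke(
    FunctionName="my-example-function",
    InvocationType="RequestResponse",        # sync path
    Payload=json.dumps({"orderId": "1234"}),
)
print(json.loads(sync_response["Payload"].read()))

# Asynchronous invoke: Lambda queues the event internally and acknowledges
# immediately; there is no direct return path for the function's result.
async_response = lambda_client.invoke(
    FunctionName="my-example-function",
    InvocationType="Event",                  # async path
    Payload=json.dumps({"orderId": "1234"}),
)
print(async_response["StatusCode"])          # 202 means "Yep, I got your event"
```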
Now in order to do this, we have a number of control plane and data plane components, which we're going to talk about. Now just to tell you, despite some of the rumors I hear, this talk isn't going to be about how we've entirely rebuilt Lambda to run on Kubernetes,
nothing against Kubernetes, but just in case you wondered, we do not in fact run
Lambda on Kubernetes, but we do have a control plane service, which is where you interact
with the Lambda service, and this is where you create, configure, and update your Lambda function and upload your code to
use the various features that we expose. We manage the Lambda
service resources too, to make sure they're available
to handle your invokes. There are also a number of developer tools to interact with Lambda, which also cross over into
some of the data plane tools. And these range from the console to the AWS CLI, SDKs, AWS toolkits for various IDEs, and infrastructure-as-code tools. And then the data plane responds to get events to the Lambda service, and all invokes end up being synchronously invoked via a number of internal services. Now these are the data plane services, and the eagle-eyed
Lambda experts among you may also spot a service we
haven't talked about before. And this is called the assignment service and we've been evolving the
previous Worker Manager service and I'll go into more depth later on how we solved some
Worker Manager problems and built the assignment service
for improved availability. Now also, the async event invoke data plane, with the Lambda pollers, handles the very powerful asynchronous invoke model to process events and ultimately hand them to the synchronous invoke data plane. So let's have a look at how a synchronous invoke works. Requests to invoke a
certain function arrive via an application load balancer, which distributes the invoke requests to a fleet of hosts, which then run the stateless front end invoke service. Now Lambda is a multi-AZ service, so you don't have to
worry about load balancing your functions across multiple AZs, Lambda just handles that for you. Now I am simplifying the
diagram a little bit, it does get a little crazy, but Lambda is built so all its components run transparently across multiple AZs. So, the front end service
first performs authentication and authorization of the request, and this ensures that
Lambda is super secure as only authorized callers even get through Lambda's front door. The service loads the metadata
associated with the request and caches as much information as it can to give you the best performance. The frontend then calls the counting service, and this checks whether any quota limits may need to be enforced based on things like account, burst, or reserved concurrency.
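One of those quotas, reserved concurrency, is something you configure yourself per function. As a hedged illustration (the function name is a placeholder), here's how it can be set with boto3:

```python
# Hedged example: reserving concurrency for a function. Invokes beyond the
# reserved value are throttled instead of consuming unreserved account
# concurrency. The function name is hypothetical.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="my-example-function",
    ReservedConcurrentExecutions=100,
)
```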
And the counting service is optimized for high throughput and sub 1.5 millisecond latency, because it's called on each invoke. It's therefore critical to the invoke path, and because of that, it's
made highly available, as with a lot of Lambda,
across multiple AZs. The frontend then talks
to the assignment service, which has replaced the
Worker Manager service and it's a stateful service and there's gonna be a
lot of talk about state in our talk today. So this is responsible, like
the Worker Manager service was, for routing invoke requests
to available worker hosts. If you wanna think about it, the worker is the server in serverless. It's responsible for creating
a secure execution environment within a microVM to download, mount, and run your function code. The worker is also responsible for managing the multiple run times and this is, you know, for languages like Node, Java, and Python and any other kind of languages you bring. It's also involved in setting
up the limits, such as memory, based on your function configuration and then the associated
proportional virtual CPU. And, of course, needs to
manage the Lambda agents on the host that monitor
and operate the service. Now here is also another component which does some coordination
and manages background tasks. And this is the control plane service, which handles function creation and manages the lifecycle
of assignment service nodes, and then also ensures the frontend service is up to date with which assignment node to send an invoke request to. So back to the assignment service: this is the coordinator, retriever, and distributor of information about what state execution environments on the worker hosts are in and where invokes should ultimately go. It also keeps the front end
invoke service up to date with which assignment
service node to talk to. For the first invoke of a function, a new execution environment needs to be started on a worker host. So the assignment service
talks to the placement service to create an execution environment on a worker with a time-based lease. Now, placement uses ML models to work out where best to place an
execution environment, and they do this by
maximizing packing density for better fleet utilization while still maintaining
good function performance and minimal cold path latency. It also monitors worker health and makes the call when
to maybe mark a worker as unhealthy. And we need to make sure
this runs as fast as possible to reduce latency for
any cold start invokes. So once the execution
environment is up and running, the assignment service
gets your supplied IAM role for that function, so the function's gonna run with the privileges that you define. It's then gonna distribute that role to the worker along with the environment variables. The execution environment then starts the language runtime, downloads the function code or runs the container image, and the function initialization process runs.
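To make that init phase concrete from the function developer's side, here's a minimal, hedged Python sketch; the environment variable and table name are illustrative only, not from the talk.

```python
# Minimal sketch of a Lambda function in Python showing what runs during
# the init phase versus on each invoke. Names are hypothetical.
import os
import boto3

# Init phase: runs once when the execution environment starts, before the
# first invoke. Expensive setup (SDK clients, config, connections) goes here
# so that warm invokes can reuse it.
TABLE_NAME = os.environ.get("TABLE_NAME", "example-table")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def handler(event, context):
    # Invoke phase: runs for every invoke routed to this environment.
    table.put_item(Item={"pk": event["id"], "payload": event})
    return {"status": "stored", "id": event["id"]}
```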
The assignment service then tells the frontend service, "I'm all ready," which then sends the invoke payload to that execution environment. The function then runs the handler code and sends its results back
through the frontend service to the original caller for
the synchronous invoke. The worker also then notifies
the assignment service saying "I'm finished with the invoke," and so it's then ready to
process a subsequent warm invoke. Then when a subsequent invoke comes in, the frontend checks the quotas with the counting service again, and then talks to the assignment service, which says, "Yes, there is actually a warm and idle execution environment ready," and sends the invoke payload directly to the execution
environment on the worker and your function can
run the handler again. When it's finished with the invoke, the worker then tells the assignment service, "Yes, I'm now ready for another subsequent invoke." Then when the lease of the execution environment is nearing its end time, the assignment service
marks it as not available for any future invokes. And then we have a sort of further process that spins down that
execution environment. Also, we need to handle errors, so if there are any errors within the init phase in the
execution environment, assignment's able to then
mark it as not available and then it's removed from service. Also, if we need to remove
a worker from the service, maybe when it needs to be refreshed, which we do regularly
or it has any errors, we can actually gracefully drain it of execution environments
by stopping future invokes. Current invokes will carry on, but we just stop any future ones. So let's switch to talking
about asynchronous invokes, or also what is called event invokes. And you can see here we have the event invoke frontend service. Now you may remember
that for the sync invoke, we had a frontend invoke
service, slightly similar names, but they're definitely different. Well, the event invoke
frontend service sits behind the same frontend load balancers as before, but the load balancers see that it's an event invoke and so send it to the event invoke frontend service rather than the frontend invoke service. And we actually deliberately separated these services, building a new data plane, to ensure that the async invoke path was separate from the sync invoke path, to protect against a large number of event invokes potentially causing latency
with the sync invoke path. Just another way that we
can improve availability and performance on your behalf. Now, I'm simplifying the
diagram a bit compared to the sync invoke, but again, it's all spread across
multiple availability zones and the service again is
then gonna do authorization and authentication of the request based on who the caller is. So, bit of a simplified diagram, but please remember it's multi AZ, so we've done the authorization
and the authentication, and then the frontend is
gonna send the invoke request to an internal SQS Queue and respond with an acknowledgement to the caller that Lambda is gonna invoke
the function asynchronously. Now Lambda's gonna manage these SQS queues and you don't really have or
need any visibility of them. Now, we also run a number of queues and we can dynamically scale them up and we can scale them
down depending on the load and the function concurrency. Now, some of these queues are shared, but we also send some
events to dedicated queues to ensure Lambda can handle a huge number of async invokes with as
little latency as possible. We then have a fleet of polling instances, which we manage, and
an event invoke poller then reads the message
from the internal SQS queue to determine the account, the function, and what the payload is gonna be, and then it sends the invoke request synchronously to the sync frontend service. And as you can see, all Lambda invokes ultimately
land up as sync invokes. The function's then gonna use
the exact same sync pathway I talked about before and
it's gonna run the invoke. The sync invoke is then gonna return the function response and this is gonna be to the event invoke poller, which is then gonna delete
the message from the queue as it's been successfully processed. If the response is not successful, the poller then's gonna return
the message to the queue. And this uses exactly the same
visibility timeout settings as you would with your own SQS queues. Then the same or maybe another poller is gonna then be able
to pick up the message and it's gonna try again. You can also configure event destinations for async invokes to provide callbacks after processing, and this is whether the invokes are successful or maybe they've failed after all the retries are exhausted.
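As a hedged illustration of what that looks like from your side (the function name and queue ARNs are placeholders), the retry behavior and destinations can be configured like this with boto3:

```python
# Hedged sketch: configuring async invoke retries, maximum event age, and
# success/failure destinations. ARNs and the function name are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_event_invoke_config(
    FunctionName="my-example-function",
    MaximumRetryAttempts=2,            # retries after the first failed attempt
    MaximumEventAgeInSeconds=3600,     # discard events older than an hour
    DestinationConfig={
        "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:111122223333:success-queue"},
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:111122223333:failure-queue"},
    },
)
```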
There are some additional control plane services involved. The queue manager looks after the queues, it's gonna monitor them for any backups, and then also does the creation and deletion of the queues. It's gonna then work
with the leasing service, and this manages which pollers
are processing which queues, and it's also gonna detect
maybe if a poller fails or isn't doing its job
properly that its work can then be passed to another poller. Now, when we built the Lambda async model, we realized, well, this idea can be used for a number of other
service integrations. So an event source mapping
is a Lambda poller resource that reads from a source. Now initially this was only
for Kinesis or DynamoDB, but we've expanded this to
poll from your own SQS queues, Kafka sources, including
your own self-hosted Kafka. Yep, Lambda can even poll from a non-AWS service, as well as Amazon MQ for RabbitMQ and Apache ActiveMQ. The pollers pull messages
from these sources, can optionally filter them, batch them, and then send them to your
Lambda function for processing. Let's talk a little bit more
in detail on how this works. A produce application
puts messages or records onto the stream or queue asynchronously. We then run a number of slightly
different poller fleets, as the pollers have
different clients depending on the event source. The pollers are then
gonna read the messages from the stream or queue and
can optionally filter them. And this is a super useful functionality just built right into the system. It helps reduce traffic to your functions, simplifies your code,
and reduces overall cost. It is then gonna batch
the records together into a single payload
and it's gonna send them to your function synchronously via the same sync frontend service again or Lambda invokes ultimately
land up being synchronous. And then, as with SQS, for
queues, your own SQS queues, the poller can then delete
the message from the queue when your function
successfully processes them. And then for Kinesis and Dynamo, you can actually send
information about the invoke from the poller to SQS or SNS. Now the cool thing to
remember is that we manage these pollers for you as
part of the Lambda service, and that's a huge benefit. You don't have to run and scale your own consumer EC2 instances or a fleet of containers, maybe, to poll for the messages, and it's actually free. You don't pay extra above
your function invokes for the service. It's just more of what
Lambda can do for you. There are also some other
control plane services involved here, the state
manager or the stream tracker, and that's obviously
depending on the event source, is gonna manage the pollers
and the event sources. It's gonna discover
what work there is to do and then is gonna handle
scaling the poller fleets up and down. The leasing service then
again assigns pollers to work on a specific event
or a streaming source, and if there's a problem with a poller, it's gonna move its work
around to another poller. And having these poller fleets allows us to support a huge number of event source features. Now there's a lot on the slide, but it basically means that we can do things like filtering, configuring batch sizes and windows, and reporting on partial batch failures, a whole raft of settings, depending on the event source. And you configure these actually on the event source mapping itself, or yes, sometimes on the original event source, but it really makes it easier to process data at scale with Lambda.
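As a hedged example of those settings from the builder's side (the stream ARN and function name are placeholders), creating an event source mapping with filtering, batching, and partial batch failure reporting can look like this in boto3:

```python
# Hedged sketch: an event source mapping with a filter, batching settings,
# and partial batch failure reporting. ARN and function name are placeholders.
import json
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:111122223333:stream/orders",
    FunctionName="my-example-function",
    StartingPosition="LATEST",
    BatchSize=100,                           # records per invocation
    MaximumBatchingWindowInSeconds=5,        # wait up to 5s to fill a batch
    FilterCriteria={
        "Filters": [
            # Only send records whose JSON data matches this pattern.
            {"Pattern": json.dumps({"data": {"eventType": ["ORDER_PLACED"]}})}
        ]
    },
    FunctionResponseTypes=["ReportBatchItemFailures"],
)
```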
So that covers the sync, async, and poller functionality and how that works with Lambda. Now let me hand you over to
one of our top Lambda gurus, Chris Greenwood. (audience applauds) - Thank you Julian. My name is Chris Greenwood, and I'm a principal
engineer with AWS Lambda. My goal here today is to
demonstrate that in some ways, Lambda is a storage service. My first position with AWS was working on elastic block store or EBS. EBS is a large distributed storage service that backs the majority of
EC2-attached block devices in the world today. The service is faced with many of the interesting storage challenges that you may have heard about from talks from the storage track. A few years ago, I joined Lambda expecting to find a much different problem space, more serverless, more ephemeral, more stateless, but in
the months that followed, I found state management
challenges that were very similar to those I faced in EBS. What I came to realize is that Lambda is a stateless serverless
compute product to you, the function developer, but on the back end involves many of the state management challenges that are present in a storage service. To some degree, Lambda
is a storage service that you don't have to think about. So for the next segment of this talk, I'd like to look at Lambda
as a storage service. And we're gonna talk about three lessons that Lambda has learned from our peers on the storage service
side that have helped us improve the Lambda service. The first is that access patterns
to your data should drive the storage layout of that
data in your data plane. The second is that shared state
is important for performance and for utilization of a storage service, and that the best kind
of shared state comes without sharing state, and I'll explain what that
means in a little bit. And the third is that a
well-architected storage service often has an equally
well-architected presentation layer into the caller. In other words, you meet
the caller where the caller is. We'll start with lesson one. So to frame the topic, what state needs to be managed by Lambda in order to serve an invoke? Fundamentally, an invoke
involves three things. First, it involves invoke input. Invoke input is the JSON
payload provided by the caller or provided by the event
source on every invoke. Second, it involves code. Code is provided by the function developer when creating or updating the function. And third, it requires a
place in which to marry invoke input and code: the virtual machine. That's two pieces of state, plus state and compute in the virtual machine, that together make up the necessary bits of information to serve an invoke. Julian gave us an overview
of invoke frontend and the poller fleet, which together serve to get invoke input to the correct machine
at the correct time. Now we're gonna talk about code. When Lambda launched to
deliver code to a machine in order to spin up an
execution environment, Lambda did the simple thing. We downloaded that code from S3. Each environment started
up with Lambda code that would discover which code payload governed the function, would download that code payload from S3, unpack it into the environment, start the runtime. Starting that runtime would essentially turn that code into a running VM and make that running VM ready to take invoke input and execute. This worked just fine
and is in fact the way that the majority of Lambda functions in the world work today. But as our requirements changed, so did our architecture. In 2020, we set out to build container packaging support
for the Lambda product. This changed state management in a big way in that code payloads
were now much larger. While zip code payloads are limited to 250 megabytes in size, we realized that container
images are often much larger, and we decided to support
up to 10 gigabytes in size for container packaging in Lambda. However, the code delivery architecture for Lambda made it such
that the time taken to download a piece of code and unpack it into the environment scaled
linearly with code payload size, and we felt this was unacceptable for the new code payload requirements of the Lambda container packaging product. This meant that we had
to rethink the mechanism by which we deliver a piece of code into the microVM, or, sorry, into the execution environment. We, along with the community,
realized something. In many containerized workloads, a container accesses a
subset of the container image when actually serving requests. This chart from a Harter paper published in the FAST conference in
2016 shows the disparity between total container image size, either compressed or uncompressed, and the total amount of
that container accessed on the path of a request. While the mechanism Lambda
uses ended up being different than that of the Harter paper, we realized that if we
could download and present to the execution environment only the bits of a container image that were necessary to serve a request, then we could get the
execution environment started more quickly and we could amortize code
delivery time and cost over the lifetime of the
execution environment. So, how to go from all at once
loading of a container image to incremental loading
of a container image? The first thing we had to do was to change the way the container images were persisted in the
Lambda storage subsystem. What we're looking at is what a simple Dockerfile may look like for a container image holding a Node.js function. When you build a container image from this Dockerfile, Docker is going to produce a set of tar files called layers. Lambda takes those layers and flattens them into a file system. This is the file system that is present when your execution environment starts up. Once we have the flattened file system, we break its binary representation on disk into chunks on the block device. A chunk is a small piece of data that represents one logical range of the block device, and multiple chunks appended together make up the entire block device and the entire file system. Chunking is powerful as it lets us store contents at a sub-image granularity and then also access those contents at a sub-image granularity.
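Here's a toy sketch of that chunking idea, not Lambda's actual code; the chunk size and manifest shape are assumed purely for illustration.

```python
# Toy sketch: store a block device image as fixed-size chunks and fetch
# chunks lazily the first time a read touches them.
CHUNK_SIZE = 512 * 1024  # assumed chunk size, for illustration only

class ChunkedImage:
    def __init__(self, manifest, chunk_store):
        self.manifest = manifest          # chunk index -> key in the chunk store
        self.chunk_store = chunk_store    # e.g. a dict, or an object store client
        self.cache = {}                   # chunks already fetched

    def read(self, offset, length):
        """Serve a block-device read, loading only the chunks it touches."""
        data = bytearray()
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        for index in range(first, last + 1):
            if index not in self.cache:   # first touch: fetch just this chunk
                self.cache[index] = self.chunk_store[self.manifest[index]]
            chunk = self.cache[index]
            start = max(offset - index * CHUNK_SIZE, 0)
            end = min(offset + length - index * CHUNK_SIZE, CHUNK_SIZE)
            data += chunk[start:end]
        return bytes(data)
```

A read that touches only one chunk pulls only that chunk, which is the property that lets the execution environment start before the whole image has been downloaded.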
Let's consider the example here of a file system exposed into an execution environment. When the execution environment starts, it knows nothing of the
contents of the file system. First access of the file system, say, to list the contents of root, results in an inode read
out of the file system. Lambda software maps that
inode to a specific chunk in the container image
and fetches that chunk to serve the read. A subsequent read to a
different inode, say, to open and start the Java binary, may fall into the same chunk, which means that Lambda needs
to load no additional data to serve that inode read. Future reads such as opening
and loading the handler class for your function may
fall into different chunks and chunks are loaded as they're needed. This means that all up, Lambda delivers to the execution environment, the contents that are needed
to serve file system requests made by the VM and does
not deliver contents that are not needed. In this way, Lambda
has learned lesson one. We let access patterns to our data, specifically the fact
that container images are sparsely accessed, to influence the storage layout of our container storage subsystem, specifically the fact that we chunk and incrementally persist and access blocks of container images. Moving to the second lesson, sharing state without sharing state. The next thing we realized
about container images is that they often share
contents with similar images. This is for the simple reason that when people build with containers, they often don't start from scratch. They use base layers. A customer of Lambda may
start with the base OS layer, lay on top of that the
Lambda-vended java runtime, and finally, lay on top of
that their function code. The Lambda data plane
understanding which data is shared and which data is unique is helpful in optimizing code delivery. But it turns out the deduplication is difficult in this
context for a few reasons. The first is related to the
fact that our storage layer is block-based instead of file-based. To allow two copies of
the same file system to share contents at the block level, that file system must
deterministically flatten layers onto a file system and
deterministically serialize that file system onto a block device. But many file systems don't do this. In some file systems, if you take the same set
of layers multiple times, flatten them onto a file system, and then serialize those file
systems to a block device, you get a different binary representation in that block device for the
same logical file system state. To solve this, we did some
work in the EXT4 file system to ensure deterministic behavior when flattening and serializing to disk. The second reason deduplication is hard is related to key management. We need different keys
encrypting the contents of different customers
in different regions, but different keys mean
different encrypted payloads even if the contents are similar, which prevents deduplication
at the storage layer and at the caching layer. So ultimately to benefit
from deduplication, we need to use the same
keys to encrypt data when appropriate. A simple way to do that would be to depend on a shared key in a shared key store, but that results in a
single point of scaling across multiple customers and
across multiple resources. What we'd like is for the
keys to logically be the same, but for the discovery of and persistence of those shared keys
to be pushed out closer to where the data itself is stored. To solve this problem, we turn to a technique
called convergent encryption. When encrypting a chunked image, we start with a plain text chunk, which represents one segment
of the flattened file system. We take that chunk and
we append some extra data that is deterministically
generated by the Lambda service. Then we compute the hash of
that chunk and extra data. This hash becomes the unique
per-chunk key for this chunk. That key is then used to
encrypt both the chunk and the extra data with an
authenticated encryption scheme. Including the extra data in
the encrypted payload allows us to verify that the
payload remains unchanged at rest and in flight
on the decryption path. We do this for each
chunk in the file system, writing encrypted chunks
to our chunk store, and keeping track of the keys for each. When complete, we create a manifest containing
both keys and pointers to chunks in the chunk store. We finally encrypt that manifest with a customer specific KMS
key and persist the manifest.
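Here's a hedged, simplified sketch of that convergent encryption idea in Python. The actual algorithms, extra data, and formats Lambda uses aren't public, so the specifics below (SHA-256, AES-GCM, the sample extra data) are assumptions for illustration only.

```python
# Hedged sketch of convergent encryption for a single chunk: the key is
# derived from the chunk contents plus service-chosen extra data.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_chunk(chunk, extra_data):
    """Return (key, ciphertext). Same chunk + extra_data -> same key."""
    key = hashlib.sha256(chunk + extra_data).digest()   # content-derived key
    # A fixed nonce is tolerable in this sketch only because each key is
    # unique to one (chunk, extra_data) pair.
    nonce = b"\x00" * 12
    ciphertext = AESGCM(key).encrypt(nonce, chunk + extra_data, None)
    return key, ciphertext

# Identical contents with identical extra data encrypt identically, so they
# deduplicate; changing the extra data forces a different key.
k1, c1 = encrypt_chunk(b"layer bytes...", b"region=us-east-1")
k2, c2 = encrypt_chunk(b"layer bytes...", b"region=us-east-1")
k3, c3 = encrypt_chunk(b"layer bytes...", b"region=eu-west-1")
assert k1 == k2 and c1 == c2 and k1 != k3
```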
There are a few great things about this encryption scheme. First, assuming we use
the same extra data, we produce the same key from the same chunk contents every time. This allows us to securely
deduplicate the contents that should be deduplicated,
and critically, this allows us to do so while having each image creation process
proceed independently without depending on a shared
key or a shared key store. Second, without changing
the encryption scheme and without changing the chunk, by simply changing the extra data, we can force two chunks to
use different encryption keys even if their contents are the same. This is helpful in ensuring, for instance, that contents in different AWS regions don't share encryption keys even if the contents are identical. So in this way, Lambda
has learned lesson 2. We improve cache performance and economics to help overall performance of the storage subsystem,
all while minimizing the shared resources on
which we take a dependency during the process of creating and then accessing a container image. To recap, we've covered invoke, input delivery, and code payload delivery. The third aspect of state
management we're gonna talk about is the place where code and
input meet: the virtual machine. A quick recap of Lambda's history with virtualization architectures. Lambda operates a large
fleet of EC2 hardware dedicated to running the
millions of active functions in a region. At launch of the service, Lambda provisioned a T2
instance on that hardware and dedicated a T2 instance to each tenant to run one or more of that
tenant's execution environments. This leveraged EC2
virtualization technology to isolate one tenant from another. Simple and secure. When a request was handled
by the routing layer, it would check for the existence of an existing execution
environment for that function. And if one did not exist, it would provision a new T2 instance and attribute it to that customer with the information necessary to download the code and
execute the function. At scale, this meant
many occupied instances in the fleet and at any
given moment in time, many instances in the process
of coming into the fleet for a function or being
taken out of that fleet to recycle to a different function. This brings us back to state management, but from a different angle. Well, with code management, our goal was to scale up the amount of state managed to manage larger code payloads. With our VM fleet, our goal was to scale down
the amount of state managed. We wanted to provision as
little unneeded compute, memory, and disc resources for each VM. Reducing overheads is good for efficiency, both the efficiency of compute at rest and the efficiency with
which we scale compute into the fleet and bring
compute out of the fleet. This need to scale down our
virtualization technology led to the introduction of Firecracker into the Lambda data plane in 2018. Firecracker is a virtual machine manager that Lambda runs on bare
metal worker instances. It uses Linux's KVM to
manage hundreds or thousands of microVMs on each worker. This allows Lambda to benefit from multi-tenant worker instances all while using secure VM isolation between execution
environments and customers. Firecracker handles IO between
the microVM guest kernel and the host kernel. Within each microVM is
the execution environment that the Lambda customer's used to, including the runtime, the function code, and any configured extensions. Firecracker allowed
Lambda to both scale down and right size while maintaining this secure VM boundary between environments and between tenants. Instead of allocating an entire
T2 instance to each tenant, we were able to allocate
a much smaller microVM, potentially from an existing instance, to each execution environment. Less overhead per VM made
our VM fleet more efficient and also made the operations
of standing up new compute and tearing down old
compute more efficient. This move to Firecracker as
a virtualization technology allowed us to leverage
lesson number three, meeting our caller where they are. So now in our high level architecture, we have our chunks, encrypted image and we have the microVM that intends to make use of that image. But somehow, we need to expose
a usable container image, or file system, into the microVM. And in this regard, our job was a little bit
harder with container support than it was for zip packaging. Let's focus in on a single microVM on a single host. With zip packaging, since Lambda owns and manages
the execution environment or the guest environment, we are able to employ a small
amount of code in the VM that knows what to do
with the code payload to start the execution environment. But with container packaging, the whole concept is that
the entire guest environment is customer provided, leaving nowhere in the guest to implement image handling functionality such as chunking, decryption and so on. With our EC2 virtualization stack, this would've been the end of the road, as running in the guest
was our only option. We were, in fact, the guest of EC2. But running Firecracker on bare metal, we had space to securely run service code outside of the customer
VM but on the host. So we built a virtual file system driver that presents an EXT4 file system from the worker into the VM. When the VM issues requests
against the file system, they're handled by Lambda
code outside of the VM. Our file system implementation
interprets the manifest to deterministically map inode
reads from the file system to chunk accesses in the image. It interacts with KMS to
securely decrypt the manifest, caches chunk metadata, and serves file system writes from an overlay that is local to the file system. In serving reads, the file system consults various tiers of storage, from a host-local cache, to a larger AZ-level cache, to an authoritative chunk store in S3. And all of this chunking, decryption, and cache sharing work is abstracted from the customer behind a
simple file system interface. In other words, we meet the caller where they are, abstracting away the storage complexity behind an interface that the customer is used to and is expecting. This mostly completes our storage journey, from management of invoke input to management of code and container images to the presentation of code into the VM. However, we realized an
additional opportunity to leverage all of this
state management work and all of these lessons learned to meet the evolving
needs of the customer. Let's talk about the elephant in the room when it comes to Lambda
outlier latencies: cold starts. Over 99% of invokes to Lambda are served by an execution environment that is already alive
when the invoke occurs. These are warm starts. But occasionally, due to idleness or due to scaling up of incoming calls, an invoke must spin up a
new execution environment before running function code. This is a cold start and it impacts a function's
outlier latencies. Spinning up a new
environment involves steps like launching the VM,
downloading code, unpacking code, but also critically involves the steps of starting the runtime and
initializing function code. These steps of starting the runtime and running function initialization can dominate cold start time, especially for languages like Java with a language virtual
machine that must start up. And this time and cost must be paid on every single execution
environment that is brought into the fleet in service
for your function. So while cold starts can be rare, they can also be very impactful to the end customer experience. At Lambda, we track our control plane and invoke latencies at the P 99 level, P 99.9 level, and above. Outliers are rare but have outsized impact on a customer experience. So outliers matter to service owners. Thinking about the initialization
process in the abstract, what this process effectively does is it takes a piece of code sitting in the virtual machine and it turns that into a running VM that is then
ready to serve an invoke. So what if we could
use the storage lessons learned by Lambda to avoid
this conversion process of converting code into a running VM in the first place? To do so, what if instead of
delivering code to the VM, we were to just deliver
a different artifact, an actual running VM to the host. This is essentially what we're doing with Lambda SnapStart. SnapStart automates the management of execution environment snapshots to significantly improve
the outlier latency of Lambda functions. It is initially available for
the Corretto Java 11 runtime, and it's powered by the open source work in Firecracker supporting microVM, snapshot, and restore. With SnapStart, the life cycle of an execution environment is different. When you update your function
code or configuration and publish a new function version, Lambda will asynchronously
create an execution environment, download code, unpack code, and initialize the execution environment by running customer function code up to the point where it is
ready to serve an invoke. And critically, no
invoke has occurred yet. Lambda will then take a VM
snapshot, chunk it, encrypt it, and persist the resulting
manifest in chunk contents. Then on the invoke path, if a new execution environment is needed, Lambda will restore a new microVM from that persisted snapshot. After a short restore phase, the execution environment will
be ready to serve traffic.
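From the function developer's side, enabling SnapStart is a configuration change plus publishing a version. A hedged boto3 sketch, with a placeholder function name:

```python
# Hedged sketch: enable SnapStart on a Java function and publish a version.
# SnapStart applies to published versions, not $LATEST. Names are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="my-java-function",
    SnapStart={"ApplyOn": "PublishedVersions"},
)

# In practice you'd wait for the configuration update to finish first.
# Publishing a version is what triggers Lambda to initialize the environment,
# snapshot it, and persist the encrypted, chunked snapshot.
version = lambda_client.publish_version(FunctionName="my-java-function")
print(version["Version"])
```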
The net result for these applications is that long-tail latencies are drastically reduced, in many cases by up to 90%. You might notice one
application in this test suite that actually did not
initially get much benefit from enabling the feature. But interestingly, upon making
a few minor code changes, this application achieved
the largest speed up of the applications sampled. Most functions will see
significant outlier latency benefit with only the configuration change. But occasionally a function may need simple code changes to benefit. As with all performance features in AWS, try it out and let us know how it goes. And this improvement to
the cold start experience was made possible by
turning a compute problem, converting code into a running VM, into a storage problem, of delivering a VM snapshot to a host. So in summary, Lambda has employed a few lessons learned by storage services to improve the performance, efficiency, and overall experience
of the Lambda service. Lesson one is that we use
customer access patterns to influence how data is laid
out in our storage subsystem. Lesson two is that shared state
is important for utilization and performance and that the
best kind of shared state comes without actually sharing resources. And lesson three is that storage services spend a ton of time meeting their caller where they are to hide
the complexities inherent in a storage service from a customer. I'm now gonna invite
Julian back to the stage for the last segment of this talk. Thank you. (audience applauds) - Thanks Chris. What a great story. Anybody happy that we
just happen to manage and solve Java cold starts with some pretty cool
technology behind the scenes? Anybody?
(audience applauds) Excellent, good to hear. Well, Chris has been
talking a lot about state and how he uses lessons from storage to solve moving some pretty big stateful data to support container images and SnapStart. But we also have state to
deal with elsewhere in Lambda. Yes, it may be a little
bit behind the scenes, but it's important to getting
the right input payload to the execution environment. So remember I said earlier, we had the Worker Manager service, which was the coordinator between the frontend invoke service and the worker. Well, it had a super important job to help the front end get invokes to the execution environment and then manage the execution lifecycle. Well, we had an issue, which was one of the
first parts of Lambda, and to be honest,
getting a little bloated. Each way, Worker Manager stored a list of execution
environments it was responsible for on which hosts. Looking at it slightly
the other way around, the state of any individual
execution environment was known to exactly one
Worker Manager instance, which stored it in memory. Nice and fast, but stateful and not replicated, no redundancy. We had a problem with
state with Worker Manager. So in this example, a
Worker Manager in purple, if you can see the different color, manages a whole bunch of
purple execution environments on a number of worker hosts. And we did have a control plane service to then manage the Worker Managers. If the Worker Manager fails, all those execution environments it was looking after are orphaned. Yep, they continue to handle
their existing invokes, but for any new invokes, the front end has to ask
other Worker Managers for an available execution environment. Other Worker Managers don't know about the existing warm execution environments because they're orphaned
and the placement service then has to create new ones. And that's a bad customer experience, of a cold start when there is actually a warm execution environment available, but it's orphaned. This also means we need to
have more spare capacity in the Lambda service to run these additional
execution environments until we can reap the orphaned ones, which we do. But this impacts how
efficiently we can run Lambda. And the issue gets even
worse when we think about how Lambda distributes
traffic across multiple AZs, particularly for smaller regions. Here we have an execution
environment in AZ1 that's actually owned by
a Worker Manager in AZ2. If we have a zonal failure in AZ2, which, of course, is extremely rare, but we do need to plan for
all Worker Managers fail in AZ2 obviously along with all execution environments
on the worker hosts. So that's one third of
execution environment capacity in a region, gone, unavailable. Yet as there are execution
environments in AZ1 and AZ3 that are registered to Worker Managers in AZ2, all those execution environments, although still up and running and not impacted by the zonal issue, also become orphaned. So that works out to
one third unavailability within each of the other two AZs for a combined total of about 55% of execution environment
capacity unavailable. Now of course we hold
spare capacity in Lambda to handle this sort of failure, but that means a large capacity buffer and a poor customer experience as each execution environment that needs recreating means a cold start. And this also means the
placement service needs to be scaled enough to handle this huge increase in additional requests. And so we decided that Worker Manager needed a refresh and built
the assignment service. Instead of a single worker manager and a single AZ managing a
set of execution environments across multiple AZs. We built the assignment service, which is made up of three node
partitions split across AZs. Looking at this logically, a single partition consists of
one leader and two followers, each in different AZs. And we run many partitions
depending on the load. Each assignment service host
hosts multiple partitions. The assignment service partition members use an external journal log service to replicate execution environment data. The leader writes to the journal and then the followers
read from the journal's log stream to keep up to date with the assignments. And then the partition members can also use the journal approach to elect the leader. The frontend talks to the leader and the leader communicates with the placement service to create new execution environments and keep track of assignments on the worker hosts, and then writes the info to the log, which the followers read. And then if the leader fails, a follower can take over really quickly and we don't lose the state of which execution environments are available to service subsequent invokes.
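Here's a toy sketch of that journal idea, not the actual implementation: a leader appends assignment events to an append-only log, and any follower (or a bootstrapping replacement) can rebuild the same view by replaying it.

```python
# Toy sketch: leader/follower state replication through an append-only journal.
class Journal:
    def __init__(self):
        self.entries = []                      # append-only log of events

    def append(self, entry):
        self.entries.append(entry)

class PartitionNode:
    def __init__(self, journal):
        self.journal = journal
        self.applied = 0                       # how far this node has replayed
        self.environments = {}                 # env_id -> state, e.g. "idle"/"busy"

    def apply(self, entry):
        env_id, state = entry
        if state == "removed":
            self.environments.pop(env_id, None)
        else:
            self.environments[env_id] = state

    def catch_up(self):
        """Replay journal entries this node hasn't seen yet."""
        for entry in self.journal.entries[self.applied:]:
            self.apply(entry)
        self.applied = len(self.journal.entries)

journal = Journal()
leader, follower = PartitionNode(journal), PartitionNode(journal)

# The leader records assignments as it creates and routes to environments.
for event in [("env-1", "idle"), ("env-2", "busy"), ("env-1", "busy")]:
    journal.append(event)
    leader.apply(event)

# If the leader fails, a follower replays the log and has the same view,
# so no execution environments are orphaned.
follower.catch_up()
assert follower.environments == leader.environments
```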
This means, in a zonal outage, execution environments in the working AZs are not orphaned, which means fewer cold
starts, less idle capacity, and less load on the placement service. It also means the assignment service has much better static stability. The system maintains
stability and state itself. We don't need an external service
to fail over functionality to keep the system running
and servicing requests. Good static stability is
something we are always working towards in AWS and Lambda. A good way to maintain state. When we do then bring up
a replacement follower or maybe when we need to add and remove assignment service nodes for maintaining the system, we can actually bootstrap the
state of all assignments owned by the partition by just reading
from the journal log stream from the time of the oldest
execution environment to quickly get up to date. We still do have an assignment
control plane service, and this is gonna manage the creation of the partitions and also tell the frontend nodes which partition to talk to for a particular function ARN. So the assignment service
is fully resilient against host, network,
and even AZ failures using a partition leader approach. And we also did manage
to slim down the service by moving some responsibilities elsewhere. It's also written in Rust for performance, tail
latency and memory safety. And altogether triples the number of transactions per second
we can run on a host with a meaningful reduction in latency. So all the efficiencies we're
able to drive in the service means that we can get better utilization and this has a direct impact
on the cost of running Lambda. How efficiently can we run a workload given a specific set of resources? And we do a ton of work
in this area as well. Due to the small footprint of functions and our ability to distribute the workload to fit the curve of our resources, we can be the most
efficient way to run code. With Lambda, you only pay when your functions are doing
useful work, not for the idle time. So it's our job to minimize the idle. And inside Lambda, we
optimize to keep servers busy and reuse as much as possible. And we are continually
optimizing the worker utilization to be more efficient running Lambda and also improve your
function performance. So we have systems that help us analyze the resources needed over time to optimally distribute workloads and provision the
capacity to fit the curve. Now for a given function in
an execution environment, well, you may think that
distributing the load evenly is the best way, but it means you miss out
on some inefficiencies. Things like cache locality, which you've heard about, is super important and the
ability to sort a scale. So it's actually better to have some concentration
of load within reason. The worst for efficiency is a
single workload on a server. It has a specific pattern and is inefficient with resource usage. It's better to pack
many workloads together, so the workloads are not as correlated. But we actually take this a step further. We use models and machine learning to pack workloads optimally together to minimize contention and maximize usage while securely caching common data across different functions, which also improves aggregate
performance for everyone. And we actually have an entire team that works just on this placement problem with a distinguished professor and a team of research scientists. And this is all part of the
story of how we build Lambda to be the best place to
run workloads in the cloud, handling as much of the hard
distributed computing problems, especially with state. So you can have the fastest way
to build modern applications with the lowest total cost of ownership. Now to increase your
AWS Serverless learning, you can use the QR code
to find more information to learn at your own pace,
increase your knowledge, and as of yesterday you can
even earn a serverless badge, if that's your thing. For plenty more general
serverless information, head over to serverlessland.com. This has got tons of resources
and sort of everything to do about serverless on AWS. And lastly, thanks so much for joining us. Chris and I really appreciate
you taking the time today to be with us and we
really hope we were able to look a bit under the hood of Lambda and help you know how it works and some of the challenges that we
are hoping to solve. And then lastly, if you do like deep,
400-level technical content, a bit of a bribe, but a five-star rating in the session survey certainly lets us know that you'd like more and we'll be happy to provide. Thank you very much and enjoy
the rest of your re:Invent. (audience applauds)