AWS re:Invent 2022 - A closer look at AWS Lambda (SVS404-R)

Captions
- Well, good morning everybody. Welcome to SVS 404, a closer look at AWS Lambda. Hello everyone. Welcome to the room. If you're here today, super cool to see you. I also know there's an overflow room somewhere in the region of Las Vegas, so you're watching on a camera, hello to everybody there. And this is gonna be recorded, so if you're watching this on YouTube later, I suppose, hello into the future. My name's Julian Wood, I'm a senior developer advocate here at AWS Serverless, and I work on a cool team within the serverless product organization. I really love helping developers and builders understand how best to build serverless applications, as well as being your voice internally to make sure we're creating the best products and features. And I'm also gonna be joined shortly by the awesome Chris Greenwood. He's one of our principal engineers who works in the Lambda team. Now, this talk today is gonna follow on from two previous Lambda under the hood talks. The first was at re:Invent 2018, where Holly Mesrobian and Marc Brooker first showed how some of Lambda worked, covering the synchronous invoke path and the Firecracker microVM technology. Then in 2019 they covered how async invoke works, how you could use Lambda to poll from Kinesis, and scaling up with provisioned concurrency. Today I'm gonna start with a bit of a recap of how the different Lambda invocation models work, and then both Chris and I are gonna dive deep into some of the interesting ways we've evolved the service since then and solved some interesting challenges to bring new functionality to Lambda. You can use the QR codes, which use s12d.com, a link shortener for Serverless Land, to watch the previous talks. Now, as a 400-level talk, we're not gonna go into the basics of what Lambda is, but it's worth highlighting how Lambda is the fastest way to build modern applications with the lowest total cost of ownership. We strive to help teams build more and faster by taking on everything that makes cloud development hard. We make your distributed system challenges Lambda's problem to solve. Teams can focus on just their code, which avoids all the maintenance costs and drudgery of making cloud software work. And as a cool bonus, costs align with usage, which avoids a lot of waste from idle capacity. Lambda really is the compression algorithm for experience. You can turn code into a well-architected service with negligible operational effort. This is why serverless adoption is growing so fast. A serverless strategy enables you to focus on things that benefit your customers rather than infrastructure management. We launched Lambda eight years ago at re:Invent, and today more than a million customers have built applications that already drive more than 10 trillion Lambda invocations per month. And Lambda still continues to grow at a phenomenal rate. We see customers building a wide variety of applications with Lambda, but there are a few common areas where they tend to see huge benefits. Many people start out with IT automation, using Lambda functions to validate some config each time they launch an EC2 instance. Teams may then start using Lambda integrations with S3, Kinesis Streams, or Kafka to build data processing pipelines.
And then multiple teams may start building whole microservices-based applications using Lambda, often for web applications, and use Lambda with other managed services as a critical part of their event-driven applications. And of course, we've got many customers also using Lambda in machine learning applications. We do a ton of innovation in all areas of the Lambda service to make the complex simple and give you more, while crucially keeping it simple to use, so you can take advantage of all that Lambda has to offer to solve many challenges at any scale. And we think Lambda's been the pioneer in helping developers build serverless applications, from GA in 2015 through support for 15-minute functions in 2018, things like provisioned concurrency and EventBridge as an event source in 2019, container images and 10 GB functions in 2020, and in 2021 Lambda extensions and Arm-based Graviton2. This year, it keeps on going: we've got bigger 10 GB /tmp storage and function URLs, all while adding more languages, shrinking latency, and improving integration with other services. For example, Coca-Cola is a great customer that had to react quickly to the changes brought on by the COVID pandemic. They wanted to provide a touchless experience for their Freestyle drink dispensers, and so built a new smartphone app that allows customers to order and pay for drinks without coming into contact with their vending machines. Because it's built with Lambda, their team was able to focus on the application rather than spending a huge amount of extra time on security, latency, or scalability, since with Lambda, that's all built in. As a result, they built this new application in just a hundred days, and now over 30,000 machines have that touchless capability. Now, one of the important benefits of Lambda is that it responds to data in near-real time. As customers create more data streams, a serverless approach to processing their data is really appealing. An example is Nielsen Marketing Cloud, which is doing this at incredibly high scale. They are using Lambda to process 250 billion events per day. Yes, that's with a B, all while maintaining quality, performance, and cost using a fully serverless pipeline. And the scale of the system is really why they chose to use Lambda. On a peak day, Nielsen received 55 terabytes of data with 30 million Lambda invocations, and the system managed it with no problem; in fact, they only had up to 3,000 concurrent invokes. On normal days, they see between one and five terabytes of data, and they were able to scale up and meet this need without additional work. I'm really sure the on-call engineers could breathe a sigh of relief. While we're working on Lambda, we focus on things like availability, utilization, scale, security, and performance. And today we're gonna give you a taste of some of the engineering efforts in these areas, but we also do have a priority list. The top priorities for Lambda are security, durability, availability, and features, in that specific order. Now, you may hear that security is always job zero at AWS because it comes before everything else. But for Lambda, after that we then focus on operational excellence for durability and availability, like running your functions in multiple availability zones to ensure they're available to process events in case of a service interruption in a single zone, and even in the unlikely event of an AZ outage.
And we ensure that these basics are covered before we even work on new features. With security being job zero, we have the shared responsibility model, which has a separating line defining how AWS is responsible for security of the cloud, that is, the infrastructure and software that runs all our services, and you are responsible for security in the cloud, ensuring your applications and data are secure. But when building serverless applications with Lambda, AWS takes on even more of the security in the cloud. So you can worry less about many things like operating system patching, firewall configuration, and maybe distributing certificates. You get to focus even more on your application and data and let us manage the underlying platform. A really clear example of this was on December the 9th, 2021, when a seriously bad remote code execution vulnerability in Apache Log4j was announced to the world. Many customers were scrambling to update Java across their environments last December, ruining it for so many engineering teams, and probably for people in this room as well. But if you were using Lambda, we handled most of it for you. Now, Lambda doesn't actually include Log4j in its managed runtimes or base container images, so you were, in effect, already protected, but if it had been included, you would have been patched automatically. Separately, we also worked with the Java Corretto team to build a mitigation: if you did have Log4j as part of your Lambda function code, we could block the functionality and log that it had happened. Lambda protected you without you needing to do anything. And one of the major technological foundations of Lambda is the open source Firecracker. It's a cool name, I really like it. Firecracker provides efficient resource utilization to run functions with enhanced security and workload isolation over traditional VMs, with super fast startup times. Plenty more coming about Firecracker and how we're innovating in that space. Now, Lambda has two types of invoke models. Synchronous is where the caller calls Lambda, either directly via the CLI or SDK, with the newish function URLs, or through a service like API Gateway, which is then mapped to a Lambda function. This sends the request to the Lambda function, which does its processing, and the caller waits for a response before it's returned to the client. Now, when invoking a function asynchronously, again via a client call, maybe an S3 change notification, or a service like EventBridge sending an event when a rule is matched, you don't wait for a response from the function code. You hand off the events to Lambda and Lambda just handles the rest: it places the event on an internal queue and returns a success response to the caller, saying, "Yep, I got your event." A separate process then reads events from the internal queue and sends them to your function. There's no actual direct return path to the calling service.
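To make the two invoke models concrete, here is a minimal sketch using the AWS SDK for Python (boto3). The function name and payload are placeholders, and the only difference between the two calls is the InvocationType parameter, which selects the synchronous or asynchronous path described above.

import json
import boto3

lambda_client = boto3.client("lambda")

# Synchronous invoke: the caller waits for the function's result.
sync_response = lambda_client.invoke(
    FunctionName="my-function",            # placeholder function name
    InvocationType="RequestResponse",      # synchronous path
    Payload=json.dumps({"orderId": 123}),
)
print(json.loads(sync_response["Payload"].read()))

# Asynchronous (event) invoke: Lambda queues the event internally and
# returns immediately; there is no direct return path for the result.
async_response = lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="Event",                # asynchronous path
    Payload=json.dumps({"orderId": 123}),
)
print(async_response["StatusCode"])        # 202 Accepted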
Now, in order to do this, we have a number of control plane and data plane components, which we're going to talk about. Now, just to tell you, despite some of the rumors I hear, this talk isn't going to be about how we've entirely rebuilt Lambda to run on Kubernetes. Nothing against Kubernetes, but just in case you wondered, we do not in fact run Lambda on Kubernetes. But we do have a control plane service, which is where you interact with the Lambda service, and this is where you create, configure, and update your Lambda function and upload your code to use the various features that we expose. We manage the Lambda service resources too, to make sure they're available to handle your invokes. There are also a number of developer tools to interact with Lambda, which also cross over into some of the data plane tools, and this is everything from the console to the AWS CLI, SDKs, AWS toolkits for various IDEs, and infrastructure as code tools. Then the data plane responds to events sent to the Lambda service, and all invokes end up being synchronously invoked through a number of internal services. Now, these are the data plane services, and the eagle-eyed Lambda experts among you may spot a service we haven't talked about before. This is called the assignment service: we've been evolving the previous Worker Manager service, and I'll go into more depth later on how we solved some Worker Manager problems and built the assignment service for improved availability. The async event invoke data plane, with the Lambda pollers, handles the very powerful asynchronous invoke model to process events and ultimately hand them to the synchronous invoke data plane. So let's have a look at how synchronous invoke works. Requests to invoke a certain function arrive via an Application Load Balancer, which distributes the invoke requests to a fleet of hosts, which then run the stateless frontend invoke service. Now, Lambda is a multi-AZ service, so you don't have to worry about load balancing your functions across multiple AZs, Lambda just handles that for you. I am simplifying the diagram a little bit, it does get a little crazy, but Lambda is built so all its components run transparently across multiple AZs. So, the frontend service first performs authentication and authorization of the request, and this ensures that Lambda is super secure, as only authorized callers even get through Lambda's front door. The service loads the metadata associated with the request and caches as much information as it can to give you the best performance. The frontend then calls the counting service, and this checks whether any quota limits may need to be enforced based on things like account, burst, or reserved concurrency. The counting service is optimized for high throughput and sub one-and-a-half-millisecond latency because it's called on each invoke. It's therefore critical to the invoke path and, because of that, it's made highly available, as with a lot of Lambda, across multiple AZs. The frontend then talks to the assignment service, which has replaced the Worker Manager service. It's a stateful service, and there's gonna be a lot of talk about state in our talk today. This is responsible, like the Worker Manager service was, for routing invoke requests to available worker hosts. If you wanna think about it, the worker is the server in serverless. It's responsible for creating a secure execution environment within a microVM to download, mount, and run your function code.
The worker is also responsible for managing the multiple runtimes, and this is, you know, for languages like Node, Java, and Python and any other kind of languages you bring. It's also involved in setting up the limits, such as memory, based on your function configuration, and then the associated proportional virtual CPU. And, of course, it needs to manage the Lambda agents on the host that monitor and operate the service. Now, there is also another component which does some coordination and manages background tasks. This is the control plane service, which handles function creation, manages the lifecycle of assignment service nodes, and also ensures the frontend service is up to date with which assignment node to send an invoke request to. So back to the assignment service: this is the coordinator, the retriever and distributor of information about which execution environments are on which worker hosts and where invokes should ultimately go. It also keeps the frontend invoke service up to date with which assignment service node to talk to. For the first invoke of a function, a new execution environment needs to be started on a worker host, so the assignment service talks to the placement service to create an execution environment on a worker with a time-based lease. Now, placement uses ML models to work out where best to place an execution environment, and it does this by maximizing packing density for better fleet utilization while still maintaining good function performance and minimal cold start latency. It also monitors worker health and makes the call when to maybe mark a worker as unhealthy. And we need to make sure this runs as fast as possible to reduce latency for any cold start invokes. Once the execution environment is up and running, the assignment service gets your supplied IAM role for that function, so the function's gonna run with the privileges that you define. It then distributes that role to the worker along with the environment variables. The execution environment then starts the language runtime and downloads the function code or runs the container image, and the function initialization process runs. The assignment service then tells the frontend service, "I'm all ready," which then sends the invoke payload to that execution environment. The function then runs the handler code and sends its result back through the frontend service to the original caller for the synchronous invoke. The worker also then notifies the assignment service, saying "I'm finished with the invoke," and so it's then ready to process a subsequent warm invoke. Then when a subsequent invoke comes in, the frontend checks the account quotas with the counting service again, and then talks to the assignment service, which says, "Yes, there is actually a warm and idle execution environment ready," and sends the invoke payload directly to the execution environment on the worker, and your function can run the handler again. When it's finished with the invoke, the worker then tells the assignment service, "Yes, I'm now ready for another subsequent invoke." Then when the lease of the execution environment is nearing its end time, the assignment service marks it as not available for any future invokes, and we have a further process that spins down that execution environment.
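To make that lifecycle easier to follow, here is a toy, illustrative-only Python model of a routing layer tracking warm execution environments with time-based leases. Every class, method, and value here is invented for illustration and is not Lambda's actual implementation.

import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExecutionEnvironment:
    function_arn: str
    lease_expiry: float                                   # end of the time-based lease
    env_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    busy: bool = False

class ToyAssignmentService:
    def __init__(self):
        self._environments: dict[str, list[ExecutionEnvironment]] = {}

    def create(self, function_arn: str, lease_seconds: float) -> ExecutionEnvironment:
        env = ExecutionEnvironment(function_arn, time.time() + lease_seconds)
        self._environments.setdefault(function_arn, []).append(env)
        return env

    def acquire(self, function_arn: str) -> Optional[ExecutionEnvironment]:
        # Return a warm, idle environment whose lease has not expired, if any.
        now = time.time()
        for env in self._environments.get(function_arn, []):
            if not env.busy and env.lease_expiry > now:
                env.busy = True
                return env
        return None          # caller must go through placement (a cold start)

    def release(self, env: ExecutionEnvironment) -> None:
        env.busy = False     # ready for a subsequent warm invoke

service = ToyAssignmentService()
arn = "arn:aws:lambda:us-east-1:123456789012:function:my-function"
assert service.acquire(arn) is None          # first invoke: no warm environment yet
env = service.create(arn, lease_seconds=3600)
env.busy = True                              # handed out for the first (cold) invoke
service.release(env)
assert service.acquire(arn) is env           # the next invoke reuses the warm environment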
Also, we need to handle errors, so if there are any errors within the init phase in the execution environment, the assignment service is able to mark it as not available, and then it's removed from service. Also, if we need to remove a worker from the service, maybe when it needs to be refreshed, which we do regularly, or it has any errors, we can gracefully drain it of execution environments by stopping future invokes. Current invokes will carry on, but we just stop any future ones. So let's switch to talking about asynchronous invokes, or what are also called event invokes. You can see here we have the event invoke frontend service. Now, you may remember that for the sync invoke, we had a frontend invoke service. Slightly similar names, but they're definitely different. Well, the event invoke frontend service sits behind the same frontend load balancers as before, but the load balancers see that it's an event invoke and so send it to the event invoke frontend service rather than the frontend invoke service. We actually deliberately separated these services, building a new data plane to ensure that the async invoke path was separate from the sync invoke path, to protect against a large number of event invokes potentially causing latency on the sync invoke path. Just another way that we can improve availability and performance on your behalf. Now, I'm simplifying the diagram a bit compared to the sync invoke, but again, it's all spread across multiple availability zones, and the service again does authorization and authentication of the request based on who the caller is. So, a bit of a simplified diagram, but please remember it's multi-AZ. We've done the authorization and the authentication, and then the frontend sends the invoke request to an internal SQS queue and responds with an acknowledgement to the caller that Lambda is gonna invoke the function asynchronously. Now, Lambda manages these SQS queues and you don't really have or need any visibility of them. We also run a number of queues, and we can dynamically scale them up and down depending on the load and the function concurrency. Some of these queues are shared, but we also send some events to dedicated queues to ensure Lambda can handle a huge number of async invokes with as little latency as possible. We then have a fleet of polling instances, which we manage, and an event invoke poller reads the message from the internal SQS queue to determine the account, the function, and what the payload is gonna be, and then it sends the invoke request synchronously to the sync frontend service. As you can see, all Lambda invokes ultimately end up as sync invokes. The function then uses the exact same sync pathway I talked about before and runs the invoke. The sync invoke then returns the function response to the event invoke poller, which deletes the message from the queue as it's been successfully processed. If the response is not successful, the poller returns the message to the queue. This uses exactly the same visibility timeout mechanism as you would with your own SQS queues. Then the same or maybe another poller is able to pick up the message, and it's gonna try again. You can also configure event destinations for async invokes to provide callbacks after processing, whether the invokes are successful or maybe they've failed after all the retries are exhausted.
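The retry behavior and destinations just described are configurable per function. As a reference, here is a short boto3 example; the function name and destination ARNs are placeholders, and the retry and maximum-age values are just examples.

import boto3

lambda_client = boto3.client("lambda")

# Configure how Lambda's internal async machinery retries this function's
# events and where to send the outcome once processing finishes.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",                  # placeholder
    MaximumRetryAttempts=2,                      # retries after the first attempt
    MaximumEventAgeInSeconds=3600,               # discard events older than an hour
    DestinationConfig={
        "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:123456789012:successes"},
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:failures"},
    },
)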
There are some additional control plane services involved. The queue manager looks after the queues: it monitors them for any backups and also does the creation and deletion of the queues. It works with the leasing service, which manages which pollers are processing which queues, and it can also detect if a poller fails or isn't doing its job properly, so its work can then be passed to another poller. Now, when we built the Lambda async model, we realized this idea could be used for a number of other service integrations. So an event source mapping is a Lambda poller resource that reads from a source. Initially this was only for Kinesis or DynamoDB, but we've expanded it to poll from your own SQS queues and Kafka sources, including your own self-hosted Kafka. Yep, Lambda can even poll from a non-AWS service, as well as Amazon MQ for RabbitMQ and Apache ActiveMQ. The pollers pull messages from these sources, can optionally filter them, batch them, and then send them to your Lambda function for processing. Let's talk in a little more detail about how this works. A producer application puts messages or records onto the stream or queue asynchronously. We then run a number of slightly different poller fleets, as the pollers have different clients depending on the event source. The pollers read the messages from the stream or queue and can optionally filter them. This is super useful functionality built right into the system: it helps reduce traffic to your functions, simplifies your code, and reduces overall cost. The poller then batches the records together into a single payload and sends them to your function synchronously via the same sync frontend service; again, all Lambda invokes ultimately end up being synchronous. And then for your own SQS queues, the poller can delete the messages from the queue when your function successfully processes them. For Kinesis and DynamoDB, you can actually send information about the invoke from the poller to SQS or SNS. Now, the cool thing to remember is that we manage these pollers for you as part of the Lambda service, and that's a huge benefit. You don't have to run and scale your own consumer EC2 instances or a fleet of containers to poll for the messages, and it's actually free: you don't pay extra above your function invokes for the service. It's just more of what Lambda can do for you. There are also some other control plane services involved here. The state manager or the stream tracker, depending on the event source, manages the pollers and the event sources. It discovers what work there is to do and then handles scaling the poller fleets up and down. The leasing service again assigns pollers to work on a specific event or streaming source, and if there's a problem with a poller, it moves its work to another poller. Having these poller fleets allows us to support a huge number of event source features. There's a lot on the slide, but it basically means that we can do things like filtering, configuring batch sizes and windows, and reporting on partial batch failures, a whole raft of settings depending on the event source. You configure these on the event source mapping itself or, yes, sometimes on the original event source, but it really makes it easier to process data at scale with Lambda.
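As a concrete reference for the caller's side of this, here is what creating one of these event source mappings with filtering, batching, and partial batch failure reporting can look like in boto3; the queue and function names are placeholders and the filter pattern is just an example.

import json
import boto3

lambda_client = boto3.client("lambda")

# Create an event source mapping so the Lambda-managed poller fleet reads an
# SQS queue, filters and batches records, and reports partial batch failures.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders",   # placeholder queue
    FunctionName="my-function",                                   # placeholder function
    BatchSize=25,
    MaximumBatchingWindowInSeconds=5,
    FunctionResponseTypes=["ReportBatchItemFailures"],
    FilterCriteria={
        "Filters": [
            # Only deliver messages whose JSON body contains "type": "order_created".
            {"Pattern": json.dumps({"body": {"type": ["order_created"]}})}
        ]
    },
)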
So that covers the sync, async, and poller functionality and how that works with Lambda. Now let me hand you over to one of our top Lambda gurus, Chris Greenwood. (audience applauds) - Thank you, Julian. My name is Chris Greenwood, and I'm a principal engineer with AWS Lambda. My goal here today is to demonstrate that, in some ways, Lambda is a storage service. My first position with AWS was working on Elastic Block Store, or EBS. EBS is a large distributed storage service that backs the majority of EC2-attached block devices in the world today. The service is faced with many of the interesting storage challenges that you may have heard about in talks from the storage track. A few years ago, I joined Lambda expecting to find a much different problem space: more serverless, more ephemeral, more stateless. But in the months that followed, I found state management challenges that were very similar to those I faced in EBS. What I came to realize is that Lambda is a stateless serverless compute product to you, the function developer, but on the back end involves many of the state management challenges that are present in a storage service. To some degree, Lambda is a storage service that you don't have to think about. So for the next segment of this talk, I'd like to look at Lambda as a storage service, and we're gonna talk about three lessons that Lambda has learned from our peers on the storage service side that have helped us improve the Lambda service. The first is that access patterns to your data should drive the storage layout of that data in your data plane. The second is that shared state is important for performance and for utilization of a storage service, and that the best kind of shared state comes without sharing state, and I'll explain what that means in a little bit. And the third is that a well-architected storage service often has an equally well-architected presentation layer into the caller. In other words, you meet the caller where the caller is. We'll start with lesson one. To frame the topic, what state needs to be managed by Lambda in order to serve an invoke? Fundamentally, an invoke involves three things. First, it involves invoke input. Invoke input is the JSON payload provided by the caller or provided by the event source on every invoke. Second, it involves code. Code is provided by the function developer when creating or updating the function. And third, it requires a place in which to marry invoke input and code: the virtual machine. That's two pieces of state, plus state and compute in the virtual machine, that together make up the necessary bits of information to serve an invoke. Julian gave us an overview of the invoke frontend and the poller fleet, which together serve to get invoke input to the correct machine at the correct time. Now we're gonna talk about code. When Lambda launched, to deliver code to a machine in order to spin up an execution environment, Lambda did the simple thing: we downloaded that code from S3. Each environment started up with Lambda code that would discover which code payload governed the function, download that code payload from S3, unpack it into the environment, and start the runtime. Starting that runtime would essentially turn that code into a running VM and make that running VM ready to take invoke input and execute. This worked just fine and is in fact the way that the majority of Lambda functions in the world work today.
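As a rough sketch of that original all-at-once mechanism, the essence is a whole-payload download and unpack before the runtime can start. The bucket, key, and target directory here are placeholders; this is an illustration under those assumptions, not Lambda's actual code.

import io
import zipfile
import boto3

s3 = boto3.client("s3")

def fetch_and_unpack_code(bucket: str, key: str, target_dir: str = "/tmp/task") -> None:
    # All-at-once code delivery: download the entire zip payload from S3
    # and unpack it before the runtime starts.
    obj = s3.get_object(Bucket=bucket, Key=key)
    with zipfile.ZipFile(io.BytesIO(obj["Body"].read())) as archive:
        archive.extractall(target_dir)
    # The time this takes scales with the size of the code payload, which is
    # the property that motivated incremental loading for container images.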
But as our requirements changed, so did our architecture. In 2020, we set out to build container packaging support for the Lambda product. This changed state management in a big way, in that code payloads were now much larger. While zip code payloads are limited to 250 megabytes in size, we realized that container images are often much larger, and we decided to support up to 10 gigabytes in size for container packaging in Lambda. However, the code delivery architecture for Lambda made it such that the time taken to download a piece of code and unpack it into the environment scaled linearly with code payload size, and we felt this was unacceptable for the new code payload requirements of the Lambda container packaging product. This meant that we had to rethink the mechanism by which we deliver a piece of code into the execution environment. We, along with the community, realized something: in many containerized workloads, a container accesses only a subset of the container image when actually serving requests. This chart, from a paper by Harter et al. published at the FAST conference in 2016, shows the disparity between total container image size, either compressed or uncompressed, and the total amount of that container accessed on the path of a request. While the mechanism Lambda uses ended up being different from that of the Harter paper, we realized that if we could download and present to the execution environment only the bits of a container image that were necessary to serve a request, then we could get the execution environment started more quickly, and we could amortize code delivery time and cost over the lifetime of the execution environment. So, how to go from all-at-once loading of a container image to incremental loading of a container image? The first thing we had to do was to change the way container images were persisted in the Lambda storage subsystem. What we're looking at is what a simple Dockerfile may look like for a container image holding a Node.js function. When you build a container image from this Dockerfile, Docker is going to produce a set of tar files called layers. Lambda takes those layers and flattens them into a file system. This is the file system that is present when your execution environment starts up. Once we have the flattened file system, we break its binary representation on disk into chunks on the block device. A chunk is a small piece of data that represents one logical range of the block device, and multiple chunks appended together make up the entire block device and the entire file system. Chunking is powerful, as it lets us store contents at a sub-image granularity and then also access those contents at a sub-image granularity. Let's consider the example here of a file system exposed into an execution environment. When the execution environment starts, it knows nothing of the contents of the file system. The first access of the file system, say, to list the contents of root, results in an inode read out of the file system. Lambda software maps that inode to a specific chunk in the container image and fetches that chunk to serve the read. A subsequent read to a different inode, say, to open and start the Java binary, may fall into the same chunk, which means that Lambda needs to load no additional data to serve that inode read. Future reads, such as opening and loading the handler class for your function, may fall into different chunks, and chunks are loaded as they're needed.
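Here is a toy Python model of that incremental loading idea: block reads are mapped to chunk indexes, and a chunk is fetched only the first time any read touches it. The chunk size and the fetch callback are invented for illustration and are not Lambda's actual values or interfaces.

from typing import Callable

CHUNK_SIZE = 512 * 1024  # illustrative chunk size only

class LazyChunkedImage:
    # Toy model of serving block reads by fetching image chunks on demand.
    def __init__(self, image_size: int, fetch_chunk: Callable[[int], bytes]):
        self.image_size = image_size
        self.fetch_chunk = fetch_chunk      # e.g. a call into a chunk store or cache
        self.loaded: dict[int, bytes] = {}  # chunk index -> chunk bytes

    def read(self, offset: int, length: int) -> bytes:
        # Serve a block-device read, loading only the chunks it touches.
        out = bytearray()
        end = min(offset + length, self.image_size)
        while offset < end:
            index = offset // CHUNK_SIZE
            if index not in self.loaded:            # first touch: fetch the chunk
                self.loaded[index] = self.fetch_chunk(index)
            within = offset % CHUNK_SIZE
            take = min(CHUNK_SIZE - within, end - offset)
            out += self.loaded[index][within:within + take]
            offset += take
        return bytes(out)

# Two nearby reads (say, two inodes in the same chunk) only fetch one chunk.
image = LazyChunkedImage(10 * CHUNK_SIZE, lambda i: bytes(CHUNK_SIZE))
image.read(0, 100)
image.read(200, 100)
assert len(image.loaded) == 1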
This means that, all up, Lambda delivers to the execution environment the contents that are needed to serve file system requests made by the VM, and does not deliver contents that are not needed. In this way, Lambda has learned lesson one: we let access patterns to our data, specifically the fact that container images are sparsely accessed, influence the storage layout of our container storage subsystem, specifically the fact that we chunk and incrementally persist and access blocks of container images. Moving to the second lesson, sharing state without sharing state. The next thing we realized about container images is that they often share contents with similar images. This is for the simple reason that when people build with containers, they often don't start from scratch. They use base layers. A customer of Lambda may start with a base OS layer, lay on top of that the Lambda-vended Java runtime, and finally, lay on top of that their function code. The Lambda data plane understanding which data is shared and which data is unique is helpful in optimizing code delivery. But it turns out deduplication is difficult in this context for a few reasons. The first is related to the fact that our storage layer is block-based instead of file-based. To allow two copies of the same file system to share contents at the block level, that file system must deterministically flatten layers onto a file system and deterministically serialize that file system onto a block device. But many file systems don't do this. In some file systems, if you take the same set of layers multiple times, flatten them onto a file system, and then serialize those file systems to a block device, you get a different binary representation on that block device for the same logical file system state. To solve this, we did some work in the EXT4 file system to ensure deterministic behavior when flattening and serializing to disk. The second reason deduplication is hard is related to key management. We need different keys encrypting the contents of different customers in different regions, but different keys mean different encrypted payloads even if the contents are similar, which prevents deduplication at the storage layer and at the caching layer. So ultimately, to benefit from deduplication, we need to use the same keys to encrypt data when appropriate. A simple way to do that would be to depend on a shared key in a shared key store, but that results in a single point of scaling across multiple customers and across multiple resources. What we'd like is for the keys to logically be the same, but for the discovery of and persistence of those shared keys to be pushed out closer to where the data itself is stored. To solve this problem, we turn to a technique called convergent encryption. When encrypting a chunked image, we start with a plaintext chunk, which represents one segment of the flattened file system. We take that chunk and append some extra data that is deterministically generated by the Lambda service. Then we compute the hash of that chunk and extra data. This hash becomes the unique per-chunk key for this chunk. That key is then used to encrypt both the chunk and the extra data with an authenticated encryption scheme. Including the extra data in the encrypted payload allows us to verify that the payload remains unchanged at rest and in flight on the decryption path. We do this for each chunk in the file system, writing encrypted chunks to our chunk store and keeping track of the keys for each.
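Here is a minimal sketch of that per-chunk convergent encryption step, using SHA-256 and AES-GCM from the third-party cryptography package as stand-ins for whatever hash and authenticated encryption scheme Lambda actually uses; the chunk contents and extra data are made up.

import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_chunk(chunk: bytes, extra_data: bytes) -> tuple[bytes, bytes]:
    # Convergent encryption sketch: the key is derived from the content itself.
    key = hashlib.sha256(chunk + extra_data).digest()
    # A fixed nonce is tolerable only because each content-derived key encrypts
    # exactly one plaintext; authenticated encryption lets the decryptor verify
    # the payload is unchanged.
    ciphertext = AESGCM(key).encrypt(nonce=b"\x00" * 12, data=chunk + extra_data, associated_data=None)
    return key, ciphertext

# Identical chunks with identical extra data produce identical keys and
# ciphertexts, so they deduplicate; changing the extra data (for example per
# region) forces different keys even for identical contents.
key_a, ct_a = encrypt_chunk(b"shared base layer bytes", extra_data=b"region-1")
key_b, ct_b = encrypt_chunk(b"shared base layer bytes", extra_data=b"region-1")
key_c, ct_c = encrypt_chunk(b"shared base layer bytes", extra_data=b"region-2")
assert key_a == key_b and ct_a == ct_b
assert key_a != key_c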
When complete, we create a manifest containing both keys and pointers to chunks in the chunk store. We finally encrypt that manifest with a customer-specific KMS key and persist the manifest. There are a few great things about this encryption scheme. First, assuming we use the same extra data, we produce the same key from the same chunk contents every time. This allows us to securely deduplicate the contents that should be deduplicated, and critically, this allows us to do so while having each image creation process proceed independently, without depending on a shared key or a shared key store. Second, without changing the encryption scheme and without changing the chunk, by simply changing the extra data, we can force two chunks to use different encryption keys even if their contents are the same. This is helpful in ensuring, for instance, that contents in different AWS regions don't share encryption keys even if the contents are identical. So in this way, Lambda has learned lesson two. We improve cache performance and economics to help overall performance of the storage subsystem, all while minimizing the shared resources on which we take a dependency during the process of creating and then accessing a container image. To recap, we've covered invoke input delivery and code payload delivery. The third aspect of state management we're gonna talk about is the place where code and input meet: the virtual machine. A quick recap of Lambda's history with virtualization architectures. Lambda operates a large fleet of EC2 hardware dedicated to running the millions of active functions in a region. At launch of the service, Lambda provisioned a T2 instance on that hardware and dedicated a T2 instance to each tenant to run one or more of that tenant's execution environments. This leveraged EC2 virtualization technology to isolate one tenant from another. Simple and secure. When a request was handled by the routing layer, it would check for the existence of an existing execution environment for that function, and if one did not exist, it would provision a new T2 instance and attribute it to that customer with the information necessary to download the code and execute the function. At scale, this meant many occupied instances in the fleet and, at any given moment in time, many instances in the process of coming into the fleet for a function or being taken out of the fleet to recycle to a different function. This brings us back to state management, but from a different angle. While with code management our goal was to scale up the amount of state managed, to handle larger code payloads, with our VM fleet our goal was to scale down the amount of state managed. We wanted to provision as little unneeded compute, memory, and disk resource as possible for each VM. Reducing overheads is good for efficiency, both the efficiency of compute at rest and the efficiency with which we scale compute into the fleet and bring compute out of the fleet. This need to scale down our virtualization technology led to the introduction of Firecracker into the Lambda data plane in 2018. Firecracker is a virtual machine manager that Lambda runs on bare metal worker instances. It uses Linux's KVM to manage hundreds or thousands of microVMs on each worker. This allows Lambda to benefit from multi-tenant worker instances, all while using secure VM isolation between execution environments and customers. Firecracker handles IO between the microVM guest kernel and the host kernel.
Within each microVM is the execution environment that the Lambda customer is used to, including the runtime, the function code, and any configured extensions. Firecracker allowed Lambda to both scale down and right-size while maintaining this secure VM boundary between environments and between tenants. Instead of allocating an entire T2 instance to each tenant, we were able to allocate a much smaller microVM, potentially from an existing instance, to each execution environment. Less overhead per VM made our VM fleet more efficient and also made the operations of standing up new compute and tearing down old compute more efficient. This move to Firecracker as a virtualization technology allowed us to leverage lesson number three: meeting our caller where they are. So now in our high-level architecture, we have our chunked, encrypted image, and we have the microVM that intends to make use of that image. But somehow, we need to expose a usable container image, or file system, into the microVM. And in this regard, our job was a little bit harder with container support than it was for zip packaging. Let's focus in on a single microVM on a single host. With zip packaging, since Lambda owns and manages the execution environment, or the guest environment, we are able to employ a small amount of code in the VM that knows what to do with the code payload to start the execution environment. But with container packaging, the whole concept is that the entire guest environment is customer provided, leaving nowhere in the guest to implement image handling functionality such as chunking, decryption, and so on. With our EC2 virtualization stack, this would've been the end of the road, as running in the guest was our only option. We were, in fact, the guest of EC2. But running Firecracker on bare metal, we had space to securely run service code outside of the customer VM but on the host. So we built a virtual file system driver that presents an EXT4 file system from the worker into the VM. When the VM issues requests against the file system, they're handled by Lambda code outside of the VM. Our file system implementation interprets the manifest to deterministically map inode reads from the file system to chunk accesses in the image. It interacts with KMS to securely decrypt the manifest, caches chunk metadata, and serves file system writes from an overlay that is local to the file system. In serving reads, the file system consults various tiers of storage, from a host-local cache, to a larger AZ-level cache, to an authoritative chunk store in S3. And all of this chunking, decryption, and cache-sharing work is abstracted from the customer behind a simple file system interface. In other words, we meet the caller where they are, abstracting away the storage complexity behind an interface that the customer is used to and is expecting.
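As a toy illustration of that tiered read path, here is a sketch where faster tiers are consulted first and backfilled on a miss; the tier names, the chunk key, and the origin callback are all invented for illustration, not Lambda's real components.

from typing import Callable, Optional

class ChunkTier:
    # One tier in the read path: a cache lookup plus a backfill target.
    def __init__(self, name: str, store: dict[str, bytes]):
        self.name = name
        self.store = store

    def get(self, chunk_key: str) -> Optional[bytes]:
        return self.store.get(chunk_key)

    def put(self, chunk_key: str, data: bytes) -> None:
        self.store[chunk_key] = data

def read_chunk(chunk_key: str, tiers: list[ChunkTier], origin: Callable[[str], bytes]) -> bytes:
    # Consult faster tiers first; on a miss, fall back and backfill on the way out.
    for i, tier in enumerate(tiers):
        data = tier.get(chunk_key)
        if data is not None:
            for warmer in tiers[:i]:      # populate the faster tiers we missed in
                warmer.put(chunk_key, data)
            return data
    data = origin(chunk_key)              # authoritative chunk store (for example, S3)
    for tier in tiers:
        tier.put(chunk_key, data)
    return data

local = ChunkTier("worker-local cache", {})
zonal = ChunkTier("AZ-level cache", {"chunk-42": b"cached bytes"})
result = read_chunk("chunk-42", [local, zonal], origin=lambda k: b"from S3")
assert result == b"cached bytes" and local.get("chunk-42") == b"cached bytes"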
This mostly completes our storage journey, from management of invoke input, to management of code and container images, to the presentation of code into the VM. However, we realized an additional opportunity to leverage all of this state management work and all of these lessons learned to meet the evolving needs of the customer. Let's talk about the elephant in the room when it comes to Lambda outlier latencies: cold starts. Over 99% of invokes to Lambda are served by an execution environment that is already alive when the invoke occurs. These are warm starts. But occasionally, due to idleness or due to scaling up of incoming calls, an invoke must spin up a new execution environment before running function code. This is a cold start, and it impacts a function's outlier latencies. Spinning up a new environment involves steps like launching the VM, downloading code, and unpacking code, but also critically involves the steps of starting the runtime and initializing function code. These steps of starting the runtime and running function initialization can dominate cold start time, especially for languages like Java with a language virtual machine that must start up. And this time and cost must be paid on every single execution environment that is brought into the fleet in service of your function. So while cold starts can be rare, they can also be very impactful to the end customer experience. At Lambda, we track our control plane and invoke latencies at the P99 level, the P99.9 level, and above. Outliers are rare but have outsized impact on the customer experience, so outliers matter to service owners. Thinking about the initialization process in the abstract, what this process effectively does is take a piece of code sitting in the virtual machine and turn that into a running VM that is then ready to serve an invoke. So what if we could use the storage lessons learned by Lambda to avoid this conversion process of turning code into a running VM in the first place? What if, instead of delivering code to the VM, we were to just deliver a different artifact, an actual running VM, to the host? This is essentially what we're doing with Lambda SnapStart. SnapStart automates the management of execution environment snapshots to significantly improve the outlier latency of Lambda functions. It is initially available for the Corretto Java 11 runtime, and it's powered by the open source work in Firecracker supporting microVM snapshot and restore. With SnapStart, the lifecycle of an execution environment is different. When you update your function code or configuration and publish a new function version, Lambda will asynchronously create an execution environment, download code, unpack code, and initialize the execution environment by running customer function code up to the point where it is ready to serve an invoke. And critically, no invoke has occurred yet. Lambda will then take a VM snapshot, chunk it, encrypt it, and persist the resulting manifest and chunk contents. Then on the invoke path, if a new execution environment is needed, Lambda will restore a new microVM from that persisted snapshot. After a short restore phase, the execution environment will be ready to serve traffic. The net result for applications is that long-tail latencies are drastically reduced, in many cases by up to 90%. You might notice one application in this test suite that actually did not initially get much benefit from enabling the feature. But interestingly, upon making a few minor code changes, this application achieved the largest speed-up of the applications sampled. Most functions will see significant outlier latency benefit with only the configuration change, but occasionally a function may need simple code changes to benefit. As with all performance features in AWS, try it out and let us know how it goes. And this improvement to the cold start experience was made possible by turning a compute problem, converting code into a running VM, into a storage problem of delivering a VM snapshot to a host.
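From the function developer's side, enabling SnapStart is a configuration change plus publishing a version. Here is a short boto3 sketch under those assumptions; the function name is a placeholder, and a real deployment would normally do this through your infrastructure-as-code tooling.

import boto3

lambda_client = boto3.client("lambda")

# Turn on SnapStart for a Java function; snapshots are created for published
# versions, so publish a version after the configuration update completes.
lambda_client.update_function_configuration(
    FunctionName="my-java-function",               # placeholder
    SnapStart={"ApplyOn": "PublishedVersions"},
)
lambda_client.get_waiter("function_updated_v2").wait(FunctionName="my-java-function")
version = lambda_client.publish_version(FunctionName="my-java-function")
print(version["Version"], version.get("SnapStart"))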
So in summary, Lambda has employed a few lessons learned by storage services to improve the performance, efficiency, and overall experience of the Lambda service. Lesson one is that we use customer access patterns to influence how data is laid out in our storage subsystem. Lesson two is that shared state is important for utilization and performance, and that the best kind of shared state comes without actually sharing resources. And lesson three is that storage services spend a ton of time meeting their caller where they are, to hide the complexities inherent in a storage service from the customer. I'm now gonna invite Julian back to the stage for the last segment of this talk. Thank you. (audience applauds) - Thanks, Chris. What a great story. Anybody happy that we just happened to solve Java cold starts with some pretty cool technology behind the scenes? Anybody? (audience applauds) Excellent, good to hear. Well, Chris has been talking a lot about state and how he used lessons from storage to solve moving some pretty big stateful data to support container images and SnapStart. But we also have state to deal with elsewhere in Lambda. Yes, it may be a little bit behind the scenes, but it's important to getting the right input payload to the execution environment. So remember I said earlier we had the Worker Manager service, which was the coordinator between the frontend invoke service and the worker. Well, it had a super important job: to help the frontend get invokes to the execution environment and then manage the execution environment lifecycle. But we had an issue. Worker Manager was one of the first parts of Lambda and, to be honest, was getting a little bloated. Each Worker Manager stored a list of the execution environments it was responsible for, and on which hosts. Looking at it slightly the other way around, the state of any individual execution environment was known to exactly one Worker Manager instance, which stored it in memory. Nice and fast, but stateful and not replicated, no redundancy. We had a problem with state with Worker Manager. So in this example, a Worker Manager in purple, if you can see the different color, manages a whole bunch of purple execution environments on a number of worker hosts. And we did have a control plane service to manage the Worker Managers. If a Worker Manager fails, all those execution environments it was looking after are orphaned. Yep, they continue to handle their existing invokes, but for any new invokes, the frontend has to ask other Worker Managers for an available execution environment. Other Worker Managers don't know about the existing warm execution environments because they're orphaned, and the placement service then has to create new ones. That's a bad customer experience: a cold start when there is actually a warm execution environment available, but it's orphaned. This also means we need to have more spare capacity in the Lambda service to run these additional execution environments until we can reap the orphaned ones, which we do. But this impacts how efficiently we can run Lambda. And the issue gets even worse when we think about how Lambda distributes traffic across multiple AZs, particularly for smaller regions. Here we have an execution environment in AZ1 that's actually owned by a Worker Manager in AZ2. If we have a zonal failure in AZ2, which, of course, is extremely rare but we do need to plan for, all Worker Managers in AZ2 fail, obviously along with all execution environments on the worker hosts in that AZ.
So that's one third of execution environment capacity in a region, gone, unavailable. Yet as there are execution environments in AZ1 and AZ3 that are registered to Worker Managers in AZ2, all those execution environments, though still up and running and not impacted by the zonal issue, also become orphaned. So that works out to one third unavailability within each of the other two AZs on top of the full loss of AZ2: one third, plus a third of the remaining two thirds, is five ninths, a combined total of about 55% of execution environment capacity unavailable. Now, of course, we hold spare capacity in Lambda to handle this sort of failure, but that means a large capacity buffer and a poor customer experience, as each execution environment that needs recreating means a cold start. And this also means the placement service needs to be scaled enough to handle this huge increase in additional requests. And so we decided that Worker Manager needed a refresh, and built the assignment service. Instead of a single Worker Manager in a single AZ managing a set of execution environments across multiple AZs, we built the assignment service, which is made up of three-node partitions split across AZs. Looking at this logically, a single partition consists of one leader and two followers, each in different AZs, and we run many partitions depending on the load. Each assignment service host hosts multiple partitions. The assignment service partition members use an external journal log service to replicate execution environment data. The leader writes to the journal, and then the followers read from the log stream from the journal to keep up to date with the assignments. The partition members can also use the log journal approach to elect the leader. The frontend talks to the leader, and the leader communicates with the placement service to create new execution environments, keeps track of assignments on the worker hosts, and writes the info to the log, which the followers read. Then if the leader fails, a follower can take over really quickly, and we don't lose the state of which execution environments are available to service subsequent invokes. This means that in a zonal outage, execution environments in the working AZs are not orphaned, which means fewer cold starts, less idle capacity, and less load on the placement service. It also means the assignment service has much better static stability. The system maintains stability and state itself; we don't need an external service to fail over functionality to keep the system running and servicing requests. Good static stability is something we are always working towards in AWS and Lambda. A good way to maintain state. When we do then bring up a replacement follower, or maybe when we need to add and remove assignment service nodes for maintaining the system, we can bootstrap the state of all assignments owned by the partition by just reading from the journal log stream from the time of the oldest execution environment, to quickly get up to date.
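Here is a toy Python sketch of that journal-based replication idea: the leader appends assignment records to an append-only log, and followers replay the log to rebuild the same view, so a failover loses no execution environment state. All class and field names here are invented for illustration and are not the real service's design.

from dataclasses import dataclass

@dataclass
class AssignmentRecord:
    function_arn: str
    environment_id: str
    worker_host: str

class Journal:
    # Toy append-only log standing in for the external journal service.
    def __init__(self):
        self.records: list[AssignmentRecord] = []

    def append(self, record: AssignmentRecord) -> None:
        self.records.append(record)

    def read_from(self, position: int) -> list[AssignmentRecord]:
        return self.records[position:]

class PartitionNode:
    # The leader writes assignments to the journal; followers replay it.
    def __init__(self, journal: Journal):
        self.journal = journal
        self.assignments: dict[str, AssignmentRecord] = {}
        self.position = 0

    def record_assignment(self, record: AssignmentRecord) -> None:   # leader path
        self.journal.append(record)
        self.catch_up()

    def catch_up(self) -> None:                                      # follower path
        for record in self.journal.read_from(self.position):
            self.assignments[record.environment_id] = record
            self.position += 1

journal = Journal()
leader, follower = PartitionNode(journal), PartitionNode(journal)
leader.record_assignment(AssignmentRecord("arn:aws:lambda:...:my-fn", "env-1", "worker-a"))
follower.catch_up()
# If the leader fails, the follower already knows every warm environment,
# so those environments are not orphaned and no extra cold starts are needed.
assert follower.assignments.keys() == leader.assignments.keys()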
We still do have an assignment control plane service, and this manages the creation of the partitions and also ensures the frontend knows which partition to talk to for a particular function ARN. So the assignment service is fully resilient against host, network, and even AZ failures using a partition leader approach. And we also did manage to slim down the service by moving some responsibilities elsewhere. It's also written in Rust for performance, tail latency, and memory safety. And altogether it triples the number of transactions per second we can handle on a host, with a meaningful reduction in latency. So all the efficiencies we're able to drive in the service mean that we can get better utilization, and this has a direct impact on the cost of running Lambda. How efficiently can we run a workload given a specific set of resources? We do a ton of work in this area as well. Due to the small footprint of functions and our ability to distribute the workload to fit the curve of our resources, we can be the most efficient way to run code. With Lambda, you only pay when your functions are doing useful work, not for the idle, so it's our job to minimize the idle. Inside Lambda, we optimize to keep servers busy and reuse as much as possible, and we are continually optimizing worker utilization to run Lambda more efficiently and also improve your function performance. So we have systems that help us analyze the resources needed over time to optimally distribute workloads and provision the capacity to fit the curve. Now, for a given function in an execution environment, well, you may think that distributing the load evenly is the best way, but it means you miss out on some efficiencies. Things like cache locality, which you've heard about, are super important, as is the ability to scale. So it's actually better to have some concentration of load, within reason. The worst for efficiency is a single workload on a server: it has a specific pattern and is inefficient with resource usage. It's better to pack many workloads together, so the workloads are not as correlated. But we actually take this a step further. We use models and machine learning to pack workloads optimally together, to minimize contention and maximize usage, while securely caching common data across different functions, which also improves aggregate performance for everyone. And we actually have an entire team that works just on this placement problem, with a distinguished professor and a team of research scientists. And this is all part of the story of how we build Lambda to be the best place to run workloads in the cloud, handling as much of the hard distributed computing problems as possible, especially with state, so you can have the fastest way to build modern applications with the lowest total cost of ownership. Now, to increase your AWS serverless learning, you can use the QR code to find more information to learn at your own pace, increase your knowledge, and, as of yesterday, even earn a serverless badge, if that's your thing. For plenty more general serverless information, head over to serverlessland.com. It's got tons of resources and sort of everything to do with serverless on AWS. And lastly, thanks so much for joining us. Chris and I really appreciate you taking the time today to be with us, and we really hope we were able to look a bit under the hood of Lambda and help you understand how it works and some of the challenges that we are hoping to solve. And if you do like deep, 400-level technical content, a bit of a bribe, but a five-star rating in the session survey certainly lets us know that you'd like more, and we'll be happy to provide. Thank you very much and enjoy the rest of your re:Invent. (audience applauds)
Info
Channel: AWS Events
Views: 39,084
Keywords: AWS, Amazon Web Services, AWS Cloud, Amazon Cloud, AWS re:Invent
Id: 0_jfH6qijVY
Length: 57min 56sec (3476 seconds)
Published: Tue Dec 06 2022