AWS re:Invent 2018: [REPEAT 1] A Serverless Journey: AWS Lambda Under the Hood (SRV409-R1)

Video Statistics and Information

Captions
Hi, my name is Holly Mesrobian and I'm the Director of Engineering for AWS Lambda. In a little bit I'll be joined by Marc Brooker, who's a Senior Principal Engineer on serverless. Marc and I work together at Amazon on Lambda. Today we plan to walk you through some of the key pieces of the Lambda architecture, and also some of the innovations we've been working on. By the end of this talk you will have a conceptual understanding of the Lambda architecture and understand how your code moves through its systems when you call invoke.

First, a little bit about Lambda, so you'll understand the scale of what we're doing. Just three years after general availability, AWS Lambda already processes trillions of requests across hundreds of thousands of active customers every month. Lambda is currently available in all 18 AWS regions, and as a foundational service we launch in every new region that AWS launches. We have a number of customers using Lambda to build highly available, scalable, and secure services, including Thomson Reuters, which processes 4,000 requests per second for its product insights analytics platform; FINRA, which performs half a trillion validations of stock trades daily for fraud and anomaly detection; and Zillow, which uses Lambda and Kinesis to track a subset of mobile metrics in real time.

Now let's turn to why so many customers are adopting Lambda, and it's because running highly available, large-scale systems is a lot of work. First, you need to ensure your system has load balancing at every layer of your architecture. You do this so you have redundancy, but also so you can handle more traffic than a single server is able to serve. When you plan to build a new service, you need to plan for and provision these load balancing layers between primary architectural components, and you need to configure them with appropriate routing rules so that your load is distributed evenly. Second, on the point of more than a single server can serve, you need to support scaling up, so that if you have more traffic than your current service layer can handle you can continue to serve it, but you also need to scale back down after the traffic peaks so you're not indefinitely over-provisioned, which of course is wasteful. So you also need to plan for and provision auto scaling layers that sit in front of your fleet, evaluate its capacity, scale up with traffic volume and stress on your server pool, and then scale back down as peak traffic decreases. Third, continuing on the point of system failure, you need to consider not only when a host fails but also a complete failure of a data center or Availability Zone. For this you need to instrument each of your services with health checks based on key service metrics, and if a host shows as unhealthy, stop routing traffic to it. Then you need to repeat this for every single system and service component that you build. As a developer, you're now spending a lot of your engineering hours on systems administration. Lambda takes care of all of that for you, and more, helping developers focus on business logic and writing code rather than administering systems. Today we will show you how Lambda transparently supports load balancing, auto scaling, and handling failures, while preserving security, isolation, and utilization.

So let's start off with the Lambda architecture. The Lambda architecture is split into the control plane and the data plane.
The control plane is where engineers and developers typically end up interacting with the Lambda service. On that part of the system we have a set of developer tools, such as the Lambda console, the SAM CLI, and your favorite IDEs and toolchains; you're probably familiar with those. Underneath those tools we have a set of control plane APIs for configuration and resource management. When you go and create or upload a function, you end up interoperating with these APIs, and resource management does the packaging of your code and puts it up into the Lambda service. It's at this point that the data plane really picks up.

The data plane starts with asynchronous invokes and events. This is where we do asynchronous invokes, which you're probably familiar with, and also where we interoperate with systems like DynamoDB, Kinesis, and SQS. We have a set of systems here — pollers, state managers, and a leasing service — that work together to process those events, and once the events are processed they're handed over to the synchronous invoke area of the service, which is where we're going to spend a lot of our time today. In the synchronous invoke area we have the front-end invoke, the counting service, the worker manager, the worker, and the placement service. Let's walk through those components and talk about what they do.

The front-end invoke is responsible for orchestrating both synchronous and asynchronous invokes. As it's at the very front of the service, the first thing it needs to do is authenticate callers: when you call invoke, you want to know that only valid callers will make it to your function. So the very first thing the service does is authenticate the caller, and then, assuming that's OK, it loads the function metadata — things like the environment variables and the limits you set when you created the function through the control plane APIs. Then it confirms the concurrency with the counting service, and assuming we're not exceeding concurrency, it assigns that customer function to a worker manager. We scale our worker managers based on the current running concurrency, so as your function concurrency scales up, the number of worker managers scales up along with it, and thereby more workers are scaled up too, and this distributes load.

The counting service is responsible for providing a region-wide view of customer concurrency to help enforce the concurrency limits you set. It is always tracking the current concurrency of your function executing on the service. If you're below the granted concurrency, the invoke is allowed through; if you hit the concurrency limit, it may or may not be throttled. The reason I say "may or may not" is that we want all customers to get their full concurrency, and if we started throttling as soon as you approached the limit, you would never reach it, so we have some intelligence there to help make sure you get the full concurrency. This service has to be fast and it has to be resilient, and because of that it uses a quorum-based protocol — which you'll probably remember from distributed systems in computer science as a two-thirds-agreement type of protocol. As it's accessed on every single call, it can't introduce latency and slow down performance, so it's designed for high throughput at a low latency of less than 1.5 milliseconds. In addition, this is a critical component, so we make it resilient to failure and highly available by distributing it across multiple Availability Zones.
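To make that concrete, here is a minimal, single-process sketch of the bookkeeping a region-wide concurrency check has to do. It is purely illustrative — the real counting service is a distributed, quorum-based system — and every class and method name below is a hypothetical stand-in, not Lambda's code.

```python
# Illustrative sketch only: a simplified, single-process stand-in for the
# counting service's region-wide concurrency check.
import threading

class CountingService:
    def __init__(self):
        self._lock = threading.Lock()
        self._running = {}   # function_arn -> current concurrent executions
        self._limits = {}    # function_arn -> configured concurrency limit

    def set_limit(self, function_arn: str, limit: int) -> None:
        self._limits[function_arn] = limit

    def try_acquire(self, function_arn: str) -> bool:
        """Grant one more concurrent execution if the function is under its limit."""
        with self._lock:
            current = self._running.get(function_arn, 0)
            limit = self._limits.get(function_arn, 1000)  # assumed default limit
            if current >= limit:
                return False   # the front end would throttle this invoke
            self._running[function_arn] = current + 1
            return True

    def release(self, function_arn: str) -> None:
        """Called when an invoke completes, freeing one unit of concurrency."""
        with self._lock:
            self._running[function_arn] = max(0, self._running.get(function_arn, 0) - 1)
```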
So, the worker manager. It's responsible for tracking container idle and busy state and scheduling incoming invoke requests to available containers. It handles the workflow steps around function invocation, including environment variable setup and compute metering. It assumes the customer-supplied execution role so that the function code executes with the correct privileges, and when a container is not available it handles the scale-up path through the placement service. It also spins down sandboxes and workers when they become idle, because we don't want those running indefinitely if they're not being used. One of the key things this service component does is optimize for running code on a warm sandbox; I'll explain through this talk what a warm sandbox means and what it looks like, so stay tuned for that.

The worker is a very important component of the system architecture. It's responsible for provisioning a secure environment for customer code execution. It creates and manages a collection of sandboxes; it sets limits on sandboxes, such as the memory and CPU available for function execution; it downloads customer code and mounts it for execution; and it manages multiple language runtimes. It executes customer code through init and invoke, and it manages the AWS-owned agents required for monitoring and operational controls like CloudWatch. It's also responsible for notifying the worker manager when a sandbox invoke completes — again, this ties back to warm sandboxes, and Marc is going to talk a lot today about the internals of the worker.

Last, the placement service. It's responsible for placing sandboxes on workers to maximize packing density without impacting the customer experience or cold-start latency. It's really the intelligence that helps determine where we want to put a sandbox when we have a function ready for execution. It monitors worker health and decides when to mark a worker as unhealthy. And again — you're going to hear a lot from me about speed — it's designed to inject no more than 100 milliseconds into the cold-start latency path. Our systems need to be fast. Marc will speak more on this later, and on how it affects utilization.

Now, with a high-level understanding of the primary system components, let's turn back to load balancing and how Lambda does it behind the scenes. Lambda has several modes based on whether a worker is already provisioned and then whether a sandbox is provisioned. We're going to start with the scenario where we have an existing worker but we need a new sandbox, and I'll walk you through that call flow. Here we have a customer calling invoke; that hits an Application Load Balancer, and the load balancer routes the call across a fleet of front-end invoke hosts.
As we talked about earlier, the first thing that front-end invoke does is authenticate that this is a valid caller — and it's important to note that we do caching throughout here, of course, for performance reasons. Assuming the caller is valid, it retrieves the function metadata and then checks the current concurrency against the concurrency limit with the counting service. Assuming we can continue from that point, the front end goes to the worker manager to reserve a sandbox, and the worker manager says, "Great, I have a worker you can put a sandbox on." The worker manager will then create the sandbox, download the code, initialize the runtime, and call the init in your customer code. Once that's done, we say we have a warm sandbox: the sandbox is ready to go and there's nothing more to do other than call invoke. So the worker lets the worker manager know, the worker manager lets the front end know, and now the front end can call invoke, which causes your code to run on the sandbox. At the end of the code execution, metrics are collected, and the worker lets the worker manager know that it is idle — that way the worker manager knows it again has a warm sandbox.

We just left off with a warm sandbox, so let me pick up the scenario where we have an existing worker and an existing sandbox, and another invoke is coming in. The customer again hits the Application Load Balancer, then the front end; we authenticate, we access the function metadata, and we check the concurrency limits with the counting service. The front end then proceeds to reserve a sandbox with the worker manager, and this time the worker manager says, "Great, I don't just have a worker, I have a warm sandbox there," and returns that to the front end. The front end can then call invoke, which causes your code to run, and it again lets the worker manager know when it's done, so that it again knows it has a warm sandbox. I want to emphasize that this is where we spend most of our time — this is the call pattern where most of the time on the Lambda service is spent.

Load balancing is always necessary, but it really shines when you look at high-TPS use cases with consistent traffic that needs high availability and reliability — for instance, web and mobile applications, as is the case for high-traffic startups like Bustle and Nextdoor through to enterprises like Capital One and Comcast. And I love this quote from Edmunds about how quickly they were able to build a Lambda-based solution.
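Putting those warm-path steps together, here is a rough sketch of the front-end orchestration just described — authenticate, load metadata, check concurrency, reserve a sandbox, invoke. It is not Lambda's implementation; every collaborator, class, and method name is an illustrative assumption.

```python
# Illustrative sketch only: the rough order of operations the front-end
# invoke service walks through, per the talk. All collaborators are stand-ins.
class ThrottledError(Exception):
    pass

def handle_invoke(request, auth, metadata_store, counting, worker_manager):
    # 1. Authenticate the caller before doing anything else.
    caller = auth.authenticate(request)

    # 2. Load function metadata (environment variables, limits, handler, ...).
    meta = metadata_store.load(request.function_arn)

    # 3. Check region-wide concurrency with the counting service.
    if not counting.try_acquire(request.function_arn):
        raise ThrottledError(request.function_arn)

    try:
        # 4. Reserve a sandbox via the worker manager. A warm sandbox is
        #    returned directly; otherwise one is created (possibly after
        #    claiming a new worker from the placement service).
        sandbox = worker_manager.reserve_sandbox(request.function_arn, meta)

        # 5. Run the customer's code and return the result to the caller.
        return sandbox.invoke(request.payload)
    finally:
        # 6. Mark the sandbox warm/idle again and release the concurrency unit.
        worker_manager.mark_idle(request.function_arn)
        counting.release(request.function_arn)
```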
In the examples above we covered the case where a worker is already provisioned, but what happens when we scale up quickly, exceed the capacity of our workers, and need to get a new worker? The overall call pattern is similar to what we just saw, but there are additional systems involved. So let's pick up again — I hope you like my pretty pictures here. We have the Lambda customer, and there's a new function, or we're scaling up really quickly in this scenario. We call invoke, which again hits the Application Load Balancer and goes to the front end. At this point you're very familiar with this — you can probably do it in your sleep: the front end authenticates, retrieves the function metadata, checks the concurrency with the counting service, and then proceeds to reserve a sandbox with the worker manager. But this time the worker manager says, "I don't have a worker, and I don't have a sandbox I can place this function on." So it goes to claim a worker from the placement service. The placement service does its evaluation — its intelligence — to say, "OK, here's a good place for you to provision that sandbox," and gives that back to the worker manager. Then we pick up where we were: the worker manager creates the sandbox, downloads the code, initializes the runtime, and calls the init in your code, and again we say we have a warm sandbox. The worker manager lets the front end know, the front end calls invoke, your code runs, and it then lets the worker manager know when it's done running, so that it knows it has a warm sandbox.

A little bit more about the placement service. It's responsible for ensuring sufficient worker capacity to continue to fulfill worker manager requests for hosts. When the placement service hands out a worker to the worker manager, it provides that worker with a lease of between six and ten hours. The reason for the lease is to enable worker cycling; however, lease duration is also impacted by function duration — you can't have a function that runs longer than your lease. When the worker gets close to its lease expiry, the worker manager must return the worker, and when the placement service receives a worker with an expiring lease, it will reprovision that worker. If the worker manager finds its worker to be close to expiration, it stops reserving sandboxes on the worker so that all of the sandboxes become idle, and it's at that point that the worker can be returned.
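As a rough illustration of that lease lifecycle — draining a worker as its six-to-ten-hour lease approaches expiry and returning it once its sandboxes are idle — here is a sketch with assumed names and an assumed drain margin; it is not Lambda's code.

```python
# Illustrative sketch only: worker leases of 6-10 hours, drained near expiry.
import random
import time

DRAIN_MARGIN_SECONDS = 30 * 60   # assumed: stop reserving 30 minutes before expiry

class WorkerLease:
    """A worker handed out by the placement service with a 6-10 hour lease."""
    def __init__(self, worker_id: str):
        self.worker_id = worker_id
        self.expires_at = time.time() + random.uniform(6 * 3600, 10 * 3600)
        self.draining = False

    def near_expiry(self) -> bool:
        return time.time() > self.expires_at - DRAIN_MARGIN_SECONDS

def maintain(lease: WorkerLease, worker, placement_service) -> None:
    """Stop placing new sandboxes near lease expiry; return the worker once fully idle."""
    if lease.near_expiry():
        lease.draining = True   # worker manager stops reserving sandboxes on this worker
    if lease.draining and worker.busy_sandbox_count() == 0:
        placement_service.return_worker(lease.worker_id)   # worker gets reprovisioned
```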
Auto scaling is nearly always necessary, but where it's heavily used is for workloads that need to rapidly provision sandboxes for a limited time period and then return them when completed — like customers such as Fannie Mae, who scale to between 20,000 and 50,000 concurrent executions over minutes.

Failure is always a possibility, and Lambda is designed to handle it, whether it's a host failure or a complete Availability Zone failure. Lambda is built across multiple Availability Zones, and system components are striped across those zones with, as we discussed earlier, load balancing and redundancy across the service layers of the architecture. In addition, Lambda monitors the health of hosts and removes unhealthy ones: when a worker becomes unhealthy, the worker manager detects that and stops provisioning sandboxes on it, and when an entire Availability Zone fails, the system components continue to execute, just without routing traffic through the failed zone. Now, as discussed earlier, Marc will pick up and go into the details on the worker. Thank you. [Applause]

(Marc) So — let me click forward here, there we go — it's no secret that multiple Lambda functions run on the same hardware, on the same host, at the same time. The reason we do that is that it's just not cost-effective for us to buy data center servers with 128 megabytes of RAM, so we run multiple functions on a server at the same time. That leads to one of the most frequent questions we get from customers: how do we isolate the different functions running on a particular worker? Isolation generally means two things to people. One is security, and the other is operational isolation — by that I mean, how do you ensure that functions run with consistent performance when there are other functions on the same hardware, how do you prevent noisy-neighbor impacts, and so on.

To dive into that, I'm going to talk a little bit about the software stack that runs on these workers. Holly talked about the workers — these are the hosts that run your code — and this is what the stack on a worker looks like. At the top of the stack is the most important part, and that is your code: the stuff that comes from your function zip, or from the layers you heard Werner talk about this morning. The next layer down is the Lambda runtime — the Java or Node.js or Python that comes built into Lambda. Then the sandbox, and the contents of the sandbox are a pretty full-featured copy of Linux. You can go poking around in the sandbox that Lambda functions run in — look in /usr/bin and /usr/lib — and there's quite a lot of stuff there, because code that's built on an operating system expects that stuff to be there. The next layer down is the guest OS — in our case Amazon Linux — and we run multiple guest OSes on a box, sometimes many hundreds or thousands, isolated from each other using virtualization, using a hypervisor. Then there's the host OS, again Amazon Linux, which is what the hypervisor runs on, and then the hardware itself.

From an isolation perspective, the first three layers — your code, the runtime, and the sandbox — are only ever used by one function. If you use Lambda you'll know that multiple invocations land in the same sandbox in serial: if you call the same function once, then again, then again, those will all go to the same sandbox in serial; they won't overlap concurrently (that's when we scale up), and we never reuse a sandbox across multiple functions. The guest operating systems are shared within an account: multiple functions within one account will run on the same guest operating system, either at the same time or when we destroy the sandbox for one function and recreate one for another. But guest operating systems are never shared across multiple AWS accounts. The boundary we put up between accounts is virtualization, and we think that's the minimum security bar for isolation of functions between accounts, and in a lot of ways also the minimum operational bar.

So let's step through these layers and talk about how we achieve operational and security isolation. Underneath the sandbox layer is the same technology that powers containers. The thing about Linux containers that you probably know is that Linux containers don't really exist: instead, "containers" are a grouping of different functionality built into the Linux kernel — a kind of toolbox that you can build sandboxes and containers out of — and we use a number of tools from that toolbox for our sandbox isolation. The first of those tools is cgroups, or control groups. Cgroups are a mechanism to say that this process — and anything it forks, and any threads it creates — is only allowed to use a certain amount of CPU, a certain amount of memory, a certain amount of disk throughput, a certain amount of memory throughput. This is operational isolation, and it's how, for example, we enforce the maximum function memory footprint. Cgroups are sticky — actually, all of these mechanisms are sticky: you add a process to a cgroup and it can't get out; it can't take itself out.
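For a feel of the kernel mechanism itself, here is a small sketch of capping a process with cgroups. It assumes a legacy cgroup v1 hierarchy mounted at /sys/fs/cgroup and root privileges, and it illustrates the mechanism rather than Lambda's actual worker code.

```python
# Illustrative sketch only: capping a process's memory and CPU with cgroup v1.
import os

def limit_sandbox(pid: int, name: str, memory_bytes: int, cpu_fraction: float) -> None:
    mem_dir = f"/sys/fs/cgroup/memory/{name}"
    cpu_dir = f"/sys/fs/cgroup/cpu/{name}"
    os.makedirs(mem_dir, exist_ok=True)
    os.makedirs(cpu_dir, exist_ok=True)

    # Hard cap on memory, e.g. 128 * 1024 * 1024 for a 128 MB function.
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(memory_bytes))

    # CPU cap expressed as quota/period, e.g. 0.25 of one core.
    period_us = 100_000
    with open(os.path.join(cpu_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cpu_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(cpu_fraction * period_us)))

    # Membership is "sticky": once written into the cgroup, the process
    # (and anything it forks) stays subject to these limits.
    for d in (mem_dir, cpu_dir):
        with open(os.path.join(d, "cgroup.procs"), "w") as f:
            f.write(str(pid))
```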
The next mechanism we use is called namespaces. There are a whole bunch of resources in the Linux kernel — process IDs, user IDs, group IDs — and namespaces are just what they say: a namespace for those IDs. If you go digging around inside the Lambda sandbox, you'll see that the process your Lambda function runs as is always process ID number one. How can you have multiple functions with the same PID one? Well, you don't: it's just PID one within its process namespace, and it has a real PID that is not one; but within the namespace, which is where you are if you're looking at this stuff, you see a namespaced set of process IDs.

Then there's seccomp, or seccomp-BPF. This is a kind of firewall for the kernel. The Linux kernel has a whole bunch of syscalls — they expose what the kernel can do, like opening sockets, opening files, reading and writing files, and so on — and what seccomp lets you do is say that this process can only call these syscalls, or cannot call those syscalls, or can call these syscalls but only with these arguments. We use seccomp-BPF to cut out bits of the kernel's surface area and restrict it to only the functionality that Lambda functions actually need to run, and this is one of the primary security controls.
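Here is a tiny illustration of that idea, assuming the libseccomp Python bindings (the `seccomp` module) are installed; a real sandbox profile would allow far more syscalls and would use argument filters as described above.

```python
# Illustrative sketch only: a minimal seccomp-BPF allow-list.
import seccomp

def apply_minimal_filter():
    # Default action: kill the process for any syscall not explicitly allowed.
    f = seccomp.SyscallFilter(defaction=seccomp.KILL)

    # Allow only a small set of syscalls the sandboxed code actually needs.
    for name in ("read", "write", "exit", "exit_group",
                 "rt_sigreturn", "brk", "mmap", "munmap"):
        f.add_rule(seccomp.ALLOW, name)

    # Loading the filter is one-way: the process cannot remove it afterwards.
    f.load()
```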
Next, iptables, ebtables, routing, and various other things provide network isolation, and chroot, bind mounts, and loopback mounts provide the underlying filesystem. The next layer down in the stack — in the isolation story — is virtualization and device emulation. This uses virtualization features built into the hardware, like VT-x on Intel, to make the hardware essentially pretend to be multiple CPUs instead of one, all controlled by the hypervisor and virtual machine monitor; I'll get into that a little later when I talk about Firecracker.

There are two ways that we build Lambda today — two ways that Lambda workers come together. (Whoops, I've gone way ahead here somehow; a little bit of a spoiler there.) One of those is on EC2 instances. As Peter DeSantis said on Monday night, when we started Lambda we started by building every worker as a separate EC2 instance, and we did it that way for a couple of reasons: one, it's a great security boundary, and two, it was a fast way to build the system. We still use this mode today; we run these Lambda workers as normal EC2 instances — exactly the same kinds of EC2 instances you could go launch today — and we usually run those instances on the Nitro platform. The other kind of isolation, which we've just started talking about this week, is based on our new Firecracker VMM. With Firecracker, instead of running one instance per account, we run one bare-metal EC2 instance — again, the same kind of bare-metal EC2 instance you can go and buy — and we use Firecracker to launch many, many microVMs, hundreds or thousands of microVMs, on top of that worker. This is a more flexible, more agile boundary for us than instances, and it has some really great features. One of those is simplifying the security model: instead of having the layering of one function, one account, many accounts, we've simplified it down to one function in a microVM, and multiple microVMs across multiple accounts on a piece of hardware. This is really good for us in a whole lot of ways, which I'll talk about when I get to utilization, but it's also nice for the Lambda programming model, because it provides strong isolation even between functions when we're running in this Firecracker mode.

I want to talk a little bit about one of the innovations we put into Firecracker which helps raise the security bar. By way of introduction: I said we're running hundreds or thousands of Firecracker microVMs on a host. Obviously these boxes don't have hundreds or thousands of network cards, and they don't have thousands of hard drives, but each guest VM — each of those microVMs — sees a network card and sees a hard drive, and to user-space code running in that microVM those look like hardware devices. How does this work? Through the magic of virtualization: a little bit of cooperation between the guest OS kernel and the hypervisor, and an implementation of device emulation inside Firecracker. We use a protocol called virtio, which is a way to put a driver inside the guest kernel that implements a block device and a network card in a way that is very efficient, very simple, and very secure. The efficiency comes from the fact that one of the most important things in virtualization performance is reducing the number of times the guest has to switch between the guest and the host operating system. You can imagine the simplest possible interface would be a way for the guest operating system to write a byte, or some words, into the host — and it would have to do that many times just to send a network packet. With virtio, instead, the guest builds up data structures in memory and then rings a doorbell on the hypervisor, saying "ding-dong, there's some work for you to do, there are some packets here for you to send." The device emulation implementation picks that up and sends it to the real hardware. That's bread-and-butter virtualization. The innovation in Firecracker is that this device emulation runs inside a very restricted sandbox — a kind of second-layer sandbox that sits around the device emulation code with very few privileges. What's nice about this is that we get to use all of those controls I talked about earlier — seccomp-BPF and so on — to provide an additional layer of security around device emulation. We built Firecracker in Rust and we paid a huge amount of attention to the security of that boundary and the quality of the device emulation implementation, but it's also one of the most complex pieces of code, so having that second layer of sandboxing around it provides a second layer of security control, which we think is very important.
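For a feel of how a microVM like this is driven, here is a rough sketch of configuring and booting one Firecracker microVM through its API socket. The endpoint paths follow Firecracker's published REST API, but the socket path, kernel and rootfs locations, and the use of the `requests_unixsocket` package are assumptions for this example — it is not how Lambda's internal systems drive Firecracker.

```python
# Rough sketch: booting a single Firecracker microVM over its API Unix socket.
# Assumes a running `firecracker --api-sock /tmp/firecracker.socket` process,
# a kernel image and rootfs on disk, and the `requests_unixsocket` package.
import requests_unixsocket

BASE = "http+unix://%2Ftmp%2Ffirecracker.socket"   # URL-encoded socket path (assumed)
session = requests_unixsocket.Session()

# Size the microVM: a small slice of the bare-metal host.
session.put(f"{BASE}/machine-config",
            json={"vcpu_count": 1, "mem_size_mib": 128})

# Point it at a guest kernel (paths are assumptions for this example).
session.put(f"{BASE}/boot-source",
            json={"kernel_image_path": "/var/images/vmlinux",
                  "boot_args": "console=ttyS0 reboot=k panic=1"})

# Attach a root block device, exposed to the guest via virtio-block.
session.put(f"{BASE}/drives/rootfs",
            json={"drive_id": "rootfs",
                  "path_on_host": "/var/images/rootfs.ext4",
                  "is_root_device": True,
                  "is_read_only": False})

# Start the guest; the virtio device emulation runs in the restricted
# Firecracker process on the host side.
session.put(f"{BASE}/actions", json={"action_type": "InstanceStart"})
```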
Next: utilization. This is about how we keep those workers busy — how we keep our servers busy. How do you measure utilization? We think of utilization as the percentage of resources — CPU, memory, and so on — doing useful work rather than being idle or wasted. By doing useful work, what I mean is that ideally I want every CPU cycle on my worker to be running your code, and I want every byte of RAM on my worker to be filled with your data. That's good for us because it's very efficient, and good for you because you get better cache locality and better container reuse, which means performance. The good news for you is that with Lambda you only pay for useful work, so you don't have to worry about utilization — utilization is entirely my problem and Holly's problem — but there are some interesting topics here I wanted to dig into. One of the things my team spends a huge amount of work on is this optimization of utilization: packing functions onto workers to keep those workers optimally busy.

So let's talk about one topic there. Here are seven sandboxes for a function — I just arbitrarily chose the number seven — and, following Holly's diagrams, we've got seven units of concurrency going on here, so we've scaled up to create seven sandboxes. The typical distributed-systems approach, if you had seven servers, would be to load balance between them: you would take some amount of load and try to spread it out across the fleet as evenly as you can. You do this for a couple of reasons, but the primary one is that it's really hard to tell how busy computers are. That's because there are so many bottlenecks: there's the easy stuff like CPU and memory, harder stuff like networks, and even harder stuff like memory buses and caches. It's very difficult to boil down the busyness of a server to one number, or even to any reasonable number of dimensions. So what people do in practice is set fairly conservative auto scaling goals, make their fleet bigger when they hit some CPU utilization threshold, use that as a proxy for the real load, and load balance across those servers. It's a very time-honored pattern and a pretty great one.

We do something quite different in Lambda: we intentionally concentrate the load on the smallest possible number of busy sandboxes. That's a good thing for your code, because keeping a small number of sandboxes very busy means that any caches you have, any precomputed state, any connections you have open are kept optimally busy — which is really great for temporal locality and cache locality — and it's good for us because it gives us a really good ability to auto scale. Why can we get away with this? Because of the semantics of the Lambda API: there's only ever one invoke going on in a sandbox, so a sandbox is busy in a very binary way — it either has an invoke running on it or it doesn't. Just by counting the number of sandboxes that have an invoke running on them, we get a very clear picture of the load across the system, and by packing load onto the smallest number of sandboxes, we can simply count the number of idle ones and scale them down, or count the number of busy ones, see that it's approaching the total, and start scaling up. This is all work the placement service does, and it's work we can do because of the semantics of Lambda.
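Because a sandbox runs at most one invoke at a time, the load picture really can be reduced to counting. The sketch below is purely illustrative — the names and thresholds are assumptions, not the placement service's real policy.

```python
# Illustrative sketch only: scaling decisions from busy/idle sandbox counts.
from dataclasses import dataclass

@dataclass
class SandboxPool:
    busy: int
    idle: int

    @property
    def total(self) -> int:
        return self.busy + self.idle

def scaling_decision(pool: SandboxPool,
                     scale_up_at: float = 0.9,
                     scale_down_at: float = 0.5) -> str:
    """Decide whether to add or reclaim sandboxes purely from counts."""
    if pool.total == 0 or pool.busy / pool.total >= scale_up_at:
        return "scale_up"     # busy count is approaching the total
    if pool.busy / pool.total <= scale_down_at and pool.idle > 0:
        return "scale_down"   # plenty of idle sandboxes can be reclaimed
    return "hold"

# e.g. scaling_decision(SandboxPool(busy=9, idle=1)) -> "scale_up"
```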
Another topic in utilization — and the really interesting one for me — is how you pick which workloads to run on a worker. This is a worker; this is a server. Yes, there are servers in serverless. The obvious thing to do here, and the thing you'd be forced to do if you were building a Lambda for yourself, is to run multiple copies of the same workload: you cut the worker up into multiple sandboxes and run multiple copies of the same workload. It turns out that's a bad thing to do, because multiple copies of the same workload have very correlated load. When one spikes up on CPU, it's quite likely another will spike up on CPU at the same time, because they're doing the same work — or on memory usage, or bus usage, or network usage, or whatever. Those loads are very correlated, and that really limits how densely you can pack onto hardware, because your aggregate load is going to be very spiky.

So what can you do about that? How can you flatten that out? You can take advantage of statistics and, simply put, pack as many uncorrelated workloads onto a server as you can — a diverse set of workloads instead of multiple copies of the same one. That makes the aggregate load way better behaved: it brings down the peaks, brings up the average, and makes it easier to predict scale. That might sound counterintuitive, so let's see if we can build an intuition for why it's true. When I was in high school I really enjoyed playing Dungeons & Dragons, and one of the things you do in D&D is throw a 20-sided die. So here, I sat at my desk one day, threw a 20-sided die a hundred thousand times, and counted how often each of the twenty values came up — and you can see it's pretty consistent; I'm obviously quite good at rolling dice. Then one night my friends and I wanted to play some D&D and we'd left our 20-sided dice at home, but we had some 10-sided dice. Can we just take two 10-sided dice, throw them, add the two numbers up, and call that a 20-sided die? It turns out you can't. This is what the distribution looks like for the sum of two ten-sided dice, and it's true for a very simple reason: there's only one way to make twenty — a 10 and a 10 — but there are a lot of ways to make 12: 10 + 2, 9 + 3, 8 + 4, 7 + 5, and so on. So it just becomes much more likely that you'll roll a 9, 10, 11, 12, or 13 than that you'll roll a 20. And it turns out that the more of these uncorrelated dice you throw, the better the distribution behaves: even throwing ten dice, you can see I've really pushed down the extremes — it's really unlikely I'll roll 100, and really unlikely I'll roll 10 — and I've moved the chances of load on my server, or the sum of my dice, into a narrow, predictable spike. The more workloads you put on a box — and, very importantly, the more uncorrelated workloads — the better behaved they are in aggregate. This is something that's very powerful for us at scale: the fact that AWS runs so many different customer workloads gives us the ability to find uncorrelated ones and put them onto the same hardware. That's something you can't do, or that doesn't work well, at lower scale, and it's fairly unusual in computing to find problems that get easier at scale, so I rather enjoyed this one.
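The dice experiment is easy to reproduce. This small simulation — a sketch using only the standard library — shows how the extremes get squeezed out as you add independent dice, which is the same effect as packing uncorrelated workloads onto one box.

```python
# Reproducing the dice intuition: the sum of many independent dice clusters
# tightly around its mean, just as many uncorrelated workloads produce a
# smoother aggregate load than several copies of one correlated workload.
import random
from collections import Counter

def roll_sum(num_dice: int, sides: int, trials: int = 100_000) -> Counter:
    """Distribution of the sum of `num_dice` independent `sides`-sided dice."""
    return Counter(sum(random.randint(1, sides) for _ in range(num_dice))
                   for _ in range(trials))

for num_dice, sides in [(1, 20), (2, 10), (10, 10)]:
    dist = roll_sum(num_dice, sides)
    total = sum(dist.values())
    # Fraction of trials landing on the extreme values (min or max possible sum).
    extremes = (dist[num_dice] + dist[num_dice * sides]) / total
    print(f"{num_dice}d{sides}: extreme outcomes in {extremes:.2%} of trials")
```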
It turns out we can actually do better than chance by going and finding workloads that are anti-correlated — ones that spike down on CPU when another spikes up — and that's something we've started doing in our placement service: going off and finding workloads that pack together really nicely and making that distribution even tighter than it would be if it were just based on chance.

Moving on from this topic, I want to talk about another investment we're making over the course of 2019, enabled by our work on Firecracker, and that's an investment in improving VPC cold-start latency. Let's talk about how VPC works in Lambda today. When you create a function in your VPC and you invoke it, we go off and create an EC2 ENI — an elastic network interface, just the kind you would use in EC2 — and we attach that ENI to the worker. Attaching the ENI to the worker takes some amount of time, because EC2 has to go and do a huge amount of rejiggering of the network to get the right packets to the right places, and every one of those ENIs consumes an IP address in your subnet. It's a great model in some ways: it's conceptually simple, and it supports the full VPC feature set, which is why we started with it as an implementation. But it has this huge downside of VPC cold-start latency, which we've heard from a lot of customers is something you care about deeply. So in 2019 we're changing the way this works: we're taking the ENI and moving it off the worker, and instead of doing network address translation (NAT) locally between the Lambda function on the worker and the ENI, we're moving that into a remote NAT and securely tunneling from the Lambda function to the remote NAT. What does this mean? In practice it means we can use one ENI across many different workers — we can essentially multi-tenant those ENIs — and that lets us get away with many, many fewer ENIs. Because we have many fewer ENIs, a lot of the time we can create them when you create a function rather than when the function scales up, and what that means for you is much more predictable VPC latency. You'll see this coming over the course of 2019, along with faster scaling: the ability to ramp up faster than before without running into limits around ENIs or IP addresses. There's another reason this is so important, and that's that it's just way easier to use. One of the edge cases in Lambda VPC today is that it's hard to predict how many IP addresses a Lambda function is going to need: as your function scales up, every single worker consumes an IP address from your subnet, and that makes your network management folks' lives fairly complicated. In the new model things are much simpler, because for most workloads we're going to need exactly one IP from each subnet, and that's going to make that management task way simpler than it was in the past.
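As a back-of-the-envelope illustration of why this matters (not an official AWS sizing formula — the sandboxes-per-worker figure is an assumption): in the old model IP demand grows with the number of workers your function fans out across, while in the remote-NAT model it is roughly one IP per configured subnet.

```python
# Back-of-the-envelope illustration only, with assumed parameters.
import math

def ips_old_model(peak_concurrency: int, sandboxes_per_worker: int) -> int:
    """One ENI (one subnet IP) per worker the function fans out across."""
    return math.ceil(peak_concurrency / sandboxes_per_worker)

def ips_new_model(num_subnets: int) -> int:
    """Roughly one IP per configured subnet, largely independent of scale."""
    return num_subnets

# e.g. a function peaking at 3,000 concurrent executions, with an assumed
# 50 sandboxes per worker: ~60 subnet IPs before, ~3 (one per AZ) after.
print(ips_old_model(3000, 50), ips_new_model(3))
```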
I wanted to get back to Firecracker as we wrap up here and talk about why we've talked about it so much this week. It's because we're extremely excited about how it enables our innovation. Firecracker gives us much lower startup time than other similar virtualization solutions, it gives us lower memory overhead and very similar performance, but most importantly it gives us a huge amount of flexibility, and that's giving my team the ability to do all kinds of things — like that VPC improvement — that you're going to see show up in Lambda over time. For us, Firecracker unlocks innovation; for you, Firecracker unlocks higher utilization and higher scale — the ability for us to give you faster scale ramp-up and higher amounts of absolute scale. So we're very excited about Firecracker and about what it's going to let us do over the next few years.

In conclusion: you heard Holly talk about how Lambda goes together — the front-end components, the invoke service, the counting service, and placement — and I talked about the worker, how we think about security isolation, how we think about utilization, and how we think about packing. But the great thing about Lambda, the thing I'm excited about in serverless, is that you can leave this room and forget about all of this stuff; it's just been for your entertainment. I hope you enjoyed hearing about it, and that you can go off and build things without needing to look under the covers — just build your business logic and deliver value to your businesses without needing to understand a lot of this deep architecture. Thank you very much. Holly's going to join me back on stage for some questions — I just flew through them in the deck again. Thank you. [Applause]

Any questions?

Q: Does a Lambda function work with API Gateway in the same way?
A (Marc): That's probably a great question for Holly, but yes — Lambda functions work with API Gateway; it's integrated with API Gateway, and that's actually a very common use case for calling into Lambda functions. The invocation works just as you'd expect: API Gateway literally calls the Lambda invoke API, just like you can call the Lambda invoke API. One thing we like architecturally in AWS is using our own public APIs, because if we need an API or a control, chances are you do too, and using our own APIs gives us a great understanding of the needs of customers at scale using that same API.

Q: Instead of using VMs — microVMs — why don't you run Lambda functions on top of containers, or on top of EKS instead of EC2?
A (Marc): Sure — why do we not run Lambda functions as containers? We believe that virtualization is the right security boundary across accounts. I can't go into all of the details of why we believe that here, but I'd be happy to talk later; we think hardware virtualization should be the minimum bar for code multi-tenancy across multiple accounts.

Q: With the changes to the ENIs and the concurrency, does that mean we won't have to do that crazy concurrency formula when deciding how many IPs we need for a subnet the Lambdas are attached to?
A: Yeah, we hope so, over the course of 2019. We'll have more precise dates to share, hopefully in Q1.

Q: My question is about the NAT gateway — it showed up in your diagram that an instance will pop up inside of the VPC. Do we have to pay for the running costs of that NAT gateway? They can be quite expensive on a monthly basis.
A: No, that's something that's built into our architecture. This is completely under the covers: other than the lower consumption of IP addresses, you're going to see no change in networking capabilities and no change in your bill.
Q: And the last question I had on that: currently, if you want outbound traffic to go through a static IP address, you have to go through a private subnet routed through a NAT gateway into another public subnet. Is there any way of simplifying that, so that the Lambda instance can run inside the public subnet instead?
A: No, not right now, but that is a great feature request and something we will take a look at. Thank you.

Q: I noticed that you have your virtualization directly over hardware, so how do you deal with contention of resources — for example, contention on the network card and on the hard disk?
A (Marc): That's a very deep topic and a great question. Probably the best thing I can do is point you at some re:Invent talks from last year — there was one that Anthony Liguori did and one that Matt Wilson did; you can find them on YouTube, or if you find me afterwards I can find the links for you. They explain in detail how the EC2 Nitro system does that.

Q: I'm very curious about the workers. How do you keep track of what your workers are doing, where they are, and what they've done, and what kind of technology and languages do you use?
A (Holly): Let's take that one piece at a time. How do we keep track of workers? We have multiple systems that keep track of workers — as you saw, both the placement service and the worker manager — and it really depends on the current state and how we're using that worker at that point in time. Can you say a little more about the languages question — are you asking about the different language runtimes we support?
Q: No, I'm more interested in the internals. You mentioned state — where does that state live, and how is it maintained inside the worker?
A: Inside the worker we also keep track of all of the sandboxes that we create, so there are data structures inside the worker that help do that.

Q: More of a confirmation here. You mentioned one of the pain areas is the ENI consuming an IP of a subnet. Without understanding that detail, what I ended up doing when we created a Lambda that connects to a VPC — it goes over Direct Connect into our data center — was connecting three subnets during the transition phase. From what I understood, does it make sense to have just two subnets connected, basically one for failover?
A (Marc): You're still going to want, ideally, one subnet per Availability Zone that the thing you're talking to runs in. So if your back end runs across three Availability Zones, you're going to want three subnets, to balance the load and to make fault tolerance better; if it runs across four AZs, you're going to want four subnets. There are a couple of reasons: one is that it gives us more placement flexibility, so we can make better decisions on your behalf, but it also means that if you don't have a subnet in one of your AZs, we won't be sending load into that AZ, and you'll end up with unbalanced load on your back end. So we think this new VPC approach is going to make things much easier by reducing the number of IPs you need, but it's not going to change the best practice of having essentially one subnet per AZ in front of your back ends.
Q: I was just thinking, until you have that feature available, should I think about reducing to one subnet, just to avoid the IP consumption? But it sounds like, from a balancing point of view —
A: Yeah, from a load balancing and fault tolerance point of view, it's better to have the three subnets.

Q: All right, I've got two questions. The first one is about cold start. I've got customers who use API Gateway with Lambdas behind it, and they have some very strict SLAs, and they've noticed that if there are cold starts they don't meet their SLAs. So they're using these really convoluted frameworks to make sure they always have a certain number of Lambdas provisioned, but it's just hard at the moment. Is there anything in the works that allows you to specify how many warm Lambdas you want provisioned at any given time?
A (Holly): I'll take this one. We are very aware that for certain use cases latency can be an issue, and one of the things we're interested in hearing from customers — and you've just asked about it — is whether there's a way of guaranteeing that something is warm. So it's great to hear that that's something that interests you.
Q: OK, thanks — and I've forgotten my second one, I'll get back to you.

Q: A question regarding more heterogeneous resource allocation — potentially non-proportional allocation of resources, or perhaps support for elastic GPUs. Is this something you're thinking of? Are there technical implications, or is it more of a business decision?
A (Marc): I'd be very interested to hear more about what you would like to see there in terms of controls, so maybe we should chat afterwards. But as a kind of meta point: one of our goals with serverless is to keep things as simple as possible. That doesn't mean we want to compromise on capabilities like that, but we want to be very thoughtful about the buttons and knobs and controls we add, because the more of them there are, the more there is for you and your teams to understand in order to use Lambda effectively. We think we can get to most use cases that need additional controls without building additional controls, so I'd like to hear more about your use case and see if it fits into our thinking about how to solve these problems without pushing that complexity onto you.
Q: I guess maybe we can take this offline.
A: Yeah.

(We have time for just maybe one or two more questions, I just want to let people know, if that's okay.)

Q: Is there anything on the roadmap to decouple the amount of memory and the amount of CPU resources assigned to Lambda functions? Right now, against an SLA problem, we have to assign well over a gigabyte of memory to a Lambda function that only requires 42 megs in order to meet our SLA target, which is a tremendous waste of memory.
A (Marc): I think it's the same answer as the previous one: we think we can get rid of that waste on your behalf without adding additional controls. So if you have a moment afterwards, or we can get in contact, I'd like to hear exactly what you'd like to see there in terms of controls.
Q: Actually, yeah, let's chat afterwards.

Q: I'm wondering, what keeps you up at night about this system? What are you pretty scared of?
A (Holly): I was talking to Marc backstage before our talk, and I'm actually very excited about where we're at with serverless technology and where we're going.
Back to the conversation on Firecracker and the innovation that we can drive: I truly believe that this is the future of computing, and so I sleep well at night. [Laughter]

(Do we have time for one more? Time for one more, Marc? Yeah.)

Q: Hi, thanks. I've been doing load testing, running many simultaneous invocations of pretty simple functions that do unremarkable things, but I sometimes see wild variation in the distribution of run times — easily a factor of five or six between the minimum run time and the maximum run time. I'd just like to understand what I'm seeing. Is that something you would expect? Is it something like there's always some number of instances that are unhealthy, which is what it looks like? I'm curious if you could comment on that.
A (Marc): We certainly wouldn't expect high variance at steady state. If you're running a constant load, what we would expect to see is very consistent performance — obviously assuming your code is doing something that takes a consistent amount of time — so if you're seeing something other than that, I'd be interested in getting in touch so we can dive into it. At not-steady-state — if you're ramping up or ramping down — you get the auto scaling behavior that Holly talked about, where we're adding or removing sandboxes, and for the vast majority of cases where you see inconsistent latency, it's during those scaling times. That's something we're working very hard on improving over the next year, starting with the VPC latency, but working on all aspects of that problem.

Well, thank you everyone — thank you for coming and watching our talk. [Applause]
Info
Channel: Amazon Web Services
Views: 59,479
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Serverless, SRV409-R1
Id: QdzV04T_kec
Length: 59min 12sec (3552 seconds)
Published: Fri Nov 30 2018