AWS re:Invent 2018: Powering Next-Gen EC2 Instances: Deep Dive into the Nitro System (CMP303-R1)

Video Statistics and Information

Captions
Why don't we get started... all right, let's go ahead and get started. My name is Anthony Liguori. I'm a Senior Principal Engineer with EC2, and I've led the development of the Nitro system over the last year. I've also led the engineering effort behind Firecracker, which we just announced earlier this week, and most recently AWS Outposts, which we just launched today. So it's been a really exciting week, and I'm super excited to share all of the exciting things we're doing with all of you. I really appreciate you coming to a 7:00 p.m. session toward the end of this event; I know I'm standing between everybody and beer and food, so I appreciate you taking the time to come and listen to what I have to say.

We're going to start out with an overview of what Nitro is, why it matters, and how it works. Last year I did another presentation on Nitro and we kind of scratched the surface; this year I want to go really deep on a specific component, the Nitro security chip, so we'll talk a lot about why that exists, what it does, and what we use it for. Then I want to walk through some of our recent launches and talk about how Nitro enabled them and how it's providing benefits to you very concretely with some of our recent instance types. Finally, I'll end with what's coming next: what is the Nitro team working on, and what are we looking toward in the future?

The Nitro project began with a question to ourselves: after almost ten years of developing EC2, could we take all of our learnings and look hard at the physical hardware platform we use to host EC2 instances? We have a very, very large number of these servers, and we questioned every single component in the system and asked: is this the right component for the specific purpose we're trying to meet here, which is to serve EC2 instances? Then, from a software standpoint, we looked at every piece of software in the system and asked: do we really need systemd on every single server? The answer is no. Do we really need SSH and interactive logins? The answer is no. We went through and really tried to understand, since these servers are only used to run EC2, how we could truly optimize them. This is ultimately what led to Nitro, a platform we launched with the C5 instance type in November of last year and then really started to talk about at re:Invent last year, so it's just about one year old.

While this is really the first time we've started talking about Nitro publicly, Nitro is a system that has been under development since at least 2013, when we launched the C3 instance type with enhanced networking; that was our very first use of Nitro hardware. Since we launched Nitro last year, every single instance type we've launched has been based on the Nitro system. So chances are, if you're using a C5, M5, R5, T3, any of the AMD-based instance types, or if you got really excited like me and launched an A1 instance as soon as it became public, it's all based on the Nitro system.

When we think about the Nitro system, the reason we call it a system is that it's really composed of three independent parts. The first part is the Nitro card itself. This is the I/O accelerator within the system, and this year one of the things I wanted to talk about is that it's actually not a single card but a family of cards, each serving a different purpose; we'll talk about those cards a lot more in the next few slides.
Security is the most important thing we do at AWS. It is part of every conversation we have, and it's the first thing we think about when building anything. So when we went to build Nitro, one of the big questions we asked ourselves was whether we could do more from a security point of view, and that led to the Nitro security chip, which we'll talk a lot more about in a bit. Finally, once we've done all of this, once we have all of our I/O offloaded and the stack simplified, it led us to revisit what a hypervisor actually needs to be, and that led to the development of the Nitro hypervisor. The Nitro hypervisor is unlike any other hypervisor out there in that it does very, very little, and we'll talk about exactly what it does in a bit.

I often like to refer to I/O in the world of virtualization as the soul of a virtual machine. It's what gives it its personality: the type of block device you get, the type of network interface you get, that really defines your experience with that virtual machine. So we'll start out by talking about what the Nitro cards are and how they work. As I mentioned, there are multiple Nitro cards; today there are four distinct ones: the Nitro card for VPC, the Nitro card for EBS, the Nitro card for instance storage, and finally the Nitro controller. All of these cards are physically different, and if you think about it, that makes a lot of sense. A controller for instance storage needs to talk to NVMe devices behind it and to the host system; a Nitro card for VPC needs to talk to a network. So the physical hardware you build is going to be different for those different use cases. However, they are all based on the same underlying ASIC, and we obviously share software where it makes sense. Something we haven't really talked about in the past is the Nitro controller; we'll say a bit more about it later, but it's really the brain of the overall system, the thing that ties everything together and makes it cohesive.

So let's dive deep into each of these cards. The Nitro card for VPC: if you could take it out of one of our servers and put it in a different server, what the OS would see is something that looks like a normal network adapter, not that different from something you'd get from any of the major vendors. The driver that binds to it is the ENA driver, the Elastic Network Adapter driver, but this is a special network card in that it only works for VPC networking, only within the context of the Nitro system. The Elastic Network Adapter is something we introduced a few years ago to solve a problem that NICs tend to have. If you're running on bare metal in your own data center, you don't really think about the fact that when you upgrade a server built for one-gig networking to ten-gig networking, you're adding a different NIC and, critically, usually installing different drivers. It's very rare (in fact I don't know of any cases in practice) to see a single driver that works across generations of the underlying network technology. This isn't a big problem when you're installing servers by hand locally, but in a virtualized environment where you're using things like AMIs it becomes very disruptive: you don't want to have to change your AMI, and you don't want to have to change your application. It's far better if, when we launch something new like C5n, you can take the instances you were previously running, relaunch them on the new instance type, and just get better networking; that's the experience we want customers to have, and that's what led us to build the Elastic Network Adapter interface. Today, drivers are available for all major operating systems, and for open source operating systems like Linux or FreeBSD those drivers are upstream, so they come by default and are shipped in all the major distros.
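(Editor's aside, not part of the talk: since the Nitro VPC data plane is presented to the guest through ENA, one quick sanity check is the enaSupport instance attribute. A minimal boto3 sketch; the region and instance ID below are placeholders.)

    # check_ena.py -- confirm that an instance has ENA enabled (the interface the
    # Nitro card for VPC presents to the guest).
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def ena_enabled(instance_id: str) -> bool:
        # describe_instance_attribute returns {'EnaSupport': {'Value': True/False}, ...}
        attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="enaSupport")
        return bool(attr.get("EnaSupport", {}).get("Value", False))

    if __name__ == "__main__":
        instance = "i-0123456789abcdef0"   # hypothetical instance ID
        state = "enabled" if ena_enabled(instance) else "NOT enabled"
        print(f"{instance}: ENA {state}")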
The other important component of the Nitro card for VPC is the actual VPC data plane. This is the bit that allows your instance to talk to other instances within a VPC. VPC is a software-defined network, so it's not surprising that we do encapsulation and decapsulation, but there are a number of other features we implement within the Nitro card for VPC, such as security groups, limiters, and routing. Security groups are implemented in the card itself because we want to enforce them as close to the origin of the data as possible, and this is one of the great things about the way security groups work within EC2: they don't rely on configuration within the instance. Because of the Nitro card, the same security groups apply even if you're running bare metal, where there's no hypervisor present; we can still implement that same level of security.

A question customers frequently ask is about limits, in terms of how much performance a specific instance type can get. One of the most important things we focus on, besides raw performance, is making sure customers have a consistent experience. We want every c5.2xlarge, no matter which server it lands on, which region or data center it's in, or where it is on the network, to always have the same performance experience. This is critical for distributed applications: you can't reason about a distributed system if every node behaves slightly differently, or if one node can handle more traffic than another; you simply can't scale that kind of distributed system well. Now, the question we often get asked is how many packets per second a particular instance type can handle, and unfortunately there isn't a simple answer, because not every packet is created equal in terms of how much it costs the overall system. For instance, if you have a workload that sends really large UDP packets and ultimately creates fragments, fragments tend to be expensive everywhere in the network; there's just no easy way to handle fragmentation. Generally you should try to avoid it, but people do it and we certainly support it. On the other hand, a very large TCP flow that can use the full MTU of the system is an easy packet to process; it's generally pretty straightforward. What we've done over time is build a fairly sophisticated set of limiters that take all of these factors into account to ensure that, no matter what is happening, you always get the same consistent performance experience. So the best advice we can give customers when it comes to limits is to take your application and actually try it: push it to whatever maximum throughput it can achieve, and then we're always happy to have a conversation about whether we think that's an accurate characterization of what the system can do.
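(Editor's aside, not part of the talk: the "try it and measure" advice can be approximated even with a crude load generator, though a single Python process will usually bottleneck on the interpreter long before it reaches the instance's limiters; real testing would use a purpose-built traffic tool. A rough send-side sketch; the target address is a placeholder.)

    # udp_pps_probe.py -- a rough, send-side packets-per-second probe.
    import socket
    import time

    def send_side_pps(target_ip: str, port: int = 9999,
                      payload_bytes: int = 64, seconds: float = 5.0) -> float:
        """Blast small UDP datagrams at a target for a few seconds and report the send rate."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = b"x" * payload_bytes
        sent = 0
        deadline = time.perf_counter() + seconds
        while time.perf_counter() < deadline:
            sock.sendto(payload, (target_ip, port))
            sent += 1
        return sent / seconds

    if __name__ == "__main__":
        # Hypothetical receiver inside the same VPC.
        print(f"~{send_side_pps('10.0.0.42'):,.0f} packets/sec pushed from this instance")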
Now, networking is super important for a lot of customers; a lot of their data flows through the network. But we also have many customers who depend on fast block storage, customers using relational databases or other workloads that expect block storage, and so the next card we built after the Nitro card for VPC was the Nitro card for EBS. This is actually something we launched with the C4 instance type, so it's been around for quite a while now, even though this wasn't how we exposed it to customers. If you took the current generation of that card and put it into a server, what you would see is an NVMe storage device. NVMe is a really exciting standard. It's relatively new, at least in storage terms, and it's common: my laptop has NVMe, and I suspect a lot of your laptops do too. Unlike the networking world, the storage world is a lot more standardized and a bit more forward-thinking, so NVMe will likely last us through multiple generations of storage technology. Exposing NVMe is one part of what the Nitro card for EBS does, but the other really important part is taking that NVMe interface and turning it into network traffic. This has become pretty popular these days; you'll hear phrases like NVMe over fabrics, and there are a lot of different implementations, everything from TCP to Ethernet to RDMA and the like. We've been doing this, remoting NVMe, for three or four years now, and it's worked out really well for us. Finally, the other really important capability of the Nitro card for EBS is volume encryption. We rely on the Nitro card to encrypt all the data for encrypted volumes at the source of the traffic, using hardware-based encryption, so with Nitro you never have to trade off encrypting your data against getting the full performance out of the system, because it's all handled in hardware.

Now, network storage is great, and for most customers it's the right way of doing block storage: you get better durability, and you get capabilities like snapshotting that make backup and a lot of other things super easy. But some customers have asked us for local storage, and so with the I3 instance type, which we launched maybe 18 months ago, we introduced a Nitro card specifically for instance storage. The customers looking to use this kind of instance storage generally have their own durability mechanism: they may be running a database that replicates data, or they're dealing with data that is transient or isn't needed after a particular workload is done. Now, for a few generations of instance types we did not have instance storage. If you go back to the M1 and C1 generations, even up to C3, those platforms always came with storage by default, and the reason is that networking and storage have evolved very differently. In the C1 and M1 days you were still dealing with hard drives, and those were the days of one-gigabit networking; if you look at what a hard disk, and particularly multiple hard disks, could do in terms of throughput, you could actually achieve more throughput from those underlying drives than you could from the network. So it made sense to offer local storage on all instance types, because it was simply faster than anything else you could get.
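(Editor's aside, not part of the talk: to make the NVMe presentation concrete, on a Nitro instance both EBS volumes and local instance storage show up as NVMe controllers, and the model and serial strings they report are enough to tell them apart. A minimal Linux sysfs sketch; the model strings used here match commonly documented values but should be treated as assumptions.)

    # list_nvme_devices.py -- tell EBS volumes and local instance storage apart on a Nitro instance.
    from pathlib import Path

    def nvme_inventory():
        """Read the model/serial attributes that the Nitro cards expose via NVMe."""
        for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
            model = (ctrl / "model").read_text().strip()
            serial = (ctrl / "serial").read_text().strip()
            if model == "Amazon Elastic Block Store":
                # The serial is the volume ID without the hyphen, e.g. "vol0123...".
                kind = f"EBS volume (vol-{serial[3:]})"
            elif "Instance Storage" in model:
                kind = "local NVMe instance storage"
            else:
                kind = f"other NVMe device ({model})"
            print(f"{ctrl.name}: {kind}")

    if __name__ == "__main__":
        nvme_inventory()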
The introduction of ten-gig networking really changed this, because suddenly the network was capable of meeting, and often exceeding, the performance of storage, which made it attractive to let customers benefit from all of the other features of remote storage. The interesting thing that has since changed in the storage industry, though, is the introduction of NVMe. Spinning hard drives historically haven't changed much in throughput over the years because you're limited by physics: there's only so fast you can spin that media before it literally shatters, so there's a ceiling there. When SSDs were first introduced as SATA SSDs, they largely reused the transports and protocols that spinning drives had introduced, so even though the NAND was much more capable, the controller interface became a major bottleneck. With NVMe, that same NAND technology was placed directly on PCIe, which really opened up the possibilities for much greater bandwidth and much lower latency. The i3 instance type is capable of driving 16 gigabytes per second of throughput to the drives attached to it; that's the equivalent of 128 gigabits of networking, and we introduced it almost two years ago at this point. So we're now back in a world where local storage performance can far exceed what the network is capable of, and that's really what led us to introduce this capability again, because you can drive truly tremendous performance from local storage.

Now, there's one more property of the Nitro card for instance storage that ends up being really important but is easy to miss, and that's drive monitoring. One characteristic of NAND, the great one, is that it doesn't explode; that's good, and really important in a data center. The second, not-so-great characteristic is that it wears out over time. An individual NAND cell has a finite number of writes before you can no longer tell the difference between a 1 and a 0, so modern drives have far more NAND than is presented to you, and complex controller firmware dynamically decides where to place writes so the wear spreads evenly across the entire drive. Usually this works great: you can't tell it's happening, and it happens really fast. But eventually the NAND gets worn out, and it becomes harder and harder for the controller firmware to find NAND that isn't worn. When that happens, you typically start taking really long pauses as the controller rearranges data and hunts for places to put new writes, and what you actually see in practice is drive performance dropping off a cliff, a really significant degradation. That's not what you want to happen to your application, and it's not something most people expect, so it's not an error condition they design for. This is one of the many things we monitor continuously in the Nitro card for instance storage: we know what the state of the NAND wear is on every drive, so we can proactively get those drives off of customer instances before it becomes a problem for customers.
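(Editor's aside, not part of the talk: Nitro does this wear monitoring for you in the card; on hardware you manage yourself, the closest readily available signal is the NVMe SMART "percentage_used" estimate. A rough sketch that shells out to nvme-cli; it assumes nvme-cli is installed, the field name matches your nvme-cli version's text output, and the device path and 80% threshold are arbitrary choices.)

    # nand_wear_check.py -- approximate the wear monitoring described above using nvme-cli.
    import re
    import subprocess

    def percentage_used(device: str = "/dev/nvme0") -> int:
        """Parse the NVMe SMART 'percentage_used' field (an estimate of NAND life consumed)."""
        out = subprocess.run(["nvme", "smart-log", device],
                             capture_output=True, text=True, check=True).stdout
        match = re.search(r"percentage_used\s*:\s*(\d+)\s*%", out)
        if not match:
            raise RuntimeError("could not find percentage_used in smart-log output")
        return int(match.group(1))

    if __name__ == "__main__":
        used = percentage_used()
        print(f"NAND life consumed: {used}%")
        if used >= 80:
            print("warning: drive approaching wear-out; plan replacement or data migration")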
The unique characteristic of the Nitro system is that this same capability is available both for virtualized instances and bare metal instances, because it all happens in hardware; we're not relying on a hypervisor to do that monitoring.

Now, as I mentioned, we do a lot of monitoring of the platform to ensure we're delivering a consistent performance experience, and all of that monitoring has to happen somewhere. The thing that coordinates it is the Nitro controller. Whereas a traditional hypervisor is a sort of vertical application, where all of the software is pulled together and runs on a single system with a shared fate, the Nitro system has a lot of different components and is more of a network application; you can almost think of it as a small cluster of compute resources. The Nitro controller is the brain that controls all of these resources, makes sure they're configured correctly, and presents a unified control API to the EC2 control plane. In addition to working with all of the Nitro cards, the Nitro controller also interacts with the Nitro security chip and the Nitro hypervisor to coordinate the entire system, and ultimately implements the hardware root of trust.

Now, I mentioned I wanted to talk a lot more about the Nitro security chip. The Nitro security chip is a custom microcontroller that we put on the system. It sits in front of every bus that reaches non-volatile storage, and it lets us make sure that storage is in the state we want it to be in. It's also connected to the Nitro controller, it's ultimately controlled via that mechanism, and it provides a simple hardware-based root of trust. But why do we even need this? What does "non-volatile storage in the system" mean? In addition to the Nitro cards and the other components, every modern computer is really composed of a bunch of different microcontrollers. As I mentioned, the drive controllers run complex software, but there's also software deciding how fast to spin the fans to cool the system, a memory controller monitoring the memory bus and setting the right frequency at any given time, and a large variety of other devices. They range in complexity from something as simple as the equivalent of an Arduino to something as advanced as a Raspberry Pi running a full-blown Linux OS; in fact, chances are your systems today are running multiple operating systems and you don't even know it. All software can have bugs, so all of these devices need a mechanism for upgrading their software, and that mechanism usually means a small bit of flash, sometimes shared in a single bank, sometimes in multiple banks, but ultimately there is always non-volatile storage somewhere in the system storing that embedded processor's code. Now, the only way you typically interact with a server is through the operating system, so it stands to reason that if you, as a user on a traditional server, want to apply those bug fixes, there has to be a path to write to that non-volatile storage from your x86 processor, from Windows or Linux or something like that.
People don't tend to think about this when they think about hypervisors, but one of the really important features of a hypervisor, besides protecting instances from each other and from the host OS, is providing an isolation layer between all of these underlying hardware devices and the instances themselves, so that nothing you do in an instance can change, say, what the BMC firmware is, or anything of that nature. Now, to understand why we had to build something here, I want to talk a little about what the rest of the industry does to provide this type of protection, because it shows why we felt we had to do something different.

UEFI is the modern version of firmware that is probably running in most of your systems today. Prior to UEFI, most servers and even laptops were running a legacy BIOS, and that BIOS is typically, to be honest with you, pretty horrendous: there are usually large portions of it that are 16-bit. You might not know this, but GCC can't actually compile that kind of 16-bit code, so in virtualized systems that have BIOSes, for a long time we had to use a different compiler than GCC to build the BIOS, and the only open source compiler available was something called BCC, which only supports K&R C. If you don't know what K&R C is, you're very lucky; it's a very old dialect of C that nobody knows anymore. But this is how BIOSes are written, and it's just not where you want to be from an engineering standpoint. So UEFI is a breath of fresh air: it brings modern development tools, it's a 64-bit environment, it uses modern practices, so overall it's a really great thing and a big advancement for the industry. That said, it's a lot of code. Depending on how you measure it, a UEFI build that you'd put on a server could easily be tens of millions of lines of C code. I've been an engineer for a pretty long time, and there's no way you'll convince me there are no bugs in that much code; that's just the nature of what we do. But this is unavoidable; it's an important thing for the industry overall, so UEFI has a role to play.

So how does UEFI solve this problem? Through a mechanism called secure boot. The first observation is that UEFI runs on the general-purpose processor. How does it know it isn't already running on a system that has been modified? How does it know it wasn't loaded by something, say a BMC, that's running code it shouldn't? It doesn't know; it can't know; you start out in an untrusted state. So what UEFI does is start a measurement and signing procedure: it walks through and checksums every component it can identify, and because UEFI is such a large system it has a lot of modularity to it. You start with early firmware, you move to the boot manager, then to the various applications. You might ask yourself, what is a UEFI application? Well, the best one I've seen so far is Doom; somebody ported Doom to run on top of UEFI. I'm not really sure why you need applications, but they're there. There are also UEFI drivers, and those often come from the cards themselves, so it's code that isn't even coming from the main BIOS vendor. Eventually you sign all these things and you get to the operating system. There are a whole bunch more steps, but I ran out of room on the slide, so this is all you're going to see today; you could easily do a two-hour presentation on how this works under the covers. Ultimately, what you end up with is a signature, a checksum really, and that gets stored in another chip in the system that has some physical tamper resistance. Then, remotely, you can query the system and prove that the checksum matches what was reported to that chip.
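(Editor's aside, not part of the talk: the "checksum that gets stored in another chip" is essentially a hash-and-extend chain. A toy sketch of that idea; the stage names and blobs below are made up, and real measured boot extends TPM PCRs with hashes of the actual firmware images.)

    # measured_boot_sketch.py -- the "hash-and-extend" idea behind measured/secure boot.
    import hashlib

    def extend(register: bytes, component: bytes) -> bytes:
        """Fold a new measurement into the running register, TPM-PCR style:
        new_register = SHA256(old_register || SHA256(component))."""
        return hashlib.sha256(register + hashlib.sha256(component).digest()).digest()

    # Hypothetical boot stages, measured in order; any change to any stage
    # (or to the order) changes the final value.
    stages = {
        "early_firmware": b"...platform init code...",
        "boot_manager":   b"...UEFI boot manager...",
        "uefi_driver":    b"...option ROM from a card...",
        "os_loader":      b"...kernel + initrd...",
    }

    register = b"\x00" * 32
    for name, blob in stages.items():
        register = extend(register, blob)
        print(f"after {name:15s}: {register.hex()[:16]}...")

    # A verifier that knows the expected blobs can recompute this chain and
    # compare it with the value reported by the tamper-resistant chip.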
Now, this is a system that is academically sound: if you look at it and review it, lots of smart people have worked on it, and it is correct. However, what matters is what it's telling you, and what it's telling you is that as long as all of this software is functioning correctly, then you're running that software on the system. And as I said at the beginning, I just don't believe that all of this software and all of this complexity is going to be correct. So when we were thinking about how to support bare metal, the real question we asked ourselves was how to solve this problem in a way that is simple, easy to understand, and that we all feel confident is going to be correct and not prone to error. The answer we came up with was the Nitro security chip.

The Nitro security chip is based on a very simple premise: we don't actually care about what the rest of the industry cares about. It's not necessary to allow the general-purpose processors to update all of those embedded controllers. So what we do with the Nitro security chip is very simple: we don't allow the general-purpose processor to write to those controllers at all, and instead we provide ourselves a path to do those updates from the Nitro controller itself. This ends up being a very simple mechanism, because there's a well-defined interface to these individual devices; you block writes once, and it applies to every single component in the system. I fundamentally believe the best security is simple to explain, and by departing from the requirements and restrictions that general-purpose systems have, we were able to really simplify the problem. The downside of this architecture is that we'll never be able to run a Windows graphical update utility to apply a BMC software update. That's okay; I don't care; I like this a lot better.

Once we have the ability to manage the system from outside the general-purpose processors, when it comes to designing a hypervisor we can remove almost all of the functionality, and that's what led to the Nitro hypervisor. The Nitro hypervisor is based on KVM, and KVM is a part of Linux, but it's not what you'd expect from a Linux installation: there is no systemd, there's no SysV init, there's not even a BusyBox. There's just a small number of userspace applications written specifically for this purpose, and all they really do is set up the right hardware bits, kick off the hypervisor, and get out of the way. The overall design goal for the Nitro hypervisor is to be quiescent, and what we mean by quiescent is that the hypervisor never executes on any core unless it's doing work that the instance requested, on behalf of that instance. This architecture lets us keep the Nitro hypervisor nice, simple, and easy to reason about.
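(Editor's aside, not part of the talk: a practical corollary is that because the Nitro hypervisor replaced Xen on these instance families, you can usually tell which world an instance lives in from a couple of DMI/sysfs values. A best-effort sketch; the paths and values follow AWS's published guidance for identifying EC2 Linux instances, but treat them as assumptions that may vary by generation.)

    # which_platform.py -- a rough guess at whether this EC2 instance is Nitro- or Xen-based.
    from pathlib import Path

    def read(path: str) -> str:
        p = Path(path)
        return p.read_text().strip() if p.exists() else ""

    def guess_platform() -> str:
        sys_vendor = read("/sys/class/dmi/id/sys_vendor")   # "Amazon EC2" on Nitro
        xen_uuid = read("/sys/hypervisor/uuid")             # starts with "ec2" on Xen-based EC2
        if sys_vendor == "Amazon EC2":
            return "Nitro (virtualized or bare metal)"
        if xen_uuid.startswith("ec2") or sys_vendor == "Xen":
            return "Xen-based (pre-Nitro generation)"
        return "unknown / not EC2"

    if __name__ == "__main__":
        print(guess_platform())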
I want to talk a little about what that means for you and for an application, and I think the best way to do that is to show some data from a benchmark. Earlier this year I was working with a customer who wanted to bring a real-time workload into EC2. Real-time workloads generally have a bunch of threads or some kind of application, and there's some kind of event in the system: that event could be a timer, it could be a network packet coming in, it could be a sensor on an arm with a laser where, if you don't respond within a certain period of time, you blow a hole in the side of the wall or something like that. These are systems that cannot tolerate latency in responding to those events. In this case, the customer had an application that has to process a network packet within 150 microseconds of receiving it, because the protocol used within their system expects a response within a fixed period of time, or everything stops working. It's not good enough to do this 99% of the time; they needed 100% of the time. Within the world of hypervisors, real-time is kind of the Achilles' heel; it's the thing everybody ignores and says, oh, just use bare metal for that. Since we had i3.metal, my first instinct as a hypervisor developer was: okay, we'll use bare metal and we can satisfy your workload. But the customer actually pushed me to try C5, so I said, sure, let's give it a try and see what happens, and we benchmarked C4, C5, and bare metal for this particular workload.

Bare metal is the red line at the very bottom, and it behaves exactly the way I'd expect bare metal to behave: pretty flat, pretty consistent. There's a very small spike at the very end, in the very high percentiles, and that's most likely system management mode (SMM) code. Even on bare metal, the BIOS usually has a little hook that occasionally runs for a variety of reasons, and most likely this was a small number of events where that code was running and the system had to wait to get the CPU back. There are plenty of bare metal boxes where SMM code can cause really bad jitter, but we control our BIOSes, we build our own, and we've known about this problem for a long time and made sure our systems don't have that characteristic.

The yellow line is C4, and I actually think that's a beautiful result for a hypervisor. I suspect that if you took your own local hypervisors, or other cloud providers, and ran the same benchmark, you would not see a line this consistent; it's really hard to get right. One of the ways we get it right on C4 is that we limit our control plane software to 10% of the cores on the platform, so as a customer you only get to run on 90% of the available cores, because we're trying to isolate our work from your application. But even with that mechanism in place, you can see that we only meet the SLA at about the 70th percentile; we get close, but at the high percentiles it shoots up to 750 microseconds. I still think that's a great result, but it doesn't meet the needs of their application.

Now, what was surprising to me, but is obvious in retrospect (I've just spent too many years doing this and I'm biased even about the things I build), is the blue line, which is C5. What you see is a very consistent small adder on top of bare metal. That's expected: there's more hardware involved in delivering an interrupt through a hypervisor, you have to go through the IOMMU, and a few other things come into play, so there will always be a small adder compared to bare metal. For most applications it doesn't matter; it's a couple of microseconds at most. The surprising part, though, is that you don't see much difference in high-percentile latency. Even at the p100 the spike is very small, pretty close to bare metal, and it stays completely below the SLA line. This was a really exciting result for me, because it tells me that applications that previously couldn't even consider being virtualized can be brought into Nitro instances, and ultimately, as a developer, that's what I'm trying to do: allow new workloads to be brought into EC2.
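(Editor's aside, not part of the talk: the percentile-style analysis behind that chart is easy to reproduce in miniature. The sketch below measures how late timer wakeups fire on whatever host it runs on and reports p50/p90/p99/p100; Python adds plenty of jitter of its own, so this only illustrates the methodology, not the numbers from the slide.)

    # jitter_probe.py -- measure how consistently this host wakes a sleeping thread.
    import statistics
    import time

    def measure_wakeup_jitter(interval_us: int = 100, samples: int = 50_000) -> dict:
        """Sleep for a fixed interval repeatedly and record how late each wakeup is (in microseconds)."""
        lateness_us = []
        for _ in range(samples):
            start = time.perf_counter_ns()
            time.sleep(interval_us / 1_000_000)
            elapsed_us = (time.perf_counter_ns() - start) / 1_000
            lateness_us.append(elapsed_us - interval_us)
        q = statistics.quantiles(lateness_us, n=100)
        return {"p50": q[49], "p90": q[89], "p99": q[98], "p100": max(lateness_us)}

    if __name__ == "__main__":
        for name, value in measure_wakeup_jitter().items():
            print(f"{name}: {value:8.1f} us late")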
I want to talk a little about how we're using Nitro in a number of the things we've launched recently. As I mentioned, this past year we introduced a number of instance types with local storage, and it's the same local storage that is available in i3. Specifically, R5d has approximately half the storage throughput of an i3, but every drive is effectively the same as what you'd get in an i3 in terms of throughput and capability. So this is not M1- and C1-era local storage; this is incredibly fast, high-performance local storage. Again, it's really optimized for customers that need scratch space or are already doing replicated storage, and it's very likely that, moving forward, we'll offer instance storage variants for most of the platforms we build, as long as there's appropriate customer demand.

We announced a preview of i3.metal at last year's re:Invent, and earlier this year we launched it as a GA product. This is our first bare metal Nitro platform, and it's been a super exciting platform for me because, again, it allows new varieties of workloads. The most interesting workload this platform has enabled is VMware Cloud on AWS; that's been an amazing partnership, and there are lots of exciting things we've built as part of that offering. The other thing that's been really interesting to me is seeing other workloads that use virtualization but aren't what we'd think of as traditional hypervisors. Something this platform has become pretty popular for is Android streaming: Android emulators can take advantage of KVM for optimization, and we've found a lot of customers running many Android sessions on a single i3.metal instance to serve content remotely to their users. Another use case, really dear to my heart, that we'll get to in a bit is what I like to call micro-VMs, a new way of thinking about serverless computing.

Now, by simplifying the hypervisor and removing a lot of the logic that used to have to live there, we've made it much easier to swap out the general-purpose processor. Earlier this month we launched a series of AMD-based instance types. There's been a lot of exciting things happening in that space; the EPYC processor is super interesting in terms of the number of cores available, and this ultimately lets us deliver cost savings to customers. If you have a workload that requires a certain level of performance but doesn't require an Intel-based processor, there's an opportunity here to lower costs in a very compatible and easy-to-use way.
Now, an instance type we launched earlier this week that has also been really exciting is C5n. This uses the latest generation of the Nitro chip and is a kind of preview of what's to come in the Nitro space. It's a platform that allows up to 100 gigabits of networking throughput, and the other really interesting thing is that we've focused hard on optimizing latency in this new generation of Nitro chip. So if you have a workload doing HPC or distributed machine learning, where there's a lot of inter-node communication, you can see really tremendous performance improvements on this platform. The other thing it enables is really high throughput to S3. A lot of applications are compute-intensive but need to pull in a lot of data to do that compute, and the data ingestion or data push phases of those workloads end up dominating the overall time. With C5n you can significantly cut down the time it takes to pull large data sets from S3, thanks to the much higher throughput available on this platform.

I talked a bit about how the Nitro hypervisor lets us introduce new CPU architectures more easily, and one of the things I'm incredibly excited about that we announced this week is the A1 instance type. A1 uses the AWS Graviton processor, which is based on Arm and is in the same family of processors we use on the Nitro cards themselves. So it's a platform that I, as an engineer, have been spending most of my time developing on, and I can tell you firsthand that you can build amazing things for it: if you take advantage of the right things, you can build software that drives really high I/O performance with far less compute than you would typically expect. Even without completely redesigning your application for this architecture, we found that for certain workloads, particularly scale-out web applications, you can already get a 45 percent cost saving by moving the workload largely unchanged to this platform. As part of the launch we already support Amazon Linux, RHEL, and Ubuntu, so you can largely pick up and do things the way you've always done them. I'm really excited about this platform, and particularly about working with customers to build applications that really take advantage of its unique capabilities.
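(Editor's aside, not part of the talk: launching an A1 instance looks exactly like launching anything else; the only real change is picking an arm64 AMI. A minimal boto3 sketch; the AMI ID, region, and instance size are placeholders.)

    # launch_a1.py -- launch an Arm-based A1 instance the same way you'd launch any other type.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical arm64 Amazon Linux 2 AMI
        InstanceType="a1.xlarge",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "graviton-test"}],
        }],
    )
    print(response["Instances"][0]["InstanceId"])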
Now, we've talked a lot about instances and virtual machines, which makes sense because I'm a virtualization guy, so that's what I like talking about. But a lot of our customers today are using containers or some form of serverless to run their workloads; in fact, I'd go as far as saying that most new applications are being built and delivered as containers or as some kind of serverless workload like an AWS Lambda function. So within the last year we started asking ourselves: can we apply the same lessons learned in Nitro to the serverless space? What we came up with is Firecracker, a project we announced earlier this week. What Firecracker attempts to do is bridge the gap between virtualization and serverless computing. It's a small industry and I know a lot of the folks who work on containers, and there's always been a bit of a healthy rivalry between containers and hypervisors, really because there's a trade-off: the historic trade-off has been security, isolation, and consistency versus speed, flexibility, and performance. Many of us have talked over the years about how great it would be to bring the two together, a common technology that gives you the same agility and speed as containers or function virtualization but offers the performance isolation and security guarantees that come from virtualization. This is our answer to that; this is our answer to how we provide the best of both worlds.

When designing something like this, security is of the utmost importance, and that was the primary thing we focused on, and unsurprisingly one of the ways I think about security is simplicity: Firecracker only contains the things necessary for running serverless workloads and nothing else. The other thing is speed. If you want to run a Lambda function that executes for 500 milliseconds, you don't want to wait 10 seconds for it to start; that's extremely wasteful. If you're going to boot Windows and it's going to apply a pile of software updates during boot and take 15 minutes, you probably don't care about a 10-second startup; it's a different world with different trade-offs. The other big trade-off is scale and efficiency. In a dense hypervisor deployment it's typical to see tens to maybe a hundred, at the maximum a couple hundred, virtual machines on a single physical server; if you're running 200 virtual machines on a server, you're doing a pretty good job. However, today it's not uncommon to find commodity servers with upwards of a terabyte of memory, and a Lambda function can run in a footprint as small as 128 megabytes; a hundred VMs is not even close to the scale needed for these workloads. So Firecracker also focuses on a minimal footprint and the ability to scale to very large numbers. Earlier this week I showed a demo of running 4,000 Firecracker VMs; they all started within a minute, and the overhead of the Firecracker runtime is under five megabytes per microVM.

Now, this is a really emerging space with a lot of community activity, so from the beginning we thought that the thing needed to make this successful was to do it as an open source project, and I'm very happy to say that as part of announcing Firecracker we also released it as open source on GitHub. We announced it on Monday in Peter's keynote, and I was thrilled to wake up the next day and see a bunch of pull requests already sent for Firecracker. It's been great to see the response; I've already had a number of great conversations with the community. If you saw me sitting on my phone before this talk, I was actually reading Twitter about Firecracker and following the conversations. It's a really exciting space, and I'm really excited to see what the community builds with us.
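(Editor's aside, not part of the talk: to give a feel for how small the Firecracker control surface is, a microVM is configured and started with a handful of PUTs against a REST API on a Unix socket. The sketch below uses only the Python standard library and the endpoints documented in the Firecracker getting-started guide; the socket path and the kernel/rootfs paths are placeholders, and a firecracker process must already be listening on that socket, e.g. `firecracker --api-sock /tmp/fc.sock`.)

    # start_microvm.py -- drive Firecracker's REST API over its Unix socket with the stdlib only.
    import http.client
    import json
    import socket

    SOCK_PATH = "/tmp/fc.sock"   # placeholder; must match --api-sock

    class UnixHTTPConnection(http.client.HTTPConnection):
        """HTTPConnection that talks to a Unix domain socket instead of TCP."""
        def __init__(self, sock_path: str):
            super().__init__("localhost")
            self.sock_path = sock_path
        def connect(self):
            self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self.sock.connect(self.sock_path)

    def api_put(path: str, body: dict) -> None:
        conn = UnixHTTPConnection(SOCK_PATH)          # fresh connection per request
        conn.request("PUT", path, json.dumps(body), {"Content-Type": "application/json"})
        resp = conn.getresponse()
        resp.read()
        print(f"PUT {path}: {resp.status}")
        conn.close()

    # Size the microVM, point it at a kernel and root filesystem, then start it.
    api_put("/machine-config", {"vcpu_count": 1, "mem_size_mib": 128})
    api_put("/boot-source", {"kernel_image_path": "vmlinux.bin",
                             "boot_args": "console=ttyS0 reboot=k panic=1"})
    api_put("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": "rootfs.ext4",
                               "is_root_device": True, "is_read_only": False})
    api_put("/actions", {"action_type": "InstanceStart"})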
The other thing I'll mention is that, because security was so important for Firecracker, we looked at tooling for achieving stronger security guarantees than you would typically have in a hypervisor, and we've spent a lot of time on ways to ensure we're very safe in terms of memory usage. In the Nitro system we apply a lot of techniques: formal verification, penetration testing, coding guidelines, things like that. What's interesting about Rust is that it's really the first language to appear that has strong memory-safety properties built into the language and enforced by the compiler, without the performance trade-off of introducing garbage collection. For something like Firecracker, where we're measuring runtime at the p100 in milliseconds, a multi-millisecond garbage collection pause isn't acceptable, so any garbage-collected language is simply not appropriate for this type of application. It's been amazing to work with this relatively new language and see how well it solves the problem. The other thing to call out about Firecracker is that even though we're using it today in Lambda and Fargate, it's still very, very early days. Open source is near and dear to my heart, and I firmly believe the best way to build an open source community is to have something that's useful from day one but still has a lot of work to do, and that's absolutely where we are with Firecracker. So I'm really looking forward to continuing to work with the community to turn this into something that all of our customers can just take, run, and do amazing things with.

As I mentioned at the start, I want to talk a little about what's coming next. Last year I did a similar slide, and I made the statement that our roadmap really comes from the feedback of customers just like you; 90 percent or more of our roadmap comes directly from our customers. One thing our customers have been asking for, for a really long time, is to have EC2 capacity physically closer to them. Many of them have legacy data centers running, say, a legacy database or some workload that really needs super low latency, but they want all of the experience of using EC2 and AWS; they don't want to have to think about wear-leveling of drives or any of the other things we do on behalf of customers. The problem is that while our customers have been giving us that feedback for a long time, we didn't know how to solve it. Obviously the hypervisor is by far the most important part of the cloud; being a hypervisor developer, that's my worldview, so of course that's the right answer. But the reality is that if I gave you one of our servers today and let you take it home, it would be a really expensive paperweight. There's not much you could do with it, because while hypervisors are super cool and obviously the best thing to work on, it takes a lot of control plane services to actually do things with that server: to predict when a drive is going to wear out (all the card knows is how many writes have happened; on its own it doesn't know when the drive is going to fail), and so on. There are literally hundreds of services that all play a role in making EC2 the experience our customers have, and all of those services are built to run across multiple AZs for high availability; they have dashboards, they have canaries, they get deployed on a regular basis. There's just no way to take everything that makes an EC2 instance what it is and shrink it down into a little stack that runs on one or two virtual machines; it's simply impossible. You can make something that kind of, sort of feels like it, but it never will be it; there will always be an uncanny valley of experience.
However, we had a bit of an epiphany this year. We realized that if we took Nitro, with its very passive and very simplified design, and combined it with the same underlying technology that powers PrivateLink, we could take individual Nitro servers and expose them back to our control plane, and really give you the best of both worlds: the same physical hardware that we use in our data centers, running the same software we use in all of our data centers, wherever you need that hardware to be, while still getting the benefits of the AWS control plane and everything we do to keep all of those instances happy and healthy. This is a really, really exciting project. It is really early; we wanted to let customers know what was coming, which is why we pre-announced it, but I think this is going to be a super interesting space. I'm really excited about it and really excited to see what customers do with it. Thank you all very much. If anybody has any questions, there's a mic over here and we have about ten minutes.

Audience question: Recently you announced the Elastic Fabric Adapter. How does the Elastic Fabric Adapter intersect with the Nitro hypervisor?

Thank you for asking that; I did one of these sessions earlier today and talked about the Elastic Fabric Adapter, and I forgot to in this deck, so you saved me from getting into trouble. The Elastic Fabric Adapter is part of the Nitro card for VPC; it's another personality we can use to present a networking interface to our customers. Instead of providing a traditional Ethernet interface to guests, the Elastic Fabric Adapter provides an RDMA interface, or an RDMA-like interface. What this allows is for applications built for things like libfabric or Open MPI to be brought into EC2 and largely just work. The other big benefit of the Elastic Fabric Adapter is that it is strongly optimized for latency, so you can get the lowest possible latency between two instances by using EFA. The other really exciting thing about EFA is that it's built on a networking primitive we created called SRD, and SRD gives you a lossless RDMA-style experience in an elastic way. If you've used RDMA before, you might know that it typically requires very specialized connectivity fabrics, and those fabrics simply don't scale; you can't build clusters of thousands of nodes, that's just not how the technology works. One of the really exciting things about EFA is that it scales elastically, in the way our customers have come to expect.
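(Editor's aside, not part of the talk: the libfabric/Open MPI point means existing MPI code can exercise EFA without modification. A tiny mpi4py ping-pong sketch; run it under an MPI that was built with the EFA libfabric provider for the numbers to reflect EFA, otherwise it just measures whatever transport your MPI picks. Launch with something like `mpirun -n 2 python pingpong.py`.)

    # pingpong.py -- a tiny latency probe between two MPI ranks.
    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    assert comm.Get_size() == 2, "run with exactly 2 ranks"

    payload = b"x" * 8
    iters = 10_000

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            comm.send(payload, dest=1)
            comm.recv(source=1)
        else:
            comm.recv(source=0)
            comm.send(payload, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        # Each iteration is one full round trip; report the one-way estimate.
        print(f"half round-trip latency: {elapsed / iters / 2 * 1e6:.1f} us")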
Audience question: I had a question about the simplification of the hypervisor with Nitro. You mentioned that the hardware enables a simplification; in what ways, what are the major things that are simplified in the Nitro hypervisor? And also, how lightweight can it be made in terms of memory footprint? Nothing is zero footprint, and memory only comes in certain quanta, typically powers of two. So when you see something like an A1 instance with 32 gigabytes of RAM, is a hypervisor hiding in there somewhere, or is it backing the full 32?

I'll start with the first part, an example of simplification in the Nitro hypervisor. A good example is software updates. Think about what's involved in updating, say, a typical Ubuntu Linux installation. If you use Ubuntu, I'm sure you've run into a circumstance where you've tried to do an apt-get update or upgrade, something doesn't work quite correctly, and you end up having to drop down to dpkg or some other terrible thing. There's a lot of complexity there, and much of it is necessary complexity given the design of the system. The Nitro hypervisor can't do a software upgrade at all: it is always delivered from one of the Nitro cards as an in-memory image, so there's no need for an update mechanism, because we can simply change what that image is whenever we need to. This is just one of many mechanisms that are simply no longer needed in an architecture like the Nitro hypervisor.

The second part of the question was how small we can get it, and you can actually check. When we say an r5.24xlarge has 768 GiB of memory, that is the amount of memory in the underlying physical system, and we reserve some of it through a mechanism called the e820 tables, which goes back to the legacy BIOS but is still present even with UEFI. This is a mechanism commonly used by firmware; I mentioned SMM, system management mode, and this is a common way of reserving memory for things like that. The goal we set for ourselves is that the Nitro hypervisor should not have noticeable overhead compared to what you would normally expect from system firmware, and we fit within that today. But you can just go look: if you look at the Linux boot messages where the e820 tables are printed, you can actually see the memory that's being used by the Nitro hypervisor.
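(Editor's aside, not part of the talk: you can inspect that reservation yourself on a Linux guest. A small sketch that sums up the firmware memory map exposed under /sys/firmware/memmap, which is the sysfs view of the e820 tables; reading it may require root, and the exact type strings vary by platform.)

    # memmap_summary.py -- summarize the firmware memory map (the e820 view) on a Linux host.
    from collections import Counter
    from pathlib import Path

    def memmap_totals() -> Counter:
        totals = Counter()
        for entry in Path("/sys/firmware/memmap").iterdir():
            start = int((entry / "start").read_text(), 16)
            end = int((entry / "end").read_text(), 16)
            kind = (entry / "type").read_text().strip()   # e.g. "System RAM", "Reserved"
            totals[kind] += end - start + 1
        return totals

    if __name__ == "__main__":
        for kind, size in memmap_totals().most_common():
            print(f"{kind:20s} {size / 2**20:12.1f} MiB")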
Audience question: With the hundred-gig instances, is that only network traffic, or does it cover storage traffic as well? Is it shared between the two, or is there a separate physical connection for storage and network?

As I mentioned, the Nitro card for VPC is a separate card from the one we use for EBS, and C5n uses the first generation of the new 100-gig Nitro card. Today that's only available for networking, but as you can probably imagine, we're looking at using the same technology to accelerate other types of I/O, including storage, so I'm sure you'll see more from us there at some point in the future.

Audience question: Once in a while I wish I had a TPM chip attached to an EC2 instance, and your Nitro security chip reminded me of that. Is that a common request? I know in a cloud architecture you have KMS, which is external to the system, but is a TPM something people ask for, and is it going to happen someday?

One of the things I like about instances is that I don't have a TPM chip. A lot of customers have asked for TPMs, and usually we have a conversation about what you'd actually use one for. One thing that's different about an instance compared to a bare metal system is that there is no way of changing firmware in an instance. The whole secure boot mechanism exists because you start from an untrusted state and have to establish a secure state; but every EC2 instance has immutable firmware for the guest and starts from a known secure state at the very start of day, so a lot of the traditional use cases for TPMs just don't apply. I know there's software out there that wants to use a TPM, that really wants that interface; we're certainly open to building something like that, and we're really just looking for customer feedback about what they would do with it and what the right use cases are. As far as I'm concerned, the biggest use case for a TPM here would be supporting legacy workloads that expect it to be there; I don't think it provides a lot of value in a cloud-native application. But I'm always open to feedback, and happy to build it too.

Audience question: You just told us that the network latency is very low, so my question is: do you do direct passthrough of the I/O devices to the guest?

Yes. When you send a packet from an EC2 instance on Nitro, or write to a block device within a Nitro instance, the hypervisor is not involved in any way, shape, or form. When you do an lspci within that instance, you are seeing directly attached PCI devices, and the only thing the general-purpose processor is involved in is the IOMMU and the mechanisms necessary to do that passthrough. All of the I/O devices within a Nitro system are passed through.

Audience question: If you pass the devices straight through to the guest, and you just said that in the future you'll support containers, with a lot of containers per instance, do you have any virtual device for them?

Right, so the question was what we do about networking for containers. This is actually one of the reasons I'm so excited to have Firecracker as an open source project; it's one of the conversations I was having earlier today, about the right way of doing networking for containers. I could talk about this for a long time; there's a lot of interesting work happening in the container space. We announced a container networking service today, AWS App Mesh I believe is the name, which provides Envoy-based mesh networking; you can almost think of it as a Layer 7 version of VPC. It's really exciting stuff. So yes, we definitely want to provide the highest-performance experience to containers, but I also think container networking lives in a different world today than instance networking: it's less about Ethernet and IP protocols and a lot more about RPCs, REST interfaces, HTTP, things like that. It's a super interesting space, and I'm excited to explore it with the broader community.

Audience question: You offloaded many components from the kernel, but you still have some components on the main board, for example the system management controller. If you still rely on those components, how can you provide CPUs that are stable?

So the question was: there are components required to bring the system up, like the baseboard management controller; how do you make that work if you need to measure it first? The answer is: very carefully. It was a hard problem, but we figured it out. There are a lot of details there, more than I can go into in the minute we have left, but maybe one day we'll publish more information about it; I don't have any more specifics today.

Audience question: My request is that in the future you show, for a C5 instance, how stable the CPU jitter is, how this component works, and how you avoid stealing CPU time from the system.

So the question was how you get really low jitter if you still have work to do, and the answer is that you move that work off to other purpose-built hardware in the system and don't do it on the general-purpose processor.
I have time for one more question, so maybe the person behind you.

Audience question: How is Firecracker protected from Spectre and Spectre-like issues?

That's a great question, and it's one of the reasons I think security is so important in the container space. We have a document in the Firecracker repository with a suggested configuration for production hosts that includes all the necessary mitigations for the side-channel vulnerabilities discovered over the last 18 months. This is one of those areas where turning that into something easy to do and easy to consume is work we're going to have to do with the community over time. The information is there in the repository, but it is a fair bit of tuning; getting a really strong environment for multi-tenancy is a genuinely hard problem, and it's one of the things we're really interested in working on with the broader community. That's all the time I have. Thank you all very much. [Applause]
Info
Channel: Amazon Web Services
Views: 41,636
Rating: 4.9660058 out of 5
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Compute, CMP303-R1
Id: e8DVmwj3OEs
Length: 65min 6sec (3906 seconds)
Published: Thu Nov 29 2018