Pensando Systems Technical Overview

Captions
Okay, so my name is Francis Matus, and I'm the vice president of engineering at Pensando, and I'm going to give you a technical overview of the architecture that Prem and Soni spoke about. As a systems company we go from transistor all the way to RESTful API, so we're covering the entire stack, and I'm going to start at the bottom.

When we started the company, we started looking across the hardware landscape to see what the trends were, where technology was heading, and how we could exploit 16-nanometer technology, which gives you much better density and performance relative to previous technologies. So we started looking at this and figuring out how we could get this programmability while still giving you very, very high performance.

If you look at how we fit into this landscape, what I'm trying to show you here is a relative scale of cost versus flexibility and feature capability. We start with standard NICs. Those are the things that connect to servers today; sometimes they're LOMs, LAN-on-motherboard, sometimes they're adapters. They give you TSO, they give you RSS, they give you basic connectivity to a net device in Unix. The next level is what I would call, or what the industry is calling, smart NICs, and maybe some custom ASICs. These are basically standard NICs, but they have additional offload, perhaps a sea of Arm cores or a sea of processors, and in these architectures most of the offload is built around a flow table. These give you the capability to do things like OVS offload, maybe even rte_flow, where the flow table is the hammer and every protocol is the nail, so to speak. This obviously has a significant problem when it comes to scalability, in terms of the state explosion from the multiplication of different fields within the flow table; that's one thing. The second thing is that you have issues when you start to miss in these flow caches: then you have to go to software, and you start to get high jitter and high latency.

So what we wanted to do is bring the capability to provide flow tables in a flexible format, which is what P4 does, and also give you the ability to lay out pipelines, pipelines of multiple tables and multiple lookups, without burdening the performance. And that is where we are today. From a scalability and performance perspective we believe we're significantly better, as Soni was showing you in the previous slide.

The last thing I want to point out is that there are also FPGA-based solutions today. While these offer a significant amount of flexibility with respect to what logic can be laid down, the issues with FPGAs are around density: there's roughly a 20x density ratio between what I would call fixed-purpose logic and FPGA logic. Along with that you get very high power relative to the performance. So what you see is that FPGAs are cumbersome in that you need Verilog, let's say a high sophistication for programming, and then skilled operation for maintenance and sustaining.

With that, we're going to go into our design, Capri. Capri is the codename for the first-generation ASIC, built in 16 nanometer. What we've done is put in four independent P4 pipelines. These are very high-bandwidth pipelines that provide up to 24 stages of processing, and each stage can do up to four match-actions in a clock. That gives you a total of 384 outstanding matches in parallel, along with roughly 112 packets being processed independently at any one time. So, very, very high throughput: the pipeline is designed for 100-million-packet-per-second throughput, which is basically one packet every eight cycles.
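To make those parallelism figures concrete, here is a quick back-of-the-envelope check in Python. Note the clock frequency is inferred from the stated numbers (one packet per eight cycles at 100 Mpps), not a published specification.

```python
# Back-of-the-envelope check of the Capri parallelism figures quoted above.
# The ~800 MHz clock is inferred from "one packet every eight cycles at
# 100 Mpps"; it is an assumption, not a published spec.

pipelines = 4            # independent P4 pipelines
stages_per_pipe = 24     # up to 24 stages of processing each
match_actions = 4        # up to 4 match-actions per stage per clock

parallel_matches = pipelines * stages_per_pipe * match_actions
print(parallel_matches)  # 384 outstanding matches in parallel

pps = 100e6              # pipeline design target: 100 Mpps
cycles_per_packet = 8    # one packet every eight cycles
clock_hz = pps * cycles_per_packet
print(f"{clock_hz / 1e9:.1f} GHz")  # implies a ~0.8 GHz pipeline clock
```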
But because it's programmable, we've given you the ability to trade off feature richness versus performance, and that's a very important aspect of programmability. There's a typical curve that shows that as the number of instructions in a processing thread increases, the performance goes down. We give you that type of flexibility that software has, but because of the hardware pipelining and the hardware scheduling, we give you much better performance.

The other thing we've done is extend P4; we've brought a significant amount of innovation to P4. The first thing we've done is give our machine target a run-to-completion model. That again gives you the benefit of being able to trade off different dimensions of performance versus features or richness, and how much work you do per clock, or per stage I should say. So that's very important. We've also extended tables beyond a typical SRAM table that sits very close to the pipeline; we've extended them into HBM. This gives us the capability to do millions and millions of flows, millions and millions of routes; we'll talk about that in a bit when we talk about scale in space. The point I wanted to make here is that unlike a switch, which has very, very tight SRAM tables in a very constrained area, we are innovating by putting our memories in large DRAM. We've built in hardware mechanisms to make sure that we have lots and lots of outstanding reads, and that we take care of the read-after-write dependencies that you get in a pipeline when doing stateful processing.

Which brings me to the next point: we've also added the notion of stateful processing to P4, and that's really as simple as saying the pipeline can update state. So beyond counters and meters, you can do things like maintain TCP sequence numbers and update sequence numbers. This gives us the ability to terminate protocols like TCP or RDMA, and to do TLS proxy, much higher-level processing than just your standard L4 type of processing for switching and routing.

We've done all of this in a fairly low power envelope. As Soni was saying, at 25 gig we're a little under 20 watts, and at 100 gig roughly 30 watts depending on the workload; typical power is definitely under 30 watts. Most importantly, we've done this with very low jitter. The latency is around 3 microseconds, as she showed in her last slide, but the point I want to get across here is the jitter, and this is fundamental. When you compare this type of architecture to a flow cache, or even an offload that sits off the PCIe bus where most of the information is stored in host memory, every time you miss in that cache you need to go refill it. That can involve a software process, it can involve punting the packet to software and processing it on the CPU, or it could be some state machines that have to go do a cache-line fill. Whatever it is, it's going to introduce jitter. So a very important part of this design is having low jitter, which matters when you have very, very large numbers of flows, millions and millions, perhaps in a 5G-like application at the edge. And then lastly, doing all of this at very high bandwidth, meaning 100-gig rates, being able to process packets at hardwired speeds with the flexibility of software.
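As an illustration of the stateful processing described a moment ago, here is a minimal Python model of a stateful match-action step that tracks TCP sequence numbers per flow. The table layout and field names are invented for illustration; this is not Pensando's actual P4 code.

```python
# Minimal software model of a stateful match-action step: the table entry
# itself is mutable state that the pipeline updates in place. Field and
# table names here are illustrative, not Pensando's P4 program.

flow_state = {}  # key: (src, dst, sport, dport) -> next expected sequence number

def stateful_tcp_stage(pkt):
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])
    expected = flow_state.get(key)
    in_order = expected is None or pkt["seq"] == expected
    # Action: update state (next expected seq), not just match-and-forward.
    flow_state[key] = pkt["seq"] + pkt["payload_len"]
    return "forward" if in_order else "reorder/retransmit path"

pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234,
       "dport": 80, "seq": 1000, "payload_len": 460}
print(stateful_tcp_stage(pkt))   # forward
print(flow_state)                # next expected seq is 1460
```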
The right side is basically a block diagram of the chip with the main blocks highlighted; on the left is the P4 subsystem, essentially a switch to which we've brought these innovations. The last piece I want to say about P4 is that besides extending it in those three areas I discussed, we've also extended it to be able to do flexible DMA, and I'll get into that now, I think.

"Francis?" Yes, go ahead, Justin. "Yeah, I just want to clarify, and I think it's fairly obvious now, I wasn't sure early on, but this seems to be all about networking workloads, specifically around packet-based, switch-type things, going a little bit into layer-four activities. You mentioned things like being able to terminate TLS workloads directly on the card. But it is all specifically around that network side of functionality?"

Network and storage. We provide networking and storage services, and within that umbrella of message processing we provide security. That's what I'm going to talk about now, the different services that we provide. But yes, this is an Ethernet device: its interfaces are Ethernet on the bottom and PCIe at the top, and that gives us the ability to plug into a computer so that we can provide PCIe services to that computer. We have a very rich set of PCIe services, starting with DPDK and your standard netdev kernel driver. We can export NVMe devices, we can export virtio or VMXNET devices, we can be a storage offload; we have one customer where we're being used as a storage offload, and we can provide a storage offload device, or a crypto offload device, an embedded HSM device. The entire PCIe services mechanism is soft: we can define it when we boot the card and flexibly provide these services, which is very important. So that's what the host sees.

"Yes, let me jump in and say that on things beyond layer four, there is a keynote talk at the P4 Summit coming on the 29th, and I will go into the layer-seven-related elements in more detail there. So far we have implemented a TCP stack in the pipeline completely: segmentation, reassembly, timers, all those kinds of things, sending packets, originating packets, terminating packets. NVMe is also a stateful protocol, and that's implemented in the pipeline itself. TLS is another big thing that we have implemented. So it goes way beyond classical L4."

"Real quick, let me jump in for just one second. I feel like I've missed something and I just want to ask: is there more to this in how we leverage the functionality that this brings to the data center? Is there more than the management front-end that was talked about initially, and is there more than just the card that goes in the server? Is there software, is there something else that ties all this together, or is it literally just programmatically configuring the card and letting the flows flow through the card to handle all of those extra things? Does that make sense?"

Yeah, definitely. It's something we're going to cover later. "I can definitely stand down." So we're walking through the stack, starting at the very bottom with the hardware, and we're going to discuss all of the software. Clearly these Arm cores that I'm showing you here run a significant amount of software on them, first for orchestrating, configuring, and exposing these services.
"Okay, absolutely. Yeah, I just wanted to bring that up, because I'm just sitting here like I'm missing something; you're catching me up, so I'll stand down for now." Okay, thank you.

"Okay, quick question then, sorry. You have it down as the Arm A72. Do you have other options, or is it all based on the A72? Because that's a relatively old version of the Cortex range right now." So this chip was designed in 2017-2018, and back at that time the Arm A72 was the state-of-the-art Arm microarchitecture for embedded systems. It's a very high-performing architecture, relatively speaking; it gives you superscalar, out-of-order microprocessor behavior. I think it's a three-issue machine, might be greater actually, I can't remember the detail of the issue width. The Arms in our architecture, unlike other smart NICs', are really just there for the purposes of aiding the P4, doing some connection setup and some management; the bulk of the data processing is done in the P4 pipelines.

"Okay, so how do we compare to other pipelines, then? What would I compare them to if I was trying to benchmark what they're capable of? Because one of my questions when we were posting on Twitter is that the problem with specialized devices is generally their lifespan. I can go and buy an ASIC now, and other FPGAs, that do a single-purpose job very, very well, but they might last me six to 18 months. Blockchain is a great example: as the difficulty increases, custom hardware becomes obsolete. That's a very similar situation that generally happens with other FPGAs and things. So if a new compression algorithm comes out next year, or a new deduplication method, or the things that you are doing change, is this capable of moving with that? Can I reprogram it? Is it going to have a lifespan?"

Great question, yes. When it comes to protocol processing, which is what changes the most frequently, I would say, in terms of VXLAN versus Geneve versus different types of encapsulations and overlays, that is all handled in the P4 programmable processing. The hardware acceleration engines, like CRC, dedupe hashing, SHA-2/SHA-3, all of your block ciphers, AES-GCM, those kinds of things, those are actually hardwired engines in the device. If we were to come up with a new encryption algorithm that did some kind of new polynomial arithmetic, say, that would actually be a change in hardware, because to get those things to run at 100-gig rates and beyond, they must be hardware. Now, what we've done, as one of the services we offer, is that our PCIe interface can be bifurcated into an endpoint and a root complex. That gives us the ability, on that root complex, to hang off a small device if need be, an FPGA, sometimes called a sidecar, so that if you wanted to add a new encryption algorithm or a new compression algorithm, you could do that. If you wanted to add NVMe drives for the purposes of local NVMe virtualization, you could do that. If you wanted to add some kind of coprocessor that was doing the latest, greatest encryption algorithm or integrity-checking algorithm, you could do that. So we've added the capability in the architecture for it to be enhanced and scaled from the hardware perspective. But from a packet-processing perspective it's all in P4; for the most part, 99% of the bulk work is done in P4, and that is all soft. So we think that this architecture has long, long legs; it's not something that you will be replacing in 18 months.

"Francis, I have a question about different generations of your card in the same data center.
I mean, you start with this card, and in one to two years you will have a new one; you introduce something that is more powerful, that can handle more traffic, probably faster and everything. So did you think about a mechanism to maintain some sort of compatibility? I mean, also from the performance point of view, because there could be inconsistencies, especially because you're talking about working at line rate for some of these protocols, some of the encryption. So you are very fast, and the next generation will be faster, correct? How will you manage all this inconsistency?"

So the most important thing is, yes, the next generation will be faster, and it will take the natural progression like any silicon does, whether it's server hardware or an offload device; it will keep up with the next level of feeds and speeds, around 200 gig, around PCIe Gen 4, things of that nature. But more important is that the software, the entire stack that runs in the Pensando system, is backward compatible. That's a very important point; we made sure we paid special attention to it, so that all of the services, all of the offerings that we provide, go forward in our new designs as well. The difference that you will experience is in scale, both in space and in time, whether it's bandwidth or whether it's table size. So yes, lots and lots of thought put into that.

"And just one more, because this one's a bit of a hypothetical: if general-purpose quantum computing comes along tomorrow, are you dead?" Hmm, I don't think so. When you say comes along tomorrow... "Yeah, okay, let's say I can fill my Amazon data center with cheap general-purpose quantum computing. Because of all of the things that you can do with specialized hardware, quantum computing takes that problem away; the calculations of those things become a general processing problem." "Let me jump in, please: if applications go to quantum computing, the network will also require that level of quantum computing, which means that all these services would need to be offered at a much higher level. So we could leverage that technology to build our technology and our software and all those things using it. Instead of being scared of it, we could actually leverage it." "Okay, that's good. Awesome, thanks." Okay, thank you.

So what does all this mean for us? Basically, I think this is where we get to the question of what all this means for me as a user of your technology. Well, what it means is that you would purchase our management system, you would purchase our cards through a server vendor, like Soni was discussing earlier, you install these in your network, and you boot the system. Once you've booted the system, you get different levels of service depending on what you've enabled and what is available in the current software package. The services we're providing are around IO processing for network, storage, and security. Whether the networking is "I need a virtual bridge and a virtual router," whether I need access control lists, NATing, overlay networking, where we can do flexible encap and decap, being able to put VXLAN headers on, or MPLS over UDP, or Geneve, these kinds of things, being able to run an EVPN control plane for VTEP distribution, things of that nature, load balancing: all of these are networking services that we provide at very high rates and at very high scale, which is obviously very important as you begin to scale out your network.
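As a concrete picture of what "flexible encap" means here, the sketch below builds a VXLAN header in Python per RFC 7348 (8 bytes: flags, reserved, 24-bit VNI, reserved). This is a generic illustration of the encapsulation format, not Pensando's P4 code, and the VNI value is made up.

```python
import struct

# Generic illustration of VXLAN encapsulation (RFC 7348), the kind of
# header rewrite the P4 pipeline performs; this is not Pensando's code.

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend an 8-byte VXLAN header: flags(1) + reserved(3) + VNI(3) + reserved(1)."""
    flags = 0x08                      # "I" bit set: the VNI field is valid
    header = struct.pack("!B3x", flags) + struct.pack("!I", vni << 8)
    return header + inner_frame      # outer UDP/IP/Ethernet would wrap this

encapsulated = vxlan_encap(b"\x00" * 64, vni=5001)
print(encapsulated[:8].hex())         # 0800000000138900
```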
Built into all of this is security, whether it's booting the card with a secure boot mechanism, so that we know all of the software running on this card has been signed and verified, or whether it's cryptography, both the asymmetric and symmetric ciphers needed in TLS, which are all offloaded in hardware for very high performance rates, or whether it's a stateful firewall or microsegmentation. All of these security features, again, are implemented in the data path, written in P4, so they can be easily enhanced and easily upgraded.

And then one of the most important things that we bring at the edge of the network is the notion of observability. Today in a network, if you want to look at packets, you turn on something called port mirroring. That has to be done pervasively through all of the switches; typically there are limitations in top-of-rack switches today, and most shared-memory, output-queued switches will have limitations on the formats they can mirror and how precise they can be in what they capture; sometimes it's only sampled, as an example. What we can do at the edge is capture all of the packets and export them in a flexible way. Today our offering supports ERSPAN, and we have NetFlow capability. This gives operations very tight and very fine-grained visibility into what's happening in the network.

And then last but not least are the storage services; we have quite a few storage services. We talked about encryption for networking, around TLS, DTLS, IPsec, but for storage we have what's called XTS. This gives you data-at-rest encryption, which can be very useful if you want to encrypt block storage out to drives; we have a customer using it for that purpose. Along with this encryption we can do compression; we have LZ compression in the device, again for block-level compression. All of this is done at 100-gig rates, which is very important when Soni talked about the power of N: all of these features that we're talking about can be operating concurrently at 100 gig. In her first slide, where she showed the different levels of service that can be offered today in a network, those are endpoint devices which you typically trombone into and out of: I go to the firewall, I go to the load balancer, I go to the storage endpoint. Here, what we're doing is allowing all of these services to be chained at the edge, for very high performance and easier manageability. That's the whole idea of what we're trying to do by providing all of these scale-out services.

So to give you an example of the types of scale we're talking about... "Before you go on to that, back on security, you had a previous slide; can we talk about hardware root of trust a little bit? We had a question come in from Twitter." Sure. "Is the NIC platform able to act as a hardware root of trust for the server?" It is, it is. What we have is a technology called PUF, which stands for physically unclonable function. It is a way for the chip, at time zero of boot, to create a key based on some technology around SRAMs. What that does is give the chip a private key, or a private signature, that is a secret known only to the chip. From that key we can begin to derive new keys: storage root keys, for example, attestation keys; we can create public-private key pairs for signatures and for setting up TLS sessions. And so this becomes the root of trust with respect to all of our secure boot processing.
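To illustrate the idea of deriving purpose-specific keys from a single device-unique secret, here is a generic HKDF sketch in Python using only the standard library. Pensando hasn't published its derivation scheme, so the labels and construction here are purely illustrative.

```python
import hashlib
import hmac
import os

# Generic illustration of deriving purpose-specific keys from one
# device-unique root secret (the kind a PUF provides). The HKDF-style
# construction (RFC 5869) and labels are illustrative; Pensando's actual
# derivation scheme is not public.

puf_secret = os.urandom(32)  # stand-in for the chip's PUF-derived secret

def hkdf(root: bytes, label: bytes, length: int = 32) -> bytes:
    prk = hmac.new(b"salt", root, hashlib.sha256).digest()         # extract
    okm = hmac.new(prk, label + b"\x01", hashlib.sha256).digest()  # expand
    return okm[:length]

storage_root_key = hkdf(puf_secret, b"storage-root-key")
attestation_key  = hkdf(puf_secret, b"attestation-key")
print(storage_root_key != attestation_key)  # True: distinct per-purpose keys
```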
If you want to debug the chip, we have debug certificates that allow you to get into the debug window. So it's very much secure; all of our algorithms are NIST certified, and we are in the process right now of getting FIPS certified.

"I have a question as well. There are so many services that you're running on this card, obviously, so many widgets, and the more widgets you turn on, the more processing you need. So how do you actually work out what amount of processing you need, or what model of card? How do you size that? How do you measure it?"

Yeah, so it's a good question. First of all, you look at what your operating speeds are going to be: are you going to be running at 25 gig, at 50 gig, at 100 gig? That sets the bar for what the packet-per-second rate is. For example, at 100 gig the maximum packet-per-second rate theoretically possible is 150 million packets per second; that's the first metric you look at. And then you say: how many times do I have to process a packet for each service? If I'm doing four services, that means I need to do four operations, and I want to make sure that I have four pipelines, as a simple way to think about it, that I can run these packets through, each doing its own type of work in a pipelined fashion, so that you still maintain the throughput, which means you eliminate the dependencies. First maybe I do encryption, or maybe I do compression, and then I do checksumming, and then I do deduplication, and then I do compression; whatever that chain is, you want that chain to be in a pipeline, so that at any one point in time everything is overlapped and you see the full pipe running. And that really comes down to the internal bandwidth of the chip. The chip's internal bandwidth is 400 gig; the P4 pipelines run at 400 gig. For network connectivity we have, internally, that shared-memory output-queued switch which I showed you earlier, and a network-on-chip that gives you 3.2 terabits of bandwidth. If you're going to run 50 gig, you can do the math; that's basically eight times you can move a packet around while still maintaining line rate. So we've really built this design so that we can run a bunch of chains concurrently while still being able to meet line rate, which requires a significant amount of internal speed-up, I guess, to answer your question.
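A quick sanity check on those sizing numbers in Python. The 84-byte minimum wire frame (64-byte frame plus preamble and inter-frame gap) is standard Ethernet arithmetic, not a Pensando-specific figure.

```python
# Sanity check on the sizing numbers quoted above. The 84-byte minimum
# wire frame (64B frame + 8B preamble + 12B inter-frame gap) is standard
# Ethernet arithmetic, not a Pensando-specific figure.

line_rate_bps = 100e9
min_wire_frame_bits = (64 + 8 + 12) * 8          # 672 bits on the wire
max_pps = line_rate_bps / min_wire_frame_bits
print(f"{max_pps / 1e6:.0f} Mpps")               # ~149 Mpps, the "150M" figure

noc_bw_bps = 3.2e12                              # network-on-chip bandwidth
pipeline_bw_bps = 400e9                          # internal P4 pipeline bandwidth
print(noc_bw_bps / pipeline_bw_bps)              # 8.0: a packet can traverse the
                                                 # chip 8x while holding line rate
```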
"We can also do the traffic management for each of these traffic types. So if you say, don't allow more than 20% of storage traffic, 30% of Ethernet traffic, we can do all that kind of traffic management and traffic shaping, in both directions." Yep, a significant amount of metering, rate control, and QoS policy can be applied as well. "Maybe one clarification you want to touch upon: the acceleration-engine bandwidth is separate from the P4 part. I mean, you could do two or four functions in the chain without really counting against that eight that you were talking about, on the acceleration functions. Maybe you want to touch upon that?" Yeah, so what Vipin is saying is that our P4 processors are basically pipelined units that can have multiple threads outstanding. So you could have, concurrently running through the P4 pipeline at different phases or different stages, an RDMA application, a TCP application, a classic NIC application; you could be doing multiple different types of work concurrently on the different stages, which is inherently the parallelism that's in the design. And that's independent of the bandwidth on the accelerators, which are running at, you know, 100 gigabit or 200 gigabit, and independent of the NOC and the memory subsystem, which are running at 400 gigabit and beyond. "So there's no way I'm throttling it, then?" Yes.

"A question from me: in the storage world, for example, does the data need to come in via the network on the card, or could it come in from other interfaces and then be passed through for processing and back again?" Yes. So the typical flow would be, honestly, pick a storage application: you would have a host computer, like an x86, which would be running a volume manager or some type of storage controller, and it would pass the data to the card through PCIe. That data could be block data of variable size, 64K, 32K, whatnot, and that's done through DMA. That's a PCIe service we provide, as I mentioned earlier, where the host is running a driver and using it to get to the card, and then starting a chain of services programmed through the DMA rings.

"So I noticed you also support PCIe 3 and 4. Do we get significant benefits with an AMD-based architecture over Intel right now?" Well, AMD today is what has Gen 4, so yes, you do get a benefit: obviously Gen 4 will give you a little lower latency for the same bandwidth, or it'll give you twice the bandwidth through the same number of lanes. So yes. "Okay, good. Theoretically..." I'm sorry? "Theoretically, you could use this as a pure PCIe offload card to implement whatever functionality." Correct, that's exactly right, you could do that. That is a mode of the card, and we actually have a customer that does that for some of the chaining services. You could theoretically do it for storage; in fact, in practice we do. You could theoretically do it for TLS offload, for all of the asymmetric ciphers required for key generation, which are very, very compute-intensive; we run those at tens to hundreds of thousands of operations per second on the device.

"Do you have any use case where you need cards on all the servers involved in the communication? I mean, to get the advantage of some of the functionality, where you need a card on the server that is sending, and also on the other side, where you're receiving. Is there any functionality like that?" Oh, I see, yes. So there is some functionality where we would need to be on both ends; I think that's the way I read the question. We have enabled some RDMA extensions which, if you were to turn them on, would require Pensando devices on both ends. Those RDMA extensions are for the purposes of alleviating the need for PFC in the network, which can be cumbersome for large RDMA scale-out networks, and for a more advanced congestion-management algorithm. In that case you would need Pensando on both ends if you were to run those extensions, but other than that, there's no need per se. "Okay. All right, good questions."

Okay, so going back to the scale: this is just to give you a sense of the types of scale we're talking about, and again, these are all at the same time. It's not "I get some LPM or some ACL"; this is: I can give you an LPM route table, I can give you an ACL table, I can give you a flow-cache table, tunnel tables, endpoint-mapping tables.
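For readers less familiar with the table types being listed, here is a tiny longest-prefix-match (LPM) lookup in Python, the operation an LPM route table performs in hardware; the routes and next-hop names are made up for illustration.

```python
import ipaddress

# Tiny illustration of what an LPM (longest-prefix-match) route table does;
# the hardware does this per packet at line rate. Routes are made up.

routes = {
    ipaddress.ip_network("10.0.0.0/8"):  "uplink-1",
    ipaddress.ip_network("10.1.0.0/16"): "uplink-2",
    ipaddress.ip_network("10.1.2.0/24"): "local-vtep",
}

def lpm_lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return routes[best]

print(lpm_lookup("10.1.2.7"))    # local-vtep (matches /24, /16, and /8)
print(lpm_lookup("10.9.9.9"))    # uplink-1   (only the /8 matches)
```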
Just to give you a sense of the type of scale we're talking about: millions and millions, tens of millions of entries. And whether you're running one flow or you're running a million flows, the performance of the device doesn't change. That's very fundamental in the architecture and the design, and that again allows us to do this power of N.

"Looking at that, it looks like I could, if I were so inclined, build a router out of this." Absolutely. "Of course, then the question is, I get the packets in and I don't want to send them all through the same interface, so I need multiple of your cards in the system. Can I just connect them directly together, or does the whole thing have to go through the CPU and the main memory?" No, you could connect them together through a network; that's one way to do it. But you could also provide what's traditionally called a router-on-a-stick function, where you go in one interface and you come out another interface. That's a very popular model, and we support it. So we could sit off the side of a switch, or we could sit off the side of a computer, and you could come in and come out across the interface. That's a fairly standard model for implementing a router, and we certainly have use cases where that would work. "But could I use the PCIe bus as sort of the switch fabric?" You could, you could in theory, absolutely. You could connect a bunch of these to a PCIe switch and we would be able to make that work, absolutely. "Interesting. Historically, anybody who tried to do networking using a PCIe switch failed miserably; we saw this in 2000, 2005, 2010, 2015, people were trying to do it, and it didn't really work very well. So I think networking will probably be a better choice." That may be our bias, but that's the bias here, yeah.

Okay, so the last slide, and one of the most important slides, is how all of this comes together. On the right side you see the full architecture: the chip I just described, called Capri, is what is on this DSC, this distributed services card, which is a PCIe card that plugs into your server, or an OCP card. That card works with all the other servers in the system that have these cards to form a cluster, a cluster of Pensando devices that provide all of the different services we've been talking about. The way these services get orchestrated and controlled, how they get configured, how the user interfaces with them, is all through what we call the PSM, the policy and services manager. This is a microservices-oriented architecture built around Kubernetes, where the user interfaces to the PSM through a RESTful API. It is a very highly available, highly scalable architecture: we build these PSMs in clusters, where the minimum is three, which forms a quorum and gives us very high availability. We have extremely high scalability, today with a thousand DSCs, and we're moving to 3,000 very quickly. This gives you secure, let's say, lifecycle management for the purposes of upgrades and downgrades; it gives you security in terms of how the PSMs communicate with each other and how the PSMs communicate with the DSCs. So this whole management plane is intended to provide an operational model for enabling these services. It's all policy driven, which means you're basically mapping out a
declarative, or sometimes called intent-based, model, where you define your policies, whether they're security policies or telemetry policies. These policies then get taken in by the PSM, the associated DSCs get updated, and that's all done within the subsystem of the PSM and the DSCs.
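Pensando doesn't publish the PSM API shape in this talk, so the endpoint path, port, and payload fields below are hypothetical stand-ins. The sketch only illustrates the declarative, REST-driven model being described: the client states intent, and the PSM pushes the resulting configuration to the DSCs.

```python
import json
import urllib.request

# Hypothetical sketch of the declarative model described above: the client
# POSTs an intent-style policy to the PSM's RESTful API, and the PSM is
# responsible for programming the affected DSCs. The URL, port, and field
# names are invented for illustration; the real PSM API may differ.

PSM_URL = "https://psm.example.local:443/api/security-policies"  # hypothetical

policy = {
    "kind": "SecurityPolicy",          # declarative: describe the end state
    "name": "allow-web-tier",
    "rules": [
        {"action": "permit", "proto": "tcp", "dst-port": "443",
         "from": "web-tier", "to": "app-tier"},
        {"action": "deny", "proto": "any", "from": "any", "to": "app-tier"},
    ],
}

req = urllib.request.Request(
    PSM_URL,
    data=json.dumps(policy).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # left commented: no live PSM to talk to here
print(json.dumps(policy, indent=2))
```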
Info
Channel: Tech Field Day
Views: 4,178
Rating: 5 out of 5
Id: AM4UmiMaCKg
Length: 38min 24sec (2304 seconds)
Published: Sun Apr 26 2020