Accelerating Computational Storage Over NVMe with RISC-V

Captions
So my name is Stephen Bates. I am the Chief Technology Officer of a startup called Eideticom, and we are doing something called NVMe (NVM Express) computational storage. Today I want to talk about how we can accelerate computation and computational storage in systems that are based on RISC-V processors. So, a quick little piece of mathematics: RISC-V + Eideticom + NVM Express = awesome, and I am going to spend the rest of this talk trying to justify why that is the case.

First I'm going to talk about what RISC-V is. Ha, that's a joke; you wouldn't be here unless you already knew more about RISC-V than I probably do. I'm going to talk a little about something called NVM Express. It's 4 in the afternoon and I'm sure everyone's tired, but a quick show of hands: have you heard of NVM Express? Hands up if you have. Hands up if you know you have an NVM Express device in your laptop or handheld or whatever. Right, okay, good. So I'm going to talk a little about NVM Express, and about NVMe performance on a Linux-capable RISC-V SoC, actually one from SiFive. I'm going to show you how NVMe has a lot of potential but in the current SoC incarnations it actually sucks, why it sucks, and some of the things we could do to make it suck a little less. I'm going to introduce something called NVMe computation, which is an area of active development and standardization right now. Then I'm going to talk about this weird thing called peer-to-peer DMA (p2pdma) and the work we've been doing in the Linux kernel to enable that for a variety of architectures, including RISC-V. And then I'm going to finish off with some examples of what happens when you bring all this stuff together, and show essentially how we're moving to a paradigm where you don't necessarily need an Intel Xeon in your system to get a lot of work done. Heterogeneous computing, changes in computer architecture, and challenges with Dennard scaling and Moore's law are pushing us towards a model where having a bunch of high-speed devices communicating directly with each other, under the supervision of a Linux-capable device, makes more sense. So I'm going to show you how we can actually do that today with RISC-V, and going forward things might even get a little better.

What is NVMe, or NVM Express? Some of you have heard of this before. Quite a few years ago there was an amazing company called Fusion-io (which was bought by SanDisk, which was bought by WD), and what they did is put flash on a PCIe endpoint. They took NAND flash and stuck it on a PCIe device, and it was shit-tons fast. People were used to getting data from spinning disks, and this thing was suddenly way, way faster, and everybody loved it: my applications are running faster, I'm no longer bound by I/O. So other companies started producing PCIe solid-state drives as well, but everyone had their own driver, and that's a real pain in the ass. I don't want to have to install a Micron driver because I buy a Micron drive; I don't want to install an Intel driver because I bought an Intel drive. So what did we do? We standardized something called NVMe, and we went to all the drive vendors and said: if you build a PCIe-attached SSD, for God's sake make it align to the NVMe spec. That means we have one driver, one management stack, one way for applications to talk to it, and the world would be happy. And that's basically what happened.
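To make the "one driver, one way for applications to talk to it" point concrete, here is a minimal user-space sketch (not from the talk) that asks an NVMe device for its Identify Controller data through the standard Linux NVMe passthrough ioctl, the same interface regardless of which vendor built the drive. The device node /dev/nvme0 is an assumption; run it as root against whatever controller you have.

/* identify.c: minimal sketch, Identify Controller via the stock Linux NVMe
 * driver's passthrough ioctl. Build: cc -o identify identify.c */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "/dev/nvme0";  /* assumed node */
	int fd = open(dev, O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	unsigned char data[4096] = { 0 };            /* Identify data is 4 KiB */
	struct nvme_admin_cmd cmd = { 0 };
	cmd.opcode = 0x06;                           /* Identify */
	cmd.addr = (uint64_t)(uintptr_t)data;
	cmd.data_len = sizeof(data);
	cmd.cdw10 = 1;                               /* CNS = 1: Identify Controller */

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("NVME_IOCTL_ADMIN_CMD");
		close(fd);
		return 1;
	}

	/* Bytes 24..63 of the Identify Controller structure hold the model number. */
	printf("model: %.40s\n", (char *)&data[24]);
	close(fd);
	return 0;
}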
So NVMe is now a very successful standard. In fact, this year I believe is the first year where more SSDs will ship with NVMe as the interface than with SATA and SAS combined, so we've kind of won that war. We like PCIe because it's directly attached to the processors: there are 13 million servers getting shipped this year, and three of them will have Gen-Z as an interface and 13 million of them will have PCIe. So it's a very successful protocol, and NVMe builds on top of that. We've even extended it so we can now communicate NVMe commands over something like Ethernet, over RDMA and TCP/IP. I can have a client that thinks it's talking to a local drive, encapsulate that over an Ethernet network, and have it act on a remote device. That's called NVMe over Fabrics, and I'll talk a little about that at the end of the talk.

There is superb support in Linux. Everyone in here who's building an accelerator: you will have to hire someone to write a PCIe driver, and you will have to upstream that driver so your customers can take advantage of your wonderful accelerator. I don't have to do that. My startup is aligned to NVMe, so I get some of the best kernel developers on the planet writing my driver for me, and I don't even have to pay them; and they're doing it for VMware, and they're doing it for Windows. So my little tip of the day is: if you're doing a PCIe RISC-V accelerator, think about doing NVMe, and you'll save yourself a driver developer or five.

There's a healthy roadmap of upcoming features. Low latency: we have devices that are based on something that's not NAND, which means it's AND (Boolean joke; everyone's asleep today). Computational storage: not just storing data but perhaps doing mathematical functions on that data while it's on the NAND, which is something we're working on. Things like I/O quality of service. You can go to the nvmexpress.org website and download the spec; it's an open standard, very like RISC-V. You can be a member company, you can have voting rights, everything ends up in the public domain, so there's a lot of similarity between RISC-V as a political entity and NVM Express. There are a lot of different form factors for NVMe, and they are also standardized. There are a lot of performance features that are great, like multiple hardware queues, and there's a really good management stack. That's important: if I want to deploy 10,000 accelerators in my data center, I need to manage them. I need to know when they fail, how to make LEDs change color so IT operators can go in and switch them out and replace them with new accelerators. People like Facebook and Amazon care about manageability at scale, and NVMe has a really good management stack.

But if I take an NVMe drive which is capable of doing quite a few gigabytes per second of I/O and I plug it into a RISC-V SoC, the performance is... well, as we tried to increase the performance, we found that we saturated at about 200 megabytes per second. This was using a very recent version of the Linux kernel and the Freedom SoC board from SiFive. Try as I might, even though the SSD is more than capable of generating much more data than this, I could not get anything more than 200 megabytes per second using the Unleashed expansion card with all the PCIe stuff.
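The talk doesn't say how the 200 MB/s figure was measured; fio is the usual tool, but as a rough, hedged illustration, a single-threaded O_DIRECT sequential read loop like the sketch below (my own, with an assumed device path) is enough to see the same kind of saturation on whatever drive and SoC you have.

/* seqread.c: rough sequential-read throughput probe for a block device.
 * A sketch only; fio is the more rigorous tool. Assumes /dev/nvme0n1 exists.
 * Build: cc -O2 -o seqread seqread.c    Run as root: ./seqread /dev/nvme0n1 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE  (1 << 20)          /* 1 MiB per read */
#define TOTAL     (1ULL << 30)       /* read 1 GiB in total */

int main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "/dev/nvme0n1";
	int fd = open(dev, O_RDONLY | O_DIRECT);   /* bypass the page cache */
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, BUF_SIZE)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	unsigned long long done = 0;
	while (done < TOTAL) {
		ssize_t n = read(fd, buf, BUF_SIZE);
		if (n <= 0) { perror("read"); return 1; }
		done += (unsigned long long)n;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", done / secs / 1e6);

	free(buf);
	close(fd);
	return 0;
}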
And it wasn't just a case of PCIe bandwidth; there was some other stuff going on that we found rather interesting and rather challenging. NVMe is designed to be pretty performant, and it is designed to be architecture-agnostic, but there are a couple of things happening on the RISC-V SoC that were causing problems. One of them is that it doesn't have the widest DDR controller on the planet, so it has some limitations in how much of this DMA traffic it can pump out to the DRAM buffers that are allocated for those DMAs. The PCIe bandwidth is a factor: there is some Gen 2 on that expansion card, and Gen 2 is a few hundred megabytes per second per lane, so it's not the bottleneck, but it's probably contributing. The CPU cores are, for want of a better word, wimpy, so it takes them a certain amount of time to issue an NVMe command (basically to kick off a DMA), and there are only so many of those per second they can do, so the CPU is actually working quite hard to drive the DMAs.

There are a couple of other issues that I've highlighted in red that are interesting, because they're things we might do something about in the future. The virtual memory layout for RISC-V, not just for the SiFive part but for RISC-V in general, is really, really bad, and some of the choices that have been made around where the linear mapping sits versus where the virtual mapping starts are not great, especially when you have a 32-bit root port on a 64-bit device. Another challenge is that we don't have an IOMMU in these SoCs, so if you want to DMA to an address above 4 GiB but you only have a 32-bit root port on your PCIe, you basically have to bounce everything through memory. We have something in Linux called the software IOTLB (swiotlb) to do that, but that's basically memcpys, and memcpys are prohibitively expensive on an SoC like this because you're doing load, load, load, load, load, store, store, store, store, store. So all of these things contribute, and we ended up at about 200 megabytes per second. Having really good PCIe is one thing; having it actually perform is another, and there are things we need to dig into to make that better.

All right, so that was doing a traditional DMA: we were tasking the NVMe drive to copy some data from the drive and put it in physical memory, DDR connected to the SoC. Until the latest Linux kernel, until 4.20, that has always been the path a DMA must take when you do a DMA in Linux; if you tried anything else in any kernel prior to 4.20, you would get an error code. What we've done in the 4.20 kernel is introduce a new framework that we call p2pdma, peer-to-peer DMA, and this is upstream. 4.20 is still in the RC cycle, but as soon as it's stamped it will all be there, and it's already in the RCs, so you can go play with it right now. What we do is enable a new path for DMAs where you can pass in an I/O memory address and not have the DMA layer reject it, so you can actually pass in a PCIe BAR as the destination for a DMA. Now why is that interesting? Because now I can move data directly from one PCIe endpoint to another without having to bounce through system memory and hit some of those performance limitations I mentioned on the previous slide. Another interesting thing is that in NVMe we have a standard, optional feature for exposing a BAR in a way that presents some of the attributes of that BAR to the operating system so it understands its capabilities. We call that a controller memory buffer (CMB).
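For driver writers, the flavour of the 4.20-era p2pdma API looks roughly like the sketch below: a provider driver registers part of a BAR (a CMB, say) as peer-to-peer memory, and a client driver allocates from a published provider and uses the resulting bus address as a DMA target. This is a simplified sketch based on my reading of the framework, not actual NVMe driver code; Documentation/driver-api/pci/p2pdma.rst in the kernel tree is the authoritative reference, and the exact helper names here should be checked against it.

/* Hedged sketch of the Linux 4.20 p2pdma API (kernel code, not user space).
 * Error handling trimmed; see Documentation/driver-api/pci/p2pdma.rst. */
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Provider side: e.g. an NVMe driver exposing a CMB that lives in BAR 4. */
static int provider_register_cmb(struct pci_dev *pdev, size_t cmb_size)
{
	int rc;

	/* Hand the BAR (or part of it) to the p2pdma framework... */
	rc = pci_p2pdma_add_resource(pdev, 4 /* bar */, cmb_size, 0 /* offset */);
	if (rc)
		return rc;

	/* ...and make it visible to anyone looking for p2p memory. */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

/* Client side: some other PCIe driver wanting a DMA buffer that is *not*
 * in system DRAM, so the transfer never touches the SoC's DDR. */
static int client_use_p2pmem(struct device *client_dev, size_t len)
{
	struct pci_dev *provider;
	void *buf;
	pci_bus_addr_t bus_addr;

	/* Find a published p2pmem provider reachable from this client. */
	provider = pci_p2pmem_find(client_dev);
	if (!provider)
		return -ENODEV;

	buf = pci_alloc_p2pmem(provider, len);
	if (!buf) {
		pci_dev_put(provider);
		return -ENOMEM;
	}

	/* The bus address is what gets programmed into the peer's DMA engine. */
	bus_addr = pci_p2pmem_virt_to_bus(provider, buf);
	(void)bus_addr; /* ...program a DMA descriptor with it here... */

	pci_free_p2pmem(provider, buf, len);
	pci_dev_put(provider);
	return 0;
}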
NVMe drives are starting to appear on the market, for a variety of reasons, with these controller memory buffers, and surprise, surprise, our product has a very good CMB. We leverage that, because now I can do DMAs directly between my endpoints. It's worth noting that what we've done in the Linux kernel is not NVMe-specific: any PCIe driver at all can, if it wants to, choose to play in the peer-to-peer DMA space. So if you have a PCIe endpoint and a Linux driver for it and you want to play in this space, as of 4.20 there is documentation in the driver-api directory that tells you the APIs you can call to get your driver to participate. We're working with GPU vendors and NIC vendors and FPGA accelerator card vendors to get their drivers updated so we can start doing more of this, and if you're interested we can talk about it more, because there are some subtleties I won't have time to get into today.

It's also worth noting that what we've done in the 4.20 kernel only works for x86, so only for Intel and AMD. It is architecture-specific: we're doing some things related to what's called memory hotplug, and that touches architecture-specific code, so it's not a case of doing it once and all architectures benefit. I'm very hopeful that some of the Arm people will do this for Arm; we've already done some proof-of-concept to show it can be done. And we have done it for RISC-V, which is what I'm showing you today. We have not yet committed to upstreaming that work, but it's on GitHub so people can take it and have a look, and we may well work towards upstreaming it for RISC-V because we think it's quite important, and I'll show why.

So peer-to-peer DMA lets us do pretty interesting things in terms of data movement, and NVMe computation lets us do interesting things in terms of manipulating the data that we're moving between these endpoints, because just moving data from device A to device B is kind of boring, but if I can do a mathematical function on that data in one of those endpoints, it gets a lot more interesting. NVMe computational storage involves taking these NVMe SSDs and adding some standards-based way of doing data manipulation on them. This leverages the fact that NVMe is a great protocol for PCIe accelerators: the endpoints don't have to be SSDs, they just have to use NVMe as the protocol by which you communicate with them. It could be a smart NIC that presents as an NVMe SSD; there's a little company called Annapurna that got bought by Amazon for exactly that reason. That's what they were doing, Amazon loved it, and they use it in AWS today. We can leverage all the goodness of the NVMe ecosystem and still connect accelerators to the host processor, and the great thing is it doesn't matter if that host processor is RISC-V, Arm, Intel, AMD, or POWER: they all have an NVMe driver, and Linux has really, really good NVMe support.

Part of what we're doing right now as a startup is proprietary, it's not standard, but as of two weeks ago SNIA, the Storage Networking Industry Association, has set up a technical working group to start standardizing this. What I'm hoping is that in a year or two from now, NVMe drives, as well as advertising their capacity, will have a standard, vendor-neutral way of advertising the computational services they can offer.
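There is no standard command set for this yet (that is exactly what the new working group is for), but to make the idea concrete, here is a purely hypothetical sketch of how an application might hand a vendor-specific "compress this range" command to such a device today, using the same Linux NVMe passthrough ioctl as any other NVMe command. The opcode 0xC0 and the meaning of the command dwords are invented for illustration; only the passthrough mechanism itself is real.

/* compute_offload.c: HYPOTHETICAL sketch of invoking a vendor-specific
 * computational-storage command over the standard NVMe passthrough ioctl.
 * The opcode and command-dword layout below are made up for illustration;
 * a real device (and eventually a standard) would define these. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDWR);   /* assumed namespace node */
	if (fd < 0) { perror("open"); return 1; }

	struct nvme_passthru_cmd cmd = { 0 };
	cmd.opcode = 0xC0;        /* vendor-specific I/O opcode (hypothetical) */
	cmd.nsid   = 1;
	cmd.cdw10  = 0;           /* hypothetical: starting LBA of input data */
	cmd.cdw11  = 2048;        /* hypothetical: number of blocks to compress */
	cmd.cdw12  = 1;           /* hypothetical: algorithm select, 1 = zlib */
	cmd.timeout_ms = 30000;

	if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) {
		perror("NVME_IOCTL_IO_CMD");
		close(fd);
		return 1;
	}

	/* The drive would do the compression internally, next to the data. */
	printf("completion dword0: 0x%x\n", cmd.result);
	close(fd);
	return 0;
}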
Those services could be fixed functions like compression, or it could be saying: I can run a Docker container, I can run a Berkeley Packet Filter program, any kind of thing. These are the companies involved in that standardization effort; you can see there are some reasonably large names there as well as some startups, and more companies have joined since I put this slide together.

So what does that mean for RISC-V? One of the things I'd like to make clear today is that using this paradigm of peer-to-peer and NVMe computation, we can decouple the performance of the CPU from the performance of the I/O path. I can do pretty amazing things in terms of gigabytes per second down at the PCIe level without having a Xeon or an EPYC or some other huge processor managing it all; I can get away with using a very small RISC-V core. We have a demo at the SiFive booth, and I suggest you come and see it, because it probably explains things much better than I can in a slide. What we do is take this SiFive SoC and connect it to a Microsemi switch (we're actually using a Gen 4 switch, one of their early Gen 4 samples they gave to us), and we can do Gen 4 PCIe traffic that's managed by this small RISC-V SoC, but all the DMA traffic goes through the peer-to-peer DMA framework and never goes above the switch. So I've completely decoupled the performance of my data plane from my control plane. My control plane can be totally wimpy, a ten-dollar SoC, and yet I can be doing multiple gigabytes per second of data movement and manipulation and storage and networking down below this Gen 4 switch. In the demo we have over there, we're doing about five gigabytes per second of zlib compression, so we're compressing data at a rate of five gigabytes per second, using the RISC-V Freedom SoC as the control plane manager. We don't have it in the demo today, but we can add things like RDMA network cards and start doing things like NVMe over Fabrics: pulling in data from the network, or pushing results out across the network, and scaling all of this up. We still have Linux managing everything, so if there's a problem or an error event, Linux can step in and try to tidy everything up, but our data plane is now completely decoupled from that.

Come see the demo; it's much better to see it in person. I promise to be over at the SiFive booth after the talks have ended today, for a little bit at least, and I can walk people through it if anyone is interested. Basically we're doing five gigabytes per second, and if we do the math on that, we're about 1,500 times more efficient than using the actual RISC-V cores to do the zlib compression. Obviously price and power have to be factored in as well, but I think it's still going to be a hundredfold more efficient to use our device than to use those RISC-V cores to do it.
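The 1,500-times figure is relative to running zlib in software on the SoC's own cores. A rough way to reproduce that software baseline is to time plain zlib compression on the host, as in the hedged sketch below; the input is synthetic, so real data will give different ratios and rates.

/* zlib_baseline.c: rough software-zlib throughput probe, to compare against
 * an offload engine. Synthetic input; a sketch, not a rigorous benchmark.
 * Build: cc -O2 -o zlib_baseline zlib_baseline.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define CHUNK   (16UL * 1024 * 1024)   /* 16 MiB input buffer */
#define ROUNDS  8                      /* compress it 8 times: 128 MiB total */

int main(void)
{
	unsigned char *in = malloc(CHUNK);
	uLongf out_cap = compressBound(CHUNK);
	unsigned char *out = malloc(out_cap);
	if (!in || !out) { perror("malloc"); return 1; }

	/* Mildly compressible synthetic data. */
	for (unsigned long i = 0; i < CHUNK; i++)
		in[i] = (unsigned char)((i * 31) ^ (i >> 7));

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	for (int r = 0; r < ROUNDS; r++) {
		uLongf out_len = out_cap;
		if (compress2(out, &out_len, in, CHUNK, Z_DEFAULT_COMPRESSION) != Z_OK) {
			fprintf(stderr, "compress2 failed\n");
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("software zlib: %.1f MB/s\n", (double)CHUNK * ROUNDS / secs / 1e6);

	free(in);
	free(out);
	return 0;
}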
And we're totally scalable, in the sense that because we're all below the switch, you can just add more of us and more NVMe drives and things scale up. We did a very similar demo to this at Supercomputing with an AMD system and got very, very large amounts of performance using the same approach.

Okay, the key thing I'd like people to take away is that bullet point at the bottom right-hand side of the slide: the I/O is at PCIe Gen 3, the CPU is at PCIe Gen 2. I've completely decoupled the performance of the processor from the performance I'm getting out of the system, and I think that's very interesting.

So, to conclusions and future work. Linux on RISC-V is a fantastic platform for PCIe and for NVM Express: open hardware allows innovation, Linux has great NVMe support for any architecture, and RISC-V lets us do things like look at root ports and host bridges and start playing around with IOMMUs. There are some issues with the incarnations as they stand today. Some of these are just because the SoCs are still very wimpy: we don't have much memory bandwidth, we don't have big PCIe. Some are choices we're making in the software stack; we may want to revisit how Linux does its virtual memory mapping for RISC-V. Peer-to-peer has no performance bottlenecks because the DMAs never actually go near the RISC-V, so we can do some very interesting things with that. One of the things we want to do next is add RDMA NICs and do some of this in a scale-out fashion across Ethernet networks. All the code I talked about today, all the peer-to-peer DMA stuff, the ports for RISC-V, the QEMU models we used to do some of the emulation work before we played with hardware, the root filesystems we installed on real hardware: all of that is available on GitHub. Take a picture of this slide, or come talk to me, and you can download it from GitHub.

So anyway, if you're interested in peer-to-peer, if you're interested in doing very high-performance stuff with wimpy cores, which I think is a very interesting computer architecture model as we move into the world of heterogeneous computing, then take a look at some of this. If you're working on accelerators based on RISC-V and they're PCIe, maybe take a look at the NVMe computation standardization effort; it might save you a lot of trouble to align to that spec so you don't have to do a lot of your own development work. And if you're interested in seeing what we can do, come and see me at the SiFive booth, maybe a little after the next talk or when we finish for the day.
Info
Channel: RISC-V International
Views: 2,074
Rating: 5 out of 5
Id: LDOlqgUZtHE
Length: 20min 39sec (1239 seconds)
Published: Thu Dec 13 2018