Getting the Best Performance with Xilinx's DMA for PCI Express

Captions
Welcome to a Xilinx Quick Take video. My name is Jason Lally, Technical Marketing Manager with Xilinx. In today's video we're going to look at how to get the best performance with the Xilinx DMA for PCI Express.

The first thing we want to look at is what the factors are in achieving optimum PCI Express performance. First, make sure you select a link speed and link width appropriate for your design. The maximum the IP supports is Gen3 x16; in this video we'll be looking at a design that is Gen3 x8. Next up is the maximum payload size. The effective maximum payload size is determined by the smallest maximum payload size the system can support: while the Xilinx IP supports a 1024-byte maximum payload size, most systems on the market today support 128 or 256 bytes. When we start looking at things we can actually control in how the design operates, the first is the size of the transfers we specify: the larger the transfer, the better the performance, and we'll show you that in a minute. Next is the number of DMA channels we enable on the DMA controller. The more channels we enable, the better the performance; the trade-off is that it takes more logic resources inside the device. Finally, the last thing we'll look at is using polling rather than interrupts to achieve high performance. Polling takes away from your processor, because the processor now has to poll a location periodically to see when a transfer has completed, but you'll see that it also gives us better performance.

Now let's look at the basic operation of a DMA. First, some terminology. When we want to move data from system memory down to our PCI Express endpoint, we call that a host-to-card (H2C) transfer. Likewise, when we want to move data from our PCI Express endpoint up to system memory, that is a card-to-host (C2H) transfer. We have buffers that are allocated in system memory, we have our endpoint memory, and we have descriptors. The descriptors are what tell the DMA engine where to read data from, where to write it, and how much data there is. We'll look at an example of that in just a few minutes, but this gives you an idea of what a typical DMA system looks like.

The design we're going to look at is an IP Integrator design that I put together; it took me only 20 to 30 minutes, and it's pretty straightforward and easy to implement yourself. What I have is a DMA engine down on the bottom, with two channels turned on, using the AXI master port. That master port goes to two different memories: on-chip block RAM and off-chip DDR4. In addition, I've turned on the AXI-Lite master interface. This port maps an additional PCI Express BAR into the AXI address space, so I can issue PCI Express reads and writes to that BAR, which maps to the AXI-Lite interface, and do single-dword reads and writes to things within my system. In this case I've hooked up the System Management Wizard, which lets me monitor things like temperature and voltages. I also have an AXI Performance Monitor connected to the master port of the DMA, so I can use that if I want to know exactly what's going on inside the FPGA in terms of the performance I'm getting. Finally, I've enabled some I/O, so that if I ever want to expand this IP Integrator design to read and write I/O outside of it, I can do that.
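Before moving to the hardware, it helps to make the descriptor idea above concrete. The following is a minimal C sketch of what a scatter-gather DMA descriptor typically carries; the field names, widths, and control bits are illustrative assumptions for explanation only, not the exact layout used by the DMA Subsystem for PCI Express (its product guide documents the real format).

```c
#include <stdint.h>

/* Illustrative scatter-gather descriptor. Field names and bit positions are
 * assumptions for explanation only, not the actual XDMA descriptor format. */
struct sg_descriptor {
    uint64_t src_addr;  /* where the engine reads from: host memory for H2C,
                           endpoint (AXI) memory for C2H                      */
    uint64_t dst_addr;  /* where the engine writes the data                   */
    uint32_t length;    /* number of bytes this descriptor moves              */
    uint32_t control;   /* control flags, e.g. a STOP bit on the last
                           descriptor and a completion/interrupt enable       */
    uint64_t next;      /* bus address of the next descriptor, so the engine
                           can follow a chain that is not contiguous          */
};

#define DESC_CTRL_STOP      (1u << 0)  /* assumed: marks the final descriptor */
#define DESC_CTRL_COMPLETED (1u << 1)  /* assumed: set on writeback when done */
```

A transfer is then just a chain of these records: the driver fills them in, hands the address of the first one to the engine, and the engine walks the chain.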
The board we're going to use in this design is a KCU105, a card we've used in other videos before. It's capable of operating at Gen3 x8, which gives us about 8 gigabytes per second per direction of theoretical maximum bandwidth. I've already got this card up and running in the system: it's operating at Gen3 x8, the system's maximum payload size is 256 bytes, and this is an Intel Z77 platform. If you download Xilinx answer record 65444, that is where the driver and application reside, and there's a readme file that tells you exactly what you need to do to compile the driver. As you can see, I've sped up the video here while I do these steps, but you can follow along to compile and then load the driver. Once it's loaded, there's a performance application that comes along with it. I've modified it a little so it clears the screen each time rather than scrolling, but what you see is that we're getting pretty high numbers. We do host-to-card transfers first, sending a specific number of transfers; then we do card-to-host, and again, with the smaller transfers we don't get as high performance, but we very quickly reach the performance levels we expect. This matches the chart you see here, which is the kind of chart you see from most companies that have PCI Express DMAs, and these are really good numbers.

But what happens when we go back and run a transfer using the built-in commands? In this case we have it set up to do a DMA to the device, which is a host-to-card transfer, and we're going to send 4 kilobytes of data. When we run that command and do the calculation, we see we actually only got about 240 megabytes per second, compared to the 6.9 gigabytes per second we were expecting, which we saw previously. Maybe that's not quite right, so let's look at something a little bigger in case there's a size issue. We'll try a 32 megabyte transfer, again from the host down to the card. When we run that and do the calculation, we see 5.5 gigabytes per second, still not near the 7 gigabytes per second we're expecting. So let's take a look at why that is.

Here we see that hardware performance chart again. The way we get this chart, and the way pretty much everyone else who shows you a chart like this gets it, is by creating a string of descriptors arranged in a ring, so the operating system is never involved when we measure these hardware numbers. This is great for showcasing the performance the DMA engine is able to achieve, and it's a really good test to make sure the DMA engine can handle any type of data thrown at it, but it doesn't really give designers an idea of what they need to do or what kind of numbers they can expect in their own designs.
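Before looking at the system-level flow, it's worth pinning down what "do the calculation" means above: throughput is simply bytes moved divided by the wall-clock time around the transfer. Here is a rough C sketch of that measurement, assuming the driver from answer record 65444 exposes host-to-card channel 0 as a character device named /dev/xdma0_h2c_0; that device name is an assumption here, so adjust it to whatever your loaded driver actually creates.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Time a single host-to-card transfer and report the rate in MB/s.
 * /dev/xdma0_h2c_0 is an assumed device node for H2C channel 0.     */
int main(int argc, char **argv)
{
    size_t size = (argc > 1) ? strtoul(argv[1], NULL, 0) : 4096;
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, size) != 0)
        return 1;
    memset(buf, 0xA5, size);

    int fd = open("/dev/xdma0_h2c_0", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t done = write(fd, buf, size);   /* one H2C DMA transfer */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (done != (ssize_t)size) { perror("write"); return 1; }

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%zu bytes in %.6f s = %.1f MB/s\n", size, sec, size / sec / 1e6);
    free(buf);
    return 0;
}
```

Running something like this with 4 kilobytes versus 32 megabytes illustrates the same gap the video shows: a single small transfer pays the full descriptor-setup, interrupt, and context-switch cost, so its measured rate lands far below the link's capability.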
So let's take a look at how a system operates, to get an idea of what needs to be done to get better performance. In this case we'll show a card-to-host transfer. The driver sets up a number of descriptors, and those descriptors point to memory that has been allocated for the transfer. Once the driver has the descriptors set up, it writes a pointer to the very first descriptor down to the DMA engine and then starts the engine by setting what we call the run bit. When the run bit is set, the DMA engine goes and fetches the descriptors from system memory. If the descriptors are adjacent, it can do that with a single read; if not, each descriptor has a pointer to the next one, and the engine fetches them one at a time as needed to fill the descriptor buffer space it has available. As the descriptors are processed, the engine creates PCI Express transaction layer packets (TLPs) to write to the buffers that have been allocated to us. There will probably be more than one TLP generated per descriptor, especially if we're sending something like 4 kilobytes or larger, since anything larger than the maximum payload size has to be split across multiple TLPs. This process continues until we reach a descriptor that has the stop bit set. Once that last descriptor has been processed, an interrupt is sent from the DMA engine back to the driver saying that everything is complete. At that point the host has to do a context switch with the operating system, leaving whatever it was doing to go service that interrupt, and only once the interrupt has been serviced can the next DMA transfer begin. That is why we see such a low data rate on the 4 kilobyte transfer, and why even the 32 megabyte transfer falls short of what we expect: there is a large variance in how long that context switch takes within the operating system, and it depends on what the operating system is doing when the interrupt arrives and how long it takes to go service it.

So we've put together some performance numbers that are more typical of what you can expect to see. Using MSI interrupts, for both host-to-card and card-to-host, it's in the 2 kilobyte to 32 kilobyte range that we see a pretty big difference in the performance we get. The first thing we talked about was using larger transfer sizes, and you can see that with larger transfers we eventually get up to pretty high levels. We also talked about using multiple channels: here the blue is one channel, the red is two channels, and the green is four channels enabled. The more channels we enable, the more concurrent transactions we can have in flight, and those context switches don't hurt us as much. The other thing we talked about is using polling mode to increase performance. Here you can see that for a card-to-host transfer, anything less than about 256 kilobytes, especially when we're using only one channel, shows a pretty drastic difference in performance compared to what we would be able to achieve using interrupt mode alone.
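The interrupt-versus-polling difference comes down to how completion is detected. Below is a minimal C sketch of the two styles, assuming a hypothetical status word that the engine writes back when the last descriptor finishes; the names and mechanism are illustrative only, not the actual XDMA driver's programming model.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical completion state shared between a DMA interrupt handler and
 * the thread waiting on the transfer. Names are illustrative only.          */
static volatile uint32_t completion_status;   /* engine writeback word */
#define STATUS_DONE_BIT (1u << 0)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;

/* Interrupt style: the waiting thread sleeps until the interrupt handler
 * (not shown) sets the done bit and signals the condition. Each completion
 * therefore costs an interrupt plus a context switch back to this thread.   */
void complete_with_interrupt(void)
{
    pthread_mutex_lock(&lock);
    while (!(completion_status & STATUS_DONE_BIT))
        pthread_cond_wait(&done, &lock);       /* sleep until signalled */
    pthread_mutex_unlock(&lock);
}

/* Polling style: re-read the status word in a tight loop. There is no
 * interrupt latency and no context switch, which is why small transfers
 * complete noticeably faster, at the cost of keeping a CPU core busy.       */
void complete_with_polling(void)
{
    while (!(completion_status & STATUS_DONE_BIT))
        ;                                      /* spin until the engine marks done */
}
```

The charts in the video reflect exactly this trade-off: polling removes the per-transfer interrupt and context-switch overhead, which matters most when transfers are small and frequent.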
If we put it all together, the best performance we can get is with four channels turned on and polling mode enabled. But depending on the amount of data you plan to send in any one transaction, it may be beneficial to send more data at once and use a smaller number of channels, and thus fewer resources, less power, and everything else that goes along with a smaller DMA engine. This should give you an idea of some things you can do in your design to get optimal performance.

Let's recap. On the system side, make sure you check and select the optimum link width and speed for the amount of data you need to transfer. If you have a chance to choose the system, choose one with as large a maximum payload size as possible, as that makes a huge difference in the performance you're able to achieve. On the DMA side, choose the largest transfer size possible for your application, then use additional DMA channels to get to the level you need to be at, and finally you have polling a location, rather than waiting for interrupts, as an option to get better performance. All of this is covered in answer record 68049, which includes a write-up that goes through these different charts, so if you get a chance, go take a look. Thanks for watching the video.
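As a closing note on the first recap item, the "about 8 gigabytes per second per direction" quoted earlier for Gen3 x8 follows directly from the link parameters: 8 GT/s per lane, 128b/130b encoding, eight lanes. The short sketch below just does that arithmetic; TLP and DLLP protocol overhead and the system's maximum payload size pull the achievable DMA throughput below this raw figure, which is why numbers around 7 gigabytes per second are the realistic ceiling in this demo.

```c
#include <stdio.h>

/* Raw one-direction PCIe link bandwidth in GB/s: transfer rate per lane,
 * scaled by the line-coding efficiency and the lane count, divided by 8
 * to convert bits to bytes. Protocol overhead is deliberately ignored.   */
static double link_gbytes_per_s(double gt_per_s, double coding, int lanes)
{
    return gt_per_s * coding * lanes / 8.0;
}

int main(void)
{
    /* Gen3: 8 GT/s per lane with 128b/130b encoding */
    printf("Gen3 x8 : %.2f GB/s per direction\n",
           link_gbytes_per_s(8.0, 128.0 / 130.0, 8));   /* ~7.88 GB/s  */
    printf("Gen3 x16: %.2f GB/s per direction\n",
           link_gbytes_per_s(8.0, 128.0 / 130.0, 16));  /* ~15.75 GB/s */
    return 0;
}
```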
Info
Channel: XilinxInc
Views: 10,836
Rating: 5 out of 5
Keywords: PCI Express, PCIe, PCIe DMA Subsystem, XDMA, KCU105, DDR, Gen3, x8, Direct Memory Access Posting
Id: WcEvAvtXL94
Length: 13min 5sec (785 seconds)
Published: Tue May 09 2017