Understanding the Performance of DPDK as a Computer Architect

Video Statistics and Information

Captions
Alright, thanks for the introduction, and good morning, folks. Welcome to this presentation. The topic I bring here today is understanding the performance of DPDK as a computer architect. There are multiple contributors to this work; it is a joint effort between the University of Massachusetts Lowell and Intel Corporation. The first three authors, myself included, are with UMass Lowell, and the others, including John Morgan, are with Intel. As a team of both DPDK users and developers, we feel it is important to bring this topic to the wider community.

I assume everyone here today has a networking background, right? So let me try the classic joke about TCP that everyone will get... okay, I see some smiling faces. As network people we tend to think in terms of the network, but sometimes it is more precise, and more important, to think from other perspectives. This work tries to give you a new angle: understanding the performance of DPDK from a computer architect's perspective.

First of all, here is a quick agenda for my talk today. We will first touch on the background and motivation of this work; then I will introduce some basics about the OVS architecture and memory hierarchies; then I will talk about the experiment setup as well as the test methodologies; then I will present the performance evaluation of vanilla OVS versus OVS-DPDK; then we will talk about the multi-socket platform impact analysis; and finally we will reach the conclusion and some key takeaways.

A typical data center or cloud relies heavily on virtualization technology in order to fully utilize the hardware. In this setting, Open vSwitch is a key connectivity component that provides networking for the virtualized hardware; OpenStack and OpenNebula are good examples. The problem is that as line rates keep increasing, from 10 Gb/s all the way to 100 Gb/s, OVS has a hard time keeping up, and that is why Intel released the DPDK-accelerated version of OVS to address this issue.

Over these two days we have seen a lot of talks showing that its performance is higher than vanilla OVS, but why? What is the magic behind it? In this work we try to explain the reason from a new angle: we will look at the cache behavior and the context switches of these two pieces of software.

I prepared a couple of introduction slides just to make sure everybody is on the same page. This is a typical application scenario of OVS in a cloud or data center. In this figure we have physical racks connected by physical switches, and on each physical rack we have multiple virtual machines. All of those virtual machines are connected by virtual switches, so a virtual machine can talk to other VMs via the virtual NICs, and it can talk to other physical racks over the physical NICs. If we zoom into one typical component of this diagram, we get a more detailed view.

On the right-hand side, this diagram contains two domains: a user domain and a kernel domain. In the user domain we have the virtual machine, which contains the virtual Ethernet port, the virtio driver, and so on; in the kernel space we have Open vSwitch, the TAP device, and the vhost backend. We can observe two basic communication scenarios from this diagram.
The first scenario is two virtual machines trying to communicate with each other within one physical rack; that is the VM-to-VM case. In the other scenario the traffic leaves this hardware for another machine; that is VM to physical NIC to another virtual machine. We will talk more about these two scenarios in our evaluation; our test environment is basically built upon them.

Now, an I/O comparison between OVS and OVS-DPDK. On the left-hand side is the OVS data path. Again we have user space and kernel space, and vanilla OVS typically contains two basic components: a kernel module that resides in kernel space, and an OVS daemon running in user space. A typical data path goes from a virtual machine to the kernel module and on to the outside world, and if the rule is not cached in the kernel, the kernel module has to ask the daemon for the rule. This involves a lot of user-to-kernel context switches; we will give you some concrete numbers on this case in a later slide.

If you look at the right-hand side, that is the OVS-DPDK version. OVS-DPDK has its own implementation of the data path: it has the virtual switch forwarding plane and the poll-mode driver, both in user space. So the traffic can either stay within user space, going through the vSwitch forwarding plane, or, if it wants to talk to the outside world, bypass the kernel and go straight to the physical NIC. Neither of these paths involves many user/kernel context switches, and again we will give you the numbers for it.
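To make the poll-mode idea concrete, here is a minimal sketch of what a poll-mode receive loop looks like. This is not the actual OVS-DPDK code, just an illustration, and it assumes the DPDK environment, port, and queue have already been initialized elsewhere (rte_eal_init, rte_eth_dev_configure, rte_eth_rx_queue_setup):

/* Minimal sketch of a poll-mode receive loop (illustration only, not the
 * OVS-DPDK source). The core busy-polls the RX ring entirely in user
 * space, so the packet path never traps into the kernel and causes no
 * user/kernel context switches. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void poll_rx_queue(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Returns immediately with 0..BURST_SIZE packets; no blocking, no syscall. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Here OVS-DPDK would do the flow-table lookup and forward the
             * packet; in this sketch we simply free the buffer. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}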
Now some background knowledge. A typical processor has multiple cores; each core has a private L1 and L2 cache, then there is a shared L3 cache, also called the last-level cache or LLC, and beyond that the memory. The further we go away from the core, the more latency we see. Taking the numbers from a typical Intel Skylake processor, the data access latency grows from about 12 cycles for L2 to about 44 cycles for L3 and all the way to roughly 140 cycles for memory at 2 GHz (140 cycles at 2 GHz is roughly 70 ns), and the bandwidth of the communication also decreases the further out we go, as we can see from the numbers in the table.

Based on this discussion, here is our testbed setup. We have two categories. The first is the guest-to-guest case, which we call VM-to-VM communication: one physical host runs two virtual machines, and these two virtual machines talk with each other over OVS or OVS-DPDK. In the second category we have guest-to-host communication, or in our case VM-to-host: there are two physical hosts, one physical host runs a single virtual machine with either OVS or OVS-DPDK, and the other host runs the iperf server. The benchmark we are using is iperf, so in both categories we have an iperf server and an iperf client talking to each other.

We tested this setup on the following hardware: a Supermicro server with an 8-core processor, each core running at 2 GHz, with L1 cache, L2 cache, last-level cache, and memory, and the physical NIC is an Intel 10 GbE NIC. It runs Ubuntu 16.04, OVS version 2.5.0, and DPDK version 16.04. All the VMs, as you can see on the previous slide, are created by KVM and emulated by QEMU. We run the iperf tests in this environment and use profiling tools to gather data: Linux perf and the Intel VTune Amplifier. Just out of curiosity, how many of you are using VTune Amplifier for your projects? Good, I see a couple of hands. It is a very good tool that can trace back into your source code and do detailed performance profiling. The iperf tests run as follows: for the four different experiments we have one iperf server and one iperf client, and the only differences between the four cases are whether it is VM-to-VM or VM-to-host, and whether it runs over OVS or OVS-DPDK.

Now, the first evaluation. In this slide we show the relationship between throughput and IPC; IPC stands for instructions per cycle, which is a computer-architecture metric. On the x-axis we have four cases: from left to right, VM-to-VM and VM-to-host over OVS are the first two, and the right two are the OVS-DPDK cases, again VM-to-VM and VM-to-host. The y-axis gives the actual numbers in their respective units. The blue bars are the throughput comparison. If we compare the first blue bar with the third blue bar, that is the comparison between OVS and OVS-DPDK for the VM-to-VM case, and you can see a 5.5x throughput increase. If we compare the second and the fourth blue bars, that is the VM-to-host case, and we observe at least a 3x throughput increase. The orange bars are the IPC numbers. For a typical four-issue architecture the ideal IPC is 4. Without OVS-DPDK the IPC is much lower than one, around 0.2 or 0.3, but for the OVS-DPDK cases the number is three to four times higher and greater than one. So the key takeaway from this figure is that with OVS-DPDK we observe up to 4.5 times more throughput, and the IPC is about four times higher as well.

The second evaluation is about cache behavior. Again the x-axis shows the four scenarios and the y-axis the numbers in their respective units: the left y-axis is the cache references in millions of references per second, and the right y-axis is the L1D cache miss rate. Looking at the blue bars, if we compare the first bar and the third bar we observe at least seven times more cache references, and comparing the second and the fourth bars it is about eight times more. That means OVS-DPDK's behavior is more cache friendly: it generates many more cache references. And if we look at the L1D cache miss rate, for the second and fourth cases we observe roughly a 50% decrease. So the key takeaway from this figure is that with OVS-DPDK we have more cache references and fewer L1 cache misses, and this is because of the software prefetching mechanism designed into DPDK.
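As a rough illustration of that prefetching pattern (this is not the DPDK or OVS source, just a sketch of the idea): while packet i is being processed, the buffer for packet i+1 is prefetched, so by the time it is needed the data is already in the L1 cache.

#include <stddef.h>
#include <stdint.h>

struct pkt { uint8_t data[64]; };

/* Process a burst of packets, prefetching the next buffer one iteration
 * ahead. DPDK code typically uses rte_prefetch0(), which compiles down to
 * the same kind of prefetch instruction as __builtin_prefetch(). */
static unsigned process_burst(struct pkt *pkts[], size_t n)
{
    unsigned acc = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(pkts[i + 1]->data, 0 /* read */, 3 /* keep in cache */);
        acc += pkts[i]->data[0];  /* stand-in for real header parsing */
    }
    return acc;
}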
The third evaluation is about the last-level cache and the TLB. Again the x-axis shows the four scenarios, and this time the y-axis is a normalized number: the blue and orange bars are reference rates, and we divide all the numbers by the first group, so the first group is normalized to one. Looking at the LLC performance, we observe three to six times more accesses in the OVS-DPDK cases. And if we look at the yellow bars, which stand for the data TLB miss rate, we can see that, perhaps surprisingly, with OVS-DPDK the TLB miss rate is almost zero, only a tiny bit above zero. So the key takeaway here is that with OVS-DPDK we have more LLC accesses and near-zero TLB misses, and this is thanks to the design around huge pages: with huge pages we do not suffer many TLB misses, and we are not penalized with page walks, which are very expensive.
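To illustrate the hugepage effect (a sketch only; DPDK itself reserves huge pages through its EAL and hugetlbfs mounts rather than this exact call): one 2 MB huge page is covered by a single TLB entry, whereas the same region mapped with 4 KB pages would need 512 entries, so TLB pressure, and with it the page-walk penalty, largely disappears.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_SZ (2UL * 1024 * 1024)   /* one 2 MB huge page */

int main(void)
{
    /* Ask the kernel for a hugepage-backed anonymous mapping; this fails
     * unless huge pages have been reserved (e.g. via vm.nr_hugepages). */
    void *buf = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* Packet buffers / mempools placed here share a single TLB entry. */
    munmap(buf, HUGE_SZ);
    return 0;
}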
Later on we realized that cross-socket communication might be very interesting, because modern data centers employ multi-socket platforms, which are used to scale performance within a power budget. So the question is: how do vanilla OVS and OVS-DPDK behave on such a multi-socket platform?

This time we use a different hardware setup: a two-socket server, where each socket has a Xeon E5 processor with six cores, each core running at 2.4 GHz, along with its L1 cache, L2 cache, last-level cache, and memory. With this hardware we again have four different configurations, in two categories, as shown in the figure. In the first category the communication stays within one socket: we have two VMs residing on one socket, and these two VMs communicate with each other over OVS or OVS-DPDK. In the second category we have two VMs pinned to two different sockets, and these two VMs talk with each other over OVS or OVS-DPDK. Again we run the iperf benchmark for each of these four configurations.

This diagram contains a lot of information; basically it is a throughput comparison plus a cache-behavior comparison. On the x-axis are the four configurations, from left to right: OVS on the same socket, OVS on different sockets, OVS-DPDK on the same socket, and OVS-DPDK on different sockets. The y-axis shows the actual numbers in their respective units. If we look at the blue bars, which stand for the throughput, and compare the first blue bar with the second, we see roughly a 1.3x drop in throughput when moving to different sockets, and likewise comparing the third blue bar with the fourth we again see a throughput decrease. That means a cross-socket placement will actually hurt your performance. And if we look at the LLC cache references, the gray and yellow bars, the cache references are actually lower when running on different sockets. So I believe the key takeaway here is that on a multi-socket platform, if you have very high-bandwidth communication, the better way is to run on the same socket: in the same-socket case you get higher bandwidth and better LLC behavior.

If you remember, in our introduction slides we talked about the context switches for OVS and OVS-DPDK, and this diagram shows the data. If we compare the first group with the second group, you can see that the OVS-DPDK context switches drop dramatically, something like four to five times fewer context switches. And if we look at the same-socket versus different-socket cases, say comparing the first bar with the second bar, the difference is minimal. So the message from this slide is that OVS-DPDK users will see far fewer context switches, because of the poll-mode driver and the user-space data-plane design of DPDK, and a cross-socket placement is not a root cause of context switches.
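For reference, context-switch counts like these can be read from standard kernel accounting; here is a small sketch (not part of the study's tooling, which used Linux perf) that prints the counters for the calling process:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    /* ru_nvcsw counts voluntary context switches (e.g. blocking in a
     * syscall), ru_nivcsw involuntary ones (preemption by the scheduler).
     * perf's "context-switches" event reports the same kind of numbers. */
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("voluntary: %ld, involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}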
Finally, the conclusion. In this work we conducted a thorough performance analysis of vanilla OVS versus OVS-DPDK from a new angle, and we saw that OVS-DPDK improves system performance by increasing the IPC and the cache references, decreasing the cache misses by means of the software prefetching mechanism, decreasing the TLB misses by using huge pages, and reducing the context switches thanks to the user-space driver. We also studied multi-socket platforms: a cross-socket placement may lead to lower system throughput and fewer LLC accesses, but the cross-socket placement itself is not the root cause of the context switches. That is all I have prepared for today, and I am open to any questions from the audience.

Audience: I actually had three questions or observations. I didn't catch the packet size for your numbers.

Speaker: That's a good point. We use the default iperf configuration, so I believe it uses a large packet size, around fourteen or fifteen hundred bytes.

Audience: Then an observation: in your VM-to-host case you are kind of understating the benefit, because when you think about it you have to go from the VM through OVS-DPDK on one machine, then through OVS-DPDK on the other machine, and back to the host. There are actually two OVS-DPDK instances in the sequence, so it is actually better than what you said. And the last question: you showed ten or eleven gigabits between VMs on the same socket or dual socket, but if you have 40-gig links coming in, isn't that a huge bottleneck? You come in on a 40-gig link and the only VM-to-VM you can do is 10 or 11 gig. How do you solve or improve that when we go to 40 or 100 gig?

Speaker: We actually haven't run any experiments at 40 gig or 100 gig, but that is definitely a next step. As far as I can see, the performance trend should be consistent in either case because of the design itself; but that is my assumption.

Audience: But your VM-to-VM is not limited by the 10-gig pipe, right?

Speaker: Well, remember that is only the VM-to-VM case; that is the maximum capacity for that machine.

Audience: Right, but if you have a service-chaining application where you go firewall to load balancer to firewall to IPS to QoS or something, then you have multiple 10-gig connections between the VMs. So there is also the question: if you run multiple VM-to-VM communications, does it drop below 10 gig?

Speaker: Right, that's a good point, and we would definitely like to look at that. Thank you.

Audience: It would be really interesting to get some numbers on the CPU utilization and the power consumption of the two, because that is an area where you have a lot of indirect costs that are not easily measurable directly.

Speaker: Right, we are actually planning another set of experiments to do the power analysis. The CPU utilization is not difficult, but for the power consumption we need to do some hardware configuration. It would definitely be interesting to see that result.

Audience: I have a question about your multi-socket tests with OVS-DPDK. Can you tell us how you did the hugepage mapping: was it pinned to one particular socket, and does it even matter?

Speaker: Sorry, you mean when using two sockets?

Audience: Yes. Where do you pin the huge pages, do you pin them on both sockets equally, and does it matter? Looking at your LLC misses in the same-socket and different-socket cases, the numbers are quite comparable. I am wondering if you did an even hugepage allocation on both, which is what happens by default: when you allocate huge pages, they are picked from both sockets. Did you do anything to pin them to one particular socket, and do you see any difference?

Speaker: Oh, you mean whether we create huge pages across different sockets. That's a good question. We actually create the huge pages within the same socket when running the VM; I don't think we did that across sockets, but if that is interesting to see, we can do it.

Audience: Thanks. I have one question regarding the iperf model for the VM-to-host case. In the diagram of the iperf test setup you have the host-to-VM direction, so I am wondering why you run the iperf server on the host instead of running the iperf server in the VM with OVS-DPDK.

Speaker: Actually we did both.

Audience: Okay, so your results represent both scenarios and they are consistent?

Speaker: Yes.

Audience: Okay, thank you.

Speaker: All right, if there are no more questions, I am glad to talk about this research offline. Thanks for your attention.
Info
Channel: DPDK Project
Views: 13,624
Rating: 4.7938147 out of 5
Id: VdskkbCzglE
Length: 31min 8sec (1868 seconds)
Published: Wed Aug 31 2016