2.6: DPDK Overview

Captions
My name is Cristian, I'm from Intel. I'm a software architect involved in the Data Plane Development Kit; I wrote a few components from the start of the project, years and years ago, and I have some personal involvement in this project, a few babies that I put into it. Today I have two sessions back-to-back: one is an overview of DPDK, the other one is on optimizations, getting a bit into describing the software optimizations that we do. I have presented these things a number of times to different people, so tell me if you need me to drill more into things, if you have questions, or if we should skip some of these things because you know them already.

It's a pleasure to talk about DPDK, which was actually mentioned a lot over the last few days. At this point probably everybody knows that DPDK sits at the foundation of VPP: it provides the poll mode drivers for getting packets into and out of the system, and it provides the infrastructure for setting up some of these resources, setting up the system. Hopefully other cool components from DPDK will also make it into VPP. There is some overlap in terms of functionality: there are libraries, concepts, or functions that you might see implemented in both places, in DPDK as well as in VPP. Hopefully over time we'll converge towards single implementations for some of these things. That situation exists because, as Damjan said, the VPP project started a long while ago, maybe even earlier than DPDK; there is a lot of legacy code from Cisco, whereas in DPDK we started with the code basically from scratch. So sometimes you get several implementations of the same thing, but that should not be an issue, just an opportunity to improve.

So, DPDK: an open-source project, well known at this point, a very successful data plane packet processing project. The major contributors you can see over there; I'm sure most of you are already getting emails from the DPDK mailing list. Is there anybody here who is not using DPDK? OK, you seem to be the exception, which is actually good; we can talk later about why that is. A lot of companies are involved; you can see a long list of contributors. We have users, but we also have contributors who actively develop and submit patches that get included into DPDK. On the left-hand side you can see some statistics about the number of contributors and the number of commits; I'm sure that Vance and Tomah will also present more numbers about DPDK and how successful the project is during the next presentations. You can see a long list of companies from different fields: it's not just chip vendors, we also have equipment manufacturers and other folks over there.

Although DPDK started with Intel in mind, it was originated by Intel and initially implemented for Intel CPUs and Intel NICs, nowadays it is multi-architecture across CPU architectures, and it also provides support for NICs from different vendors. Here is a quick roadmap of the DPDK releases and how we added support for different CPU architectures and different NICs along the way. Nowadays we have Intel, but we also have ARM, Power, and Tile architectures. In terms of NICs we have Intel NICs, but we also have Mellanox, Cisco, Chelsio, Broadcom, and Netronome NICs. So nowadays DPDK is the place to be, and most of the NICs are already there; you don't need to go and search for them elsewhere.

This is a quick map of the various components that we have in DPDK. So what exactly is DPDK? As was mentioned before, it's a set of libraries for improving the performance and functionality of data plane products, packet processing products, for a wide range of applications across the board, not just one particular application. We can look at all these components: you can see a lot of libraries over there. By the way, we also have sample applications, examples that show you how to use those libraries. VPP, you could say, is one of the big users of DPDK; it is not a sample application, of course — it's a stack, a very complex system that incorporates DPDK. Maybe it's not the code to start looking into if you just want to start with DPDK; how VPP uses DPDK is probably not the easiest starting point. But I'm doing this as well: I'm ramping up on the VPP code and trying to understand how DPDK is used over there, how the poll mode drivers are used. As Damjan presented today, there are a number of thread combinations in VPP, and they all use DPDK in slightly different ways, and that's also evolving.

On the left-hand side you can see a number of components that we call core components, core libraries. For example EAL, the environment abstraction layer: that's one of the components that VPP is also using, maybe not entirely, maybe with some changes. What it does is set up the system for you. As DPDK runs in user space, we don't have to do a lot of work ourselves; we basically leverage all the hard work that the Linux kernel did. So what is EAL doing? For example, it helps with CPU enumeration and identification: we need to understand how many CPU cores we have, how many hyper-threads we have, and how they are mapped to the CPU sockets. Do we have one CPU socket, or two, or four, and how are they connected together? That is straightforward: we don't need to rely on the BIOS, we can simply rely on the Linux file system — look into /proc/cpuinfo and get that information from there; we just pick it up in DPDK and start from there.

Memory is also set up at this point, since we use large buffer pools for packet processing: for every packet that is received we also need to provide a free buffer back to the NIC for another packet to be stored. DPDK uses the traditional allocation mechanism of buffer pools: you take a big amount of memory and chop it into fixed-size slabs, and each slab is able to store one packet. That's why we need to pay a lot of attention to getting the memory set up right: to avoid issues with TLB thrashing, we need to use huge pages. That's a big topic, and we probably won't spend that much time on it right now. As Damjan also said, VPP is using 1 GB huge pages. Huge pages are actually a hardware feature of the CPU, so you have to check with your CPU manufacturer whether this feature is supported; on Intel CPUs we support 2 MB pages and 1 GB pages in DPDK.

Why is that important? Because CPUs work with virtual addresses — your program works with virtual addresses — but whenever you need to communicate with the I/O layer, the NIC actually uses physical addresses, so this translation needs to happen somewhere. It happens on the CPU, it happens on the platform: we have CPU-side translation lookaside buffers, abbreviated TLBs, and there are I/O-side TLBs to do these translations. There are several translations involved in mapping a virtual address that your program uses to a physical address that your system uses, and this needs to happen fast, because for every packet you need to know this address. The default page size of 4 KB is simply not able to deliver the right performance in this setup.
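As a back-of-the-envelope check of the hugepage argument above, here is a small self-contained C sketch (not DPDK code; the function name is invented) showing how many pages a pool of fixed-size packet buffers touches at different page sizes — fewer pages means fewer TLB entries are needed to cover the whole pool:

```c
#include <assert.h>
#include <stdint.h>

/* Toy arithmetic: how many pages does a pool of n_bufs fixed-size
 * buffers need, if buffers never straddle a page boundary? */
static uint64_t pages_needed(uint64_t n_bufs, uint64_t buf_size,
                             uint64_t page_size)
{
    uint64_t bufs_per_page = page_size / buf_size;
    return (n_bufs + bufs_per_page - 1) / bufs_per_page; /* round up */
}
```

With 2 KB buffers, a million-buffer pool needs half a million 4 KB pages, but only two 1 GB pages — which is why a tiny TLB can cover the whole pool when huge pages are used.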
For packet processing, a 4 KB page will only allow you to store maybe one or two packet buffers per page, so for almost every other packet you would need to do this mapping again. To avoid that, you need to minimize the number of pages you use, so that a small cache of these page table entries covers everything. That's why we prefer 1 GB huge pages: you can have thousands, millions, of buffers in just a few pages, so you don't need to keep a long list of pages.

That's done by EAL as well. The NICs are also discovered at this point: we look on the PCI bus, look into the Linux file system to see which NICs we have, which CPU socket they are attached to, and which driver should handle them. Of course, these NICs could be handled by the Linux kernel already, so we use the standard Linux mechanisms to remove the NIC from being handled by the Linux kernel and have it taken over by DPDK and handled by the poll mode drivers that DPDK provides. So that's what EAL is doing: it's really just the setup, providing a nice abstraction of the system for the other libraries to use.

Mbuf is the packet descriptor. VPP uses this mbuf as its packet descriptor, and it also appends more metadata to this data structure. I think the DPDK mbuf is 128 bytes, which is exactly two cache lines on Intel CPUs, and I think the VPP metadata appended to the mbuf is 32 or 64 bytes — we can check in the code, it's not a big deal; I think there is a comment in the code about 32 bytes, I didn't count them myself. Anyway, 32 or 64, that's one more cache line that becomes part of the core packet descriptor that VPP uses and that VPP graph nodes have to share and all recognize.
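To make the "two cache lines" point concrete, here is an illustrative descriptor sized to exactly two 64-byte cache lines, mimicking the layout idea of DPDK's rte_mbuf. The field names and the split between the two lines are invented for illustration, not the real rte_mbuf layout:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Toy packet descriptor: 128 bytes = two cache lines on Intel CPUs. */
struct toy_mbuf {
    /* first cache line: fields hot on the receive/transmit fast path */
    void     *buf_addr;               /* start of the data buffer */
    uint64_t  buf_iova;               /* physical/IO address of the buffer */
    uint16_t  data_off;
    uint16_t  data_len;
    uint32_t  pkt_len;
    uint8_t   pad0[CACHE_LINE - 24];  /* fill out the first cache line */
    /* second cache line: fields touched less often */
    void     *pool;                   /* owning buffer pool */
    void     *next;                   /* next segment for chained packets */
    uint8_t   pad1[CACHE_LINE - 16];
} __attribute__((aligned(CACHE_LINE)));

_Static_assert(sizeof(struct toy_mbuf) == 2 * CACHE_LINE,
               "descriptor must span exactly two cache lines");
```

Keeping the descriptor a whole number of cache lines, and grouping hot fields into the first line, is the layout discipline the talk alludes to.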
Mempool is the library implementing this pool of buffers — millions of buffers sometimes. For example, if you use traffic managers you need to buffer a lot of packets for each output port, and you typically need millions of packets in a buffer pool to do that, depending on how many milliseconds of traffic you are planning to buffer per output port. If you are not doing that, you might get away with a smaller mempool; I think the default in VPP is 32K buffers in a mempool — we can check — and I think that's also what we use in the DPDK sample applications as a default.

Ring is the library that provides support for lock-less queues: lock-less rings, lock-less circular buffers. Why is it lock-less? Because it's typically used for communication between distinct threads that are running on different CPU cores: you have both of these threads running in parallel on different CPU cores, and they share the same data structure. A queue, at the end of the day, is just a buffer that is split into entries and handled using some indices; it's shared memory between these two threads running at the same time, as opposed to time-sharing. We could use semaphores, but we don't like to use semaphores, for good reasons: that involves locking, which involves blocking. So we use lock-less queues. It's actually very simple to implement a single-producer / single-consumer lock-less queue: easy to write the code, and you can read the code. There are a few tricks in there, but at the end of the day it's a very small amount of code, with no special instructions, no special hardware support needed. Compare-and-swap is actually not needed for the single-producer / single-consumer case.
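The single-producer / single-consumer case described above can be sketched in a few lines of C. This is a minimal toy, not the real rte_ring: the real implementation adds memory barriers for cross-core ordering, bulk enqueue/dequeue, and cache-line placement of the indices.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8  /* power of two, so index wrap is a cheap mask */

/* Lock-less SPSC circular queue: head is written only by the producer,
 * tail only by the consumer, so no lock or compare-and-swap is needed. */
struct spsc_ring {
    void    *slots[RING_SIZE];
    uint32_t head;   /* producer index */
    uint32_t tail;   /* consumer index */
};

static int ring_enqueue(struct spsc_ring *r, void *obj)
{
    if (r->head - r->tail == RING_SIZE)
        return -1;                          /* full */
    r->slots[r->head & (RING_SIZE - 1)] = obj;
    r->head++;                              /* real code: store-release here */
    return 0;
}

static int ring_dequeue(struct spsc_ring *r, void **obj)
{
    if (r->head == r->tail)
        return -1;                          /* empty */
    *obj = r->slots[r->tail & (RING_SIZE - 1)];
    r->tail++;                              /* real code: store-release here */
    return 0;
}
```

Because each index has exactly one writer, plain loads and stores plus ordering barriers are enough — this is the "no special instructions" point from the talk.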
If you want to implement multiple-producer / single-consumer, single-producer / multiple-consumer, or multiple-everything queues, then you need to use some tricks; there is no easy way out, and typically you would use compare-and-swap instructions. But it's a good point that you don't need them for the simple single-producer / single-consumer case.

The timer library is there for implementing lists of events that you need to trigger — for example, aging entries in a table, things like that. I am actually not a fan of using the timer library on the data plane side. I'm OK using it for control plane stuff, but I usually find a workaround, a different way of doing things, when I need to implement some sort of timer events on the data plane threads, because stopping from time to time to walk a linked list of entries is not a good thing. A linked list is generally not a good thing for data plane processing, because a linked list means you have to chase pointers: you have a data dependency, you cannot skip ahead, you have to traverse it linearly. If you have a lot of entries, there is no quicker way: you cannot visit two nodes at a time, you need to visit the first node just to get the pointer to the second node, so the data dependency here is very strong. So don't use linked lists if you can avoid them. There are examples in DPDK where we use arrays instead. Sometimes you cannot use an array and you do need a linked list; the workaround then is to have a linked list where the nodes are actually arrays — you have a group of, say, four entries in an array, and that group is linked to another group of four entries by a pointer. It's basically a midway point between a linked list and an array, just to avoid the huge penalty of chasing a pointer for every single entry.
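The "array nodes" trick just described — sometimes called an unrolled linked list — can be sketched like this (illustrative names, not a DPDK data structure):

```c
#include <assert.h>
#include <stddef.h>

#define ENTRIES_PER_NODE 4

/* Linked list whose nodes each hold a small array of entries, so we
 * chase one pointer per four entries instead of one per entry. */
struct chunk {
    int           entries[ENTRIES_PER_NODE];
    int           count;           /* how many entries are in use */
    struct chunk *next;
};

/* Count matches of a value; only count/4 pointer dereferences. */
static int chunk_list_count(const struct chunk *c, int value)
{
    int hits = 0;
    for (; c != NULL; c = c->next)          /* one dependent load per node */
        for (int i = 0; i < c->count; i++)  /* four independent entries */
            if (c->entries[i] == value)
                hits++;
    return hits;
}
```

The entries inside one node can be scanned without waiting on a pointer load, which is exactly the data-dependency relief the talk is after.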
So the question is: do you need that for the data plane or for the control plane? If you need it for the data plane, maybe you should think about writing your application in a different way — unless it's thirty-year-old code that nobody understands anymore and you have to use these libraries.

So that's the set of core libraries. The next big thing is obviously the poll mode drivers; this is what DPDK is best known for. It's the driver providing packet I/O: read packets from the input queues of the NIC, write packets to the output queues of the NIC — packet in, packet out. And we have a huge set of devices there: physical NICs, virtual NICs as well, so abstractions for software devices really. Intel NICs, Mellanox, Broadcom, Cisco, and then all sorts of virtual devices like af_packet, ring, pcap; we have a bonding driver as well.

We also introduced crypto devices, which are a sort of poll mode driver for accessing the queues of a hardware accelerator for encryption. We are not dealing with packets here — although you can make the connection to packets straight away — what really happens is that you have a queue of requests, encryption or decryption requests, and a queue of responses coming back from the accelerator. So it's still a poll mode driver: you need to push requests and you need to poll for responses. That's why we have these crypto devices. We have support for the Intel QuickAssist Technology, and for AES-NI multi-buffer, which is a software library using special Intel instructions called AES-NI for the AES symmetric encryption algorithm — NI stands for New Instructions, so these are instructions used to implement AES — and this library pushes as many encryption and decryption requests as possible through the CPU execution units to achieve the maximum performance.

What else? We have these extensions, things that we call extensions to DPDK. We have classification libraries: once you get your packets into the CPU, you obviously need to look them up in some tables. This is what we do all day long in packet processing: we look packets up in tables just to understand what to do next with the packet, what's the next table we need to look the packet up in. We have hash libraries — we actually have several instantiations of those in DPDK, some of them more hidden than others. Hash is exact-match classification: you read an n-tuple, a bunch of fields from the packet — you know exactly which bytes make up your classification key, or lookup key — and you look it up in a table just to get the data associated with that key. That could be a pointer or an opaque index, which is the typical usage you see in VPP, or it could be an array of data bytes associated with that key: metadata that you store for that flow, for whatever that entry represents in your table.

Then you have ACL lookup, access control lists; that's wildcard classification. You read your n-tuple from the packet and you need to match it against a set of rules, as used for example in a security gateway. You have wildcards in there, and these rules can overlap — most of the time they actually do overlap — so the same lookup key will hit several rules in the table. How do you reconcile that? You pick the rule that has the highest priority. In a security gateway, in a firewall, you will see rules listed in descending priority order: the last rule, the lowest-priority rule, will be "drop everything", but on top of it you have other rules that say accept these, deny these, send these packets over there, do this or that, according to rule priority.
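The exact-match usage pattern described above — hash an n-tuple key, look it up, get back an opaque index — can be sketched with a toy open-addressing table. This is illustrative only (FNV-1a hash, no deletion, fixed sizes), not the rte_hash implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64   /* power of two */
#define KEY_LEN    13   /* e.g. an IPv4 5-tuple: addresses, ports, proto */

struct entry {
    uint8_t  key[KEY_LEN];
    uint32_t data;       /* opaque index / metadata for this flow */
    int      used;
};

static uint32_t hash_key(const uint8_t *key)
{
    uint32_t h = 2166136261u;            /* FNV-1a over the key bytes */
    for (int i = 0; i < KEY_LEN; i++)
        h = (h ^ key[i]) * 16777619u;
    return h;
}

static int table_add(struct entry *t, const uint8_t *key, uint32_t data)
{
    for (uint32_t i = 0, p = hash_key(key); i < TABLE_SIZE; i++, p++) {
        struct entry *e = &t[p & (TABLE_SIZE - 1)];
        if (!e->used) {                  /* linear probing for a free slot */
            memcpy(e->key, key, KEY_LEN);
            e->data = data;
            e->used = 1;
            return 0;
        }
    }
    return -1;                           /* table full */
}

static int table_lookup(const struct entry *t, const uint8_t *key,
                        uint32_t *data)
{
    for (uint32_t i = 0, p = hash_key(key); i < TABLE_SIZE; i++, p++) {
        const struct entry *e = &t[p & (TABLE_SIZE - 1)];
        if (!e->used)
            return -1;                   /* hit an empty slot: miss */
        if (memcmp(e->key, key, KEY_LEN) == 0) {
            *data = e->data;             /* exact match found */
            return 0;
        }
    }
    return -1;
}
```

The returned `data` is the "pointer or opaque index" the talk mentions: typically an index into a per-flow state array.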
So you don't have to scan through all the rules: the first match will always be the highest-priority one. That's the way to look at it from the logical perspective; it's not the way it's implemented in the code. The data structure is actually tree-like; it's very similar to implementing a pattern-matching algorithm like Aho-Corasick, where you build a trie of things and you just walk through that trie. The problem on the performance side is that each node in the tree is a memory access, so you need to pack more work into the same lookup, to minimize the number of memory accesses. That means you don't process one byte at a time: you try to process as many bytes as possible per node, per access. I'm not the guy developing that library, so I cannot go too deep into it, but my understanding is that we are building up a tree. From a logical standpoint, though, if you described it to somebody, or implemented it in a naive way, you could just do a for loop: start parsing the rules, matching your key against them in descending priority order, and the first hit wins.

LPM, longest prefix match, is typically used for routing, for implementing a FIB, a forwarding information base. This is what VPP also implements, but the algorithm is different: I think VPP uses an mtrie to implement the FIB, and we are using a different algorithm. We are sometimes very generous with memory: our routing table allows you to get to a resolution with just one memory access for most cases, where the prefix of the route you hit is less than or equal to 24 bits — so /1 to /24 you can do in one access. If your prefix is longer than that, then you might need a second memory access.
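The memory-generous scheme described above is the DIR-24-8 idea behind DPDK's LPM library: a table indexed by the top 24 bits of the IPv4 address resolves most routes in one access, with longer prefixes spilling into a second table. This toy sketch (invented names) implements only the /24 fast path, and it requires longer prefixes to be added after shorter ones, which the real library handles properly via stored depths:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct lpm24 {
    uint16_t *tbl24;   /* 1 << 24 entries: next-hop id per /24, 0 = no route */
};

static struct lpm24 *lpm_create(void)
{
    struct lpm24 *l = malloc(sizeof(*l));
    l->tbl24 = calloc(1u << 24, sizeof(uint16_t));
    return l;
}

/* Install a prefix of depth <= 24 by filling its whole range of /24
 * slots.  Toy limitation: add longer prefixes after shorter ones. */
static void lpm_add(struct lpm24 *l, uint32_t ip, int depth, uint16_t hop)
{
    uint32_t count = 1u << (24 - depth);
    uint32_t first = (ip >> 8) & ~(count - 1);  /* align to prefix start */
    for (uint32_t i = 0; i < count; i++)
        l->tbl24[first + i] = hop;
}

static uint16_t lpm_lookup(const struct lpm24 *l, uint32_t ip)
{
    return l->tbl24[ip >> 8];                   /* one memory access */
}
```

Trading 32 MB of table for a guaranteed single access on prefixes up to /24 is exactly the "generous with memory" design choice from the talk.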
[Audience question about IPv6.] Is your mic working? OK. So no, the short answer is no. What happened historically is that we started by adding support for IPv4, and at some point later we said we should also add IPv6. So we looked at various algorithms, and we realized we could reuse the v4 scheme for the first 24 bits of the IPv6 address, and then we extended it with some other tricks. So the v4 implementation is actually reused for v6, but there is a lot more on top, and the bad news is that it takes many more cycles than before, because it's more memory accesses, not just one. At some point I'll get to study that code — it's somewhere on the list, too many things to do. That's a good observation: as we go along we will borrow concepts from VPP, VPP people will borrow concepts from DPDK, and we'll both improve, we'll both evolve.

OK, extension libraries: packet distribution — load-balancing sort of things — and packet reordering. As we mentioned yesterday, for an elephant flow maybe the most elegant way to handle it is to send packets from the same flow to different workers and then reorder them, so we have a library for packet reordering. [Audience: is that exactly the parallelized-processing model? Do you need both the distributor and the reordering?] You could use both of them. The distributor is just one way to load balance; you could use other schemes for load balancing, like RSS, which is a hardware scheme, or Flow Director, another hardware scheme, or maybe simple software implementations. But not for the elephant flow — yes, you're right, sorry, let me think more about your question. For the elephant flow there is no point in computing a hash or some sort of digest per packet to bifurcate; you just send the next packet to the next worker that is available. So probably not the distributor; it's the reorder library: you simply assign sequence numbers to your packets and then you reorder based on the sequence numbers. [Audience: I looked at the code and there is no mutual exclusion in it; it's not obvious — do you just put big locks around it?] I think the idea with reorder is that the only reason you need to reorder is that you have multiple worker threads; if you only have one thread there is no need to reorder. So if the reordering code were not thread-safe it would be meaningless — I think that's one of the basic requirements — but I'm not that familiar with that library right now; I need to spend more time looking at it, so maybe we can take it offline and I can better understand your elephant-flow use case and the associated information.

IP fragmentation, IPv4 and IPv6, and IP reassembly, IPv4 and IPv6, are present. Then quality of service, for traffic metering and hierarchical scheduling on the egress side: we have a library implementing a five-level hierarchical scheduler in software, with no hardware acceleration, and you can have thousands of queues as the leaves of the hierarchy. I actually use this as an example of how to write code optimally on multi-core CPUs, of the techniques you can use to optimize code.

Packet Framework is one of my babies as well. It provides a way to design your applications without worrying too much about moving packets from A to B. Typically, when you write your code, you start with: receive packets from that queue, then take the packet to the next table and do the lookup, then do your if/else branches, whatever you do, then move the packet to the next table, and so on and so forth, and finally you put the packet into a queue or you drop it.
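The sequence-number reordering just described can be sketched with a small window buffer. This is a toy (fixed window, no timeouts, caller keeps sequence numbers inside the window), not the rte_reorder implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define WINDOW 16

/* Workers tag packets with a sequence number on the way in; this
 * buffer releases them in order again on the way out. */
struct reorder {
    void    *slots[WINDOW];
    uint32_t next_seq;   /* sequence number we must emit next */
};

static void reorder_insert(struct reorder *r, uint32_t seq, void *pkt)
{
    r->slots[seq % WINDOW] = pkt;   /* assumes seq is within the window */
}

/* Emit as many in-order packets as are ready; returns how many. */
static int reorder_drain(struct reorder *r, void **out, int max)
{
    int n = 0;
    while (n < max && r->slots[r->next_seq % WINDOW] != NULL) {
        out[n++] = r->slots[r->next_seq % WINDOW];
        r->slots[r->next_seq % WINDOW] = NULL;
        r->next_seq++;
    }
    return n;
}
```

A packet that arrives early simply parks in its slot until every lower sequence number has been drained.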
So you have to move the packet from A to B all the time. With Packet Framework, the philosophy comes from OpenFlow, which looks at the data plane of a switch as a chain, or tree, of tables that are connected together. So the idea was: can we abstract all the different types of devices that can stream packets — hardware queues, software queues, IP fragmentation queues, IP reassembly queues, traffic managers? At the end of the day they are all just queues that get packets in on one side and send packets out on the other side: maybe not in the same order, maybe with some kind of timeout, maybe you get a lot of fragments in and you send a single reassembled datagram out — it might be a smart queue, but at the end of the day it's still a queue. And then you have all these tables, which are just search algorithms: you have lookup keys, which are arrays of bytes, and you associate some data with those keys.

So we put all these devices that can stream packets under the port abstraction, in the port library; we put all the lookup libraries that we have — ACL, hash, LPM — under the table abstraction library; and since they all look alike and you press the same button on all of them, we can connect them together, and you get the application skeleton for free. You just say: this input port is connected to this table, this table is connected to this other table; then you just need to populate your tables with actions — add these keys to this table, and for this entry in the table do these things and send the packet to whatever comes next. The idea is that the whole skeleton is built for you, and then you just call a function like "run the pipeline", and the packets seamlessly go through the pipeline without you having to write all this glue code every time, and maybe making mistakes and debugging your application.

Obviously — sorry — Packet Framework is just a methodology to connect these things together. VPP actually has a huge library of protocols that are already implemented: if you look into the vnet folder of VPP you'll see a huge list of supported protocols, and that's actually where the gold is. You won't find that in Packet Framework; it is just a tool to chain things up, but you still need to implement your own protocols. We do have some examples for routing or flow classification, but they are just examples; we don't position those as a stack. VPP is just that: if you need the stack, VPP is where you need to look. If you need to write a simple or a complex application from scratch, and you have reasons not to use VPP, then you could use Packet Framework — or maybe you could use both; there are actually people looking at using both together.
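The "tables connected to tables, then run the pipeline" idea above can be sketched in miniature. All names here are invented; the real librte_port / librte_table / librte_pipeline APIs are far richer:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct packet { uint32_t key; uint32_t out_port; };

/* A table is a lookup step: it acts on the packet and returns the
 * index of the next table to visit, or -1 when the packet is done. */
typedef int (*table_fn)(struct packet *p);

struct pipeline {
    table_fn tables[4];
};

/* Generic engine: walk one packet through the chained tables. */
static void pipeline_run(struct pipeline *pl, struct packet *p)
{
    int t = 0;
    while (t >= 0)
        t = pl->tables[t](p);
}

/* Example tables wired together: classify, then route. */
static int classify(struct packet *p) { return p->key < 100 ? 1 : -1; }
static int route(struct packet *p)    { p->out_port = p->key & 3; return -1; }
```

The application only supplies the per-table logic and the wiring; the movement of packets between tables — the "glue code" the talk mentions — lives once, in the generic engine.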
Info
Channel: FD.io
Views: 17,170
Rating: 4.6455698 out of 5
Id: 0G6u409cSos
Length: 35min 42sec (2142 seconds)
Published: Mon Jul 25 2016