[ENG] Marian Marinov: Comparison of eBPF, XDP and DPDK for packet inspection / #LinuxPiter

Captions
[Music] So, I'm the Chief System Architect of SiteGround. We are a hosting company, but what I am going to present today is not anything directly connected to the company, and nothing that we actually use in production right now. I'm organizing the biggest open source conference in Prague, around 3,000 attendees, and since I'm a very big open source fan, this is why we will be playing open source games today. So the talk is a comparison of eBPF, XDP and DPDK. Who here doesn't know any of those three things? Perfect. I hope you understand a few of the ideas around them; I will compare them and explain what I have done this year, because in January I decided that I want to do this for our company, or at least to have it as an open source project.

So why do we need to compare these things? Because currently DDoS attacks, DoS attacks and floods are a big issue for our company. We have a lot of attacks — for me this is a lot — and some of these attacks cannot be easily mitigated. The trend is slowing down, because we are handling them better these days, but still 31 attacks resulted in service disruption for our company, for our customers, which is a big problem for us, and I want to solve it. The problem is that today you can solve this only with big hardware, and I'll explain this a little bit. The attacks are sometimes very simple: 20,000 packets per second and you get 30 cores down on one server. That's a big, big issue. The server can handle this, but 20,000 packets per second I can generate on a normal laptop, maybe even on a phone. So this is the problem: everyone can generate this attack and bring down 30 Xeon CPU cores, and there are scripts to do it, so script kiddies — children — can do this to you and disrupt your service, which is a big, big issue.

Buying additional bandwidth usually solves the problem of receiving the attack. Then, after you receive the attack, you need to scrub the traffic and allow only the clean traffic that is supposed to reach your servers to actually reach them. At this point things become very expensive. Otherwise, if you don't want to handle this traffic at all, you hand it off to external scrubbing services, and I'll explain why this is a no-go for some companies like ours.

So first, let's say we decide to invest in hardware and bandwidth. The problem is that our company has eight different locations worldwide, different data centers, and not every data center is okay with investing a few million dollars in a device that would be used only by us. Or, if they have invested in such a device, it is shared with other customers, which is a problem, because you cannot get a dedicated 10 gigabit or 40 gigabit port. And these devices are not hundred-gigabit; most of them are ten, or multiples of ten up to 40 or 50 gigabit, and 100 gigabit is extremely rare even in very expensive data centers. So if the attack is larger than the capacity of this device, what happens is that the attack results in null-routing the attacked IP — again, service disruption for the customers. The other problem is that even if you can handle the traffic, sometimes it disrupts the connectivity in your rack, or in all of the racks, because it's maybe a 200 gigabit attack towards your servers: you get 200 gigabits to the edge switches, but this overflows the uplinks of the edge switches, so you don't get the traffic to the rack at all. You have to solve this somewhere, somehow. You can move this traffic completely out of your network by going, for example, to Cloudflare.
There are other companies like that as well; you point your DNS to them, they receive all that traffic and clean it, and only valid traffic reaches your servers. Controlling your DNS at this point becomes a problem, because you now have to configure it via APIs. If you have one or two domains that's not a big issue, but we are a hosting company, we have millions of domains, and actions on these APIs usually take a lot of time. So now, instead of running a sed command on a few servers or a few hundred servers and fixing things ourselves, we have to wait for an API call to finish and hope it actually finished. This is not suitable for large hosting companies like ours.

So what we decided to do is: let's build a VM. Why a VM? Because I can deploy a VM in Amazon, I can deploy a VM in Google, I can deploy one in Alibaba or wherever I have something that supports a VM, so I can scrub traffic everywhere. If I build it on hardware it will also work, but unfortunately I would need specific hardware, and I'll talk about this in a bit. So what I wanted was at least 10 gigabits of bandwidth per VM and at least eight million packets per second of scrubbing power per VM, and I wanted to do that in Linux, which is fine if you know what you are doing. Scrubbing UDP DNS and NTP traffic was the most important part for me, because I don't have any other services running on UDP. Then scrubbing TCP traffic with SYN cookies was also important, because I want to scrub the traffic on the filter, not on the servers themselves — the servers may have only ten gigabits, while the VMs can have multiples of ten gigabits. And the wishful-thinking part is caching the HTTP responses; we're still thinking about this.

So the problem is the Linux networking stack in general. It looks like this; we will focus only on the part that is forwarding, this thing here. This is the general picture of the networking stack — actually, this is the networking stack of Linux: your packet starts here, and then it traverses all of this before the whole packet goes out of your machine. Please, if you have questions, ask them immediately — stop me and ask, so I can respond with the proper slide right away. Okay, don't be shy. What we are going to focus on is this part here: you have a packet that is received, then you go into your driver, and there you have XDP, which can be combined with eBPF programs, and only after that you allocate the SKB for the rest of the network stack. At this point you have spent at least around 20-22 cycles; after this point a lot more cycles happen, so everything from here on is slow as hell. I will be going back to this picture a few times — you'll understand why.
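To make that hook point concrete, here is a minimal, illustrative XDP program in C — a sketch of my own, not code from the talk. It runs inside the driver, before any SKB is allocated; the unconditional XDP_DROP verdict, the file name and the section name are arbitrary and only demonstrate the mechanism.

```c
/* xdp_drop.c - illustrative only.
 * Build with something like: clang -O2 -g -target bpf -c xdp_drop.c -o xdp_drop.o */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
    /* At this point no sk_buff has been allocated yet: we only see the
     * raw frame in the driver's RX ring via ctx->data / ctx->data_end. */
    (void)ctx;

    /* XDP_DROP throws the frame away immediately, which is why dropping
     * here is so much cheaper than dropping in iptables, nftables or TC
     * further up the stack. Return XDP_PASS to hand the packet onwards. */
    return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```

Such an object file can typically be attached with iproute2, e.g. `ip link set dev eth0 xdp obj xdp_drop.o sec xdp`, though the exact loader invocation depends on the kernel and tooling version.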
So in 2018 Cloudflare published a very nice article about how to drop ten million packets per second. I confirmed the results and actually reused their slides in my presentation, because I didn't have time to redo the numbers in my home lab. iptables can drop around two million packets per second. The note here is that this is with only a single rule in the raw PREROUTING chain; if you add more rules you immediately see a degradation of this performance. Two million packets per second seems like a lot, but when you have a DDoS, two million packets per second is small. So if you configure your network in a way that lets you receive all this traffic on your end nodes, the hosting machines, you would be able to drop at most two million packets per second with only one rule. But what would that rule be? The /24 that is attacking you? The /12 that is attacking you? What would you put there? Giving it multiple entries is a problem, simply because in this picture you are in this forward filter here — okay, sorry, in this PREROUTING filter here — where you have already done conntrack, and that costs a few cycles; I don't remember how many, but it's not cheap.

Even if you decide to use ipset — who here knows ipset? For the rest of you: it is a way to put hashes or arrays of IPs or prefixes into a single rule in iptables, so you can skip the whole problematic accounting of iptables and walking through all the rules. Instead of having maybe 40,000 rules in one chain, you now have one rule that matches all these prefixes, which is very nice. There are different set types in ipset, but generally this is the idea. So how fast is it even with ipset? This is from the Cloudflare blog post: if you're dropping with iptables in PREROUTING you get 1.6 million packets per second, and if you're doing it with nftables it is even slower — I still don't know why nftables is slower, it's supposed to be faster. And if you're using traffic control, TC, to drop the packets, you can see that with TC you can reach around 2 million packets per second. Yes? [Audience question: is that TC with ipset?] No, this is TC directly with drop — no eBPF here, simply traffic control with a u32 filter. Okay. And then here is what happens when you use XDP: it is this fast simply because you're dropping already here, in the driver, and not going through this chain. The demo code from Cloudflare is here; I have links at the end of the presentation that you can go through.

So even though I knew about XDP, I decided to be a smart ass and write an iptables module to fix my problem. Why a new iptables module? Because I wanted to do layer 7 deep packet inspection: I wanted to get into the UDP packet and check the domain names and the actual requests that people were making. Doing this with the iptables modules that are currently available is very, very expensive — with those I could do maybe 50-60 thousand packets per second, which was very bad. I managed to reach around 260-280 thousand packets per second with my custom iptables module, but this again is too little for larger attacks. So I wanted to do something better, and I decided to use eBPF. Who here knows eBPF? Okay — extended Berkeley Packet Filter. It is a way to talk to the kernel in different ways: you can either trace different processes or parts of the kernel, or, in our case, load a program that will drop packets from the network. You have access to kernel functions and you can do some limited actions there. eBPF programs are small programs that must have a way to finish — it's not possible to have endless loops inside them — so they're fast and proven to terminate. You load them into the kernel and they do some work for you. There is a nice presentation about this from Daniel Borkmann at FOSDEM 2016; I was at that presentation, so I thought: okay, I know eBPF, I'll do that. And I implemented what I had done in the iptables module in eBPF. This time it is eBPF attached to traffic control, so TC eBPF.
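As a rough illustration of "eBPF attached to traffic control" — again a sketch of mine, not the speaker's module — here is a cls_bpf-style program that drops packets whose source IPv4 address appears in a hypothetical hash map called `blocked_src`. The real filter described in the talk did far more work, matching hostnames inside the UDP payload.

```c
/* tc_drop.c - illustrative TC eBPF classifier, not the speaker's code. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical map of source addresses to drop, filled from user space. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);   /* IPv4 source address, network byte order */
    __type(value, __u8);  /* value unused; presence means "drop"      */
} blocked_src SEC(".maps");

SEC("tc")
int tc_drop(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    /* Drop the packet if its source IP is listed in the map. */
    if (bpf_map_lookup_elem(&blocked_src, &ip->saddr))
        return TC_ACT_SHOT;

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

With a recent iproute2 it would be attached on ingress with something like `tc qdisc add dev eth0 clsact` followed by `tc filter add dev eth0 ingress bpf da obj tc_drop.o sec tc`.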
I managed to do three hundred twenty to three hundred fifty thousand packets per second, but nothing more than that, and this was with two thousand domains in the eBPF map that I used for checking and comparing the traffic. With a single domain — I don't remember exactly, but I think it was around a million packets per second. And in order to get these results, you have to pin your NIC queues to certain CPUs and keep your user space software off those CPUs, so CPU affinity and cpusets were very important for making it faster. That is a problem in cloud environments, where you don't know if these CPUs are actually dedicated to you — the virtual machine gets something, but what it is you don't know.

So after that — yes? [Audience question about UDP checksums.] Yes, what I did when I was inspecting the packets was skip the UDP checksum entirely; I was hoping that the packet would be correct. This can be fixed by adding the UDP checksum checks, but I haven't tested it — I was simply hoping for the best. During my tests the packet has to have a valid UDP checksum; otherwise you need more checks on the data when you inspect the packet content, and that is a problem if the packet is broken. Alright. [Audience question: what is the performance degradation if we enable checksums?] I really don't know, because I didn't implement it, sorry. But it would be — give me a second — maybe around 20 cycles; it can be less depending on how you implement it, because if you implement it with AVX instructions, which is a hackish implementation, you can do it in maybe 10 cycles, but I'm not sure, it depends. And this is cycles per packet, right. Other questions? Okay.

So then I decided: okay, I know something that is even faster than eBPF, and it's called DPDK, the Data Plane Development Kit. It's very nice. A few years ago Intel decided that Linux networking is slow — which it is — so they decided to make it faster by simply skipping the whole networking stack here and going directly to the application itself. You can either connect an application to the network card or a virtual machine to the network card, which allows you to be very fast: you skip all these things here and you create a data plane that forwards all of your traffic directly from the NIC to your application, or from your application out of the machine.
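For contrast, this is roughly the shape of a DPDK receive path — a heavily simplified sketch under my own assumptions (single port, single RX queue, no TX side, no error recovery), not the speaker's code. The application takes over the NIC through the EAL and busy-polls it; here every received mbuf is simply freed, which is the DPDK equivalent of a drop.

```c
/* dpdk_rx_drop.c - illustrative sketch only.
 * Build against DPDK, e.g.: gcc dpdk_rx_drop.c $(pkg-config --cflags --libs libdpdk) */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS    8191
#define MBUF_CACHE   250
#define BURST_SIZE   32

int main(int argc, char **argv)
{
    /* The EAL takes over the cores and the DPDK-bound NICs. */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialization failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS,
            MBUF_CACHE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (!pool)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    uint16_t port = 0;                    /* first DPDK-managed port */
    struct rte_eth_conf conf;
    memset(&conf, 0, sizeof(conf));

    /* One RX queue, no TX queues: this sketch only receives and drops. */
    if (rte_eth_dev_configure(port, 1, 0, &conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port setup failed\n");
    rte_eth_promiscuous_enable(port);

    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        /* A real scrubber would inspect the packets here; this sketch
         * just frees them, i.e. drops everything it receives. */
        for (uint16_t i = 0; i < nb; i++)
            rte_pktmbuf_free(bufs[i]);
    }
    return 0;
}
```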
At that point I knew I had to order a few NICs, so I ordered one Intel and one Solarflare — and I had to explain myself for that. With both of them I managed to drop everything I put on the wire towards the card, everything below the 10 gigabit limit of the port itself. I haven't tested with the 40 gigabit NICs — again, sorry, I didn't have the time to prepare my test setup at home. The Solarflare has a very nice extra: you can put code on the card itself — please don't ask me how, because I cannot tell you on the microphone, sorry about that. So DPDK was nice: you write C code, which is understandable for me, but it comes with the complexity of a strange network architecture that you now have to use, and strange functions to go with it. This became a problem in my team: when I shared my idea with them, they were like, "DPDK? What? New networking? No, this is not very good." So they didn't want to participate in supporting DPDK at all. Also, DPDK requires specific hardware, which was a problem for me, because I don't know if I would be able to buy this hardware in every data center in the world — and in Amazon and Google, for example, you don't know what hardware you run on, for the networking I mean. So I wanted something that works better.

Then I had a talk with a friend of mine, and he told me: "You know, if your team is scared of DPDK, you can write it in P4." And I was like: what the hell is P4? It is another programming language that you can use to build DPDK programs, which was nice — it allowed me to easily update the logic and also the content of the DPDK programs. Unfortunately, again, my guys said: "P4? For what?" So no, we didn't go this way. I simply tested it; it worked fine, but it takes a lot more time to change anything in your program, and testing it is more like putting it on the wire and seeing what happens. I don't know — I'm not an expert in DPDK, so I really don't know how to properly debug DPDK programs. This was a problem, and everyone in my team agreed we should avoid it.

So we ended up using XDP, because obviously it works. It's here, it's fast, it works in the same way that DPDK does, but it can be transparent for the user applications, and you still use the fast path of the networking stack as much as possible. It's also supported by many drivers now. It was implemented by Jesper Brouer, and when I asked him about it a month ago at Linux Plumbers, he said: "I implemented it simply because the DPDK guys told me it cannot be done in the kernel." I'm really happy that Jesper took that challenge and made XDP. Two years ago it was only available for a few drivers; now most of the drivers in the Linux kernel support XDP. It can be a single program that you install there, or it can be combined with eBPF programs that extend the functionality you want.

So what I did was create a filter similar to the one in the Cloudflare demo code, but instead of having a single prefix there, this filter used an eBPF map: when a packet arrives, you first check the IP it's arriving from against a map, and then the second step is to extract the actual data from the packet and compare it — in my case the domains, the host names that are requested — with the host names entered in the map itself. After that I needed a tool that can update the eBPF map from user space. eBPF requires two things: you need a program that is loaded inside the kernel to do the actual work there, and you need a user space tool that talks through a file descriptor directly to this eBPF map. The kernel and the user space tool share only this, let's say, memory range in the kernel — the eBPF map — and that's the only thing they share. The user space tool is used to load or delete host names from the map, so I can update on the fly what the kernel filter matches. Once I had the user space tool, I actually had a UDP filter, and that was all I needed.

The setup for this VM is pretty simple. You have an IPv4 address that is under attack; you move this IPv4 address to the VM that is filtering your traffic and announce it from there — by BGP, for example, in my case — and then all the traffic from your routers to this IP goes to this VM. On the VM you simply add routing for this IP, a static route to the machine that actually hosts it. So you filter on this VM, and the clean traffic goes directly to your machine. It's pretty simple: I had two shell scripts to do this for me, one to move the IP onto the machine and one to create the routing back. This is only an incoming filter, not an outgoing filter — at least in my case I really don't need an outgoing filter.
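Here is a condensed sketch of what the kernel side of such an XDP UDP/DNS filter could look like — my illustration under assumptions the talk does not spell out (a hypothetical `allowed_src` hash map, DNS on port 53, no IP options, and the actual hostname matching reduced to a comment), not the speaker's actual program.

```c
/* udp_scrub.c - simplified sketch of an XDP DNS/UDP scrubber; map names,
 * sizes and the pass/drop policy are illustrative, not the speaker's code. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical map of source addresses that are currently allowed. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u8);
} allowed_src SEC(".maps");

SEC("xdp")
int xdp_udp_scrub(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;
    if (ip->ihl != 5)               /* keep the sketch simple: no IP options */
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    /* Only scrub DNS traffic; everything else goes up the stack. */
    if (udp->dest != bpf_htons(53))
        return XDP_PASS;

    /* Step 1: check the source IP against the map filled from user space. */
    if (!bpf_map_lookup_elem(&allowed_src, &ip->saddr))
        return XDP_DROP;

    /* Step 2 (omitted here): walk the QNAME labels in the DNS payload,
     * rebuild "hostname.tld" and look it up in a second hostname map,
     * dropping queries for names that are not hosted on this IP. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

The user space tool described above would then simply call bpf_map_update_elem() / bpf_map_delete_elem() (via libbpf or bpftool) on the map's file descriptor to add and remove entries while the filter keeps running.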
So now — yes? [Audience question about the eBPF maps: an array, or an array of maps?] Yes, it depends. First I started with a very small size that could hold only 64 entries. In order to handle 2,000 domain names I actually needed four maps at the moment, of 512 entries each. I could do it with a single map, but if you experiment a bit with eBPF you'll see that you can effectively bind the maps to particular CPUs, so I wanted to split the maps between CPUs: when I have, for example, 12 or 24 cores, I want to be able to use all of those cores, so I set which CPUs each map is used on — this map only on these four CPUs, the other map on the other four CPUs. This way I get better performance than with a single map, because with a single map all the CPUs have to agree on it; now I'm limiting the number of CPUs that have to agree before you get the data. Why don't I use the per-CPU map? Because then I would have one value that I'd have to copy for each CPU. If you have two thousand entries — yes, it depends on the size of the map you want: if you want bigger maps, you have to split them; with smaller maps it's fine.

The idea of the UDP scrubber is pretty simple: you take the data from the packet and check whether it matches something in the map, and by something I mean the TLD and the hostname. Beyond that I don't care about subdomains — whatever comes before. So I match the TLD and the first label next to the TLD in the domain, and since I know the TLD can be something like .co.uk and the like, I have a pretty simple hack there that treats everything within the last six symbols up to the dot as the TLD. That was just a little bit tricky when I started scrubbing the traffic. The next thing I really want to do is cache each response with a TTL; this would let me filter directly all the traffic going to the DNS server on the hosting machine. One interesting thing here: if you only proxy, only cache the DNS requests, you don't actually filter all the attack traffic. If I create traffic with random source ports, random source IPs and random host names, all of it will hit the DNS server, because it's not cached — it cannot be cached, because it's random, right? So the better approach for me was to simply filter the requests: if they're not asking for something that is hosted on that IP, I shouldn't let them through.

Then the TCP scrubber — I haven't done this at all, because two months ago I had to stop working on this project. The idea is pretty simple: you match the destination port, and if it is not in the allowed list, you drop SYN packets to that port; everything else has to go through SYN cookies. Here is the problematic part: when you have multiple such VMs, you have to synchronize the SYN cookies between them, so you have to use a hash that is predictable; and if you also want to do conntrack, you have to share the conntrack state between these servers. Again, this is easily done, but it's not implemented. The idea was that most of the TCP DDoS attacks we see are SYN floods, so TCP SYN cookies should be okay; and if you drop the first SYN, the client has to retry that SYN packet again towards the server that was supposed to receive it.
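The TCP scrubber was not implemented, but the port-list part of the idea would look roughly like this in XDP — again a sketch of mine with a hypothetical `allowed_tcp_ports` map; the SYN-cookie handling and the shared state between VMs are deliberately left out.

```c
/* tcp_scrub.c - sketch of the *unimplemented* TCP scrubber idea:
 * drop SYNs to ports that are not on an allow-list. Illustrative only. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Hypothetical allow-list of destination TCP ports (host byte order). */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u16);
    __type(value, __u8);
} allowed_tcp_ports SEC(".maps");

SEC("xdp")
int xdp_tcp_scrub(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return XDP_PASS;
    if (ip->ihl != 5)               /* keep the sketch simple: no IP options */
        return XDP_PASS;

    struct tcphdr *tcp = (void *)(ip + 1);
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    /* Only initial SYNs are policed here; established traffic and the
     * SYN-cookie handling would live elsewhere. */
    if (tcp->syn && !tcp->ack) {
        __u16 dport = bpf_ntohs(tcp->dest);
        if (!bpf_map_lookup_elem(&allowed_tcp_ports, &dport))
            return XDP_DROP;
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```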
Testing this setup was a problem, because generating so many packets is not something you can do with ping or whatever tool you have; you have to use the kernel packet generator, but it has its limits. So last month I talked with Jesper at Linux Plumbers, and he told me he has a patched version of pktgen — there is a link at the end of the slides — that can generate a lot more than ten million packets per second if you have more than a 10 gigabit interface, which was very nice. I also patched that tool, because it works with UDP only, so that it generates TCP packets for the TCP SYN flood. Now, how do you get from a 10 gigabit VM to 200 gigabits? It's pretty simple: you use equal-cost multi-path on your routers and you are all set. I did that directly on my switch, because my test setup has 40 gigabit uplinks and the switch I'm using supports BGP, so I simply used ECMP there, straight to the machines I was using. Any questions? These are only the links — you will see them in the presentation. So, questions?

[Moderator] Okay, thank you very much for the presentation. Before we start with the questions, I forgot to mention that we have some prizes for the best question — it's up to you to pick one. I just want to say that it's easy to meet him through me, because he's working in our company. Okay, so please raise your hands if you have any questions, and if you are somehow shy to ask in English, we can translate. The first question: can we use not a VM but real hardware?

Yes, you can use real hardware — you'd get even better performance there, because you would not go through virtio-net. One important thing I didn't mention: if you are using VMs, definitely use the virtio-net driver, not anything emulated. With real hardware it would be faster, it would be better, but you would have to have the dedicated hardware sitting around for this, and not everyone keeps 40-core machines with two or four 10 gigabit interfaces lying around — I don't. Okay, thank you.

[Question] Could you say which kinds of DDoS attacks you tested — for example, what about packet fragmentation or broken packets? And what about the performance of the kernel-bypass stacks — which measurements did you do in these cases?

So, in our case, in most data centers we don't handle our own networking; usually this is done by the data center provider. When we get attacks that are more than 10 gigabits, they cannot reach our network interfaces, because the machines have — okay, they have two different NICs, but one is private and one is a public 10 gigabit interface, and that's all we have. So if we get an attack larger than 10 gigabits, we cannot measure it; we rely on the ISPs to tell us how many gigabits the attacks were. This was one of the reasons I really wanted to build our own filter: so we can bring the networking in the data center onto our own equipment and actually measure these things. The attacks we get are usually UDP floods, and a lot of the time they target ports we don't even have services on, which is plain stupid — you can drop all of that with a few lines of eBPF and never worry about it, like having a firewall in front of your firewall that is 1,000 times faster than your firewall. The next biggest thing is TCP SYN floods, and the last one is actual HTTP requests from devices all over the Internet. For that last kind, what we usually do is this: there is a pattern in
the requests they are making, so we take this pattern — a pattern in the text of the HTTP protocol — and drop it on the web servers themselves. However, when the attack is more than 5-6 gigabits per second, those servers cannot handle the traffic, so what we usually do is create an array of web servers with equal-cost multi-path between them, push the traffic across those, and only then on to the actual web server. There we filter with the actual nginx web servers. Thank you very much.

[Moderator] Thank you. It appears we have run out of time for the session, and I suggest you continue in the discussion zone. Who asked the first question? The first guy? I don't know — okay, yeah, for you. Okay, so thank you very much.
[Music]
Info
Channel: OSTconf
Views: 1,413
Rating: 4.8947368 out of 5
Keywords: Linux, LinuxConf, Linux Piter, LinuxPiter, Linux Piter 4, LinuxPiter4, Linux Piter #4, Linux conference, conference, conference for programmers, Linux programming
Id: 25AxIR7C3MM
Length: 40min 48sec (2448 seconds)
Published: Mon Feb 03 2020