Netdev 0x13 - XDP based DDoS Mitigation

Captions
Hi everyone, my name is Arthur, I'm a systems engineer at Cloudflare, and I'm here to tell you about how we migrated our DDoS mitigation system, our denial-of-service attack mitigation system, to XDP.

A bit about us: we're a fairly big content delivery network, so we serve lots of cat pictures. We have about 170 points of presence around the world where we serve the cat pictures, and we do about 10 million HTTP requests per second. So we end up seeing a lot of traffic, a lot of malicious traffic, and a lot of denial-of-service attacks. We've had our new XDP-based solution in production for about six months now.

A bit about what a denial-of-service attack is and what we see on our end. It's pretty much a constant occurrence, just part of doing business for us: we're under denial-of-service and distributed denial-of-service attacks every day, all the time, and there's always an attack somewhere. For the purpose of this talk we're mostly going to focus on layer 4 denial-of-service attacks, because that's what I work on, and we're going to forget about L7 because those are much more complicated. The two major kinds of attacks we see at layer 4 are TCP flood attacks, where we get big floods of SYN or ACK packets, usually with a spoofed source address, and UDP amplification attacks, which are mainly either DNS or memcached, which you saw about a year ago when there was that big bug. One notable point is that we use anycast everywhere: all of our data centers advertise the exact same IP ranges, so distributed denial-of-service attacks tend to hit us in a very distributed way. We tend to see attacks in most of our data centers at once, and it's rare that an attack singles out a single location.

Here are two examples of what we see on our end, a global view inside Cloudflare of the traffic we see when we get attacked. As you can see, attacks are pretty obvious and hard to miss, which makes our life easy. The top graph is what happened when the memcached UDP amplification attacks hit: we got about 800 gigabits per second of junk traffic, and at that point normal traffic levels are pretty much just noise in the graph. We see the same thing in the bottom graph, a TCP SYN flood that hit us: it got up to about 300 million packets per second, and once again normal traffic is just noise. So at least the attacks are easy to detect.

Our mitigation pipeline has three major steps. Packets come in and hit an edge server, over anycast. We sample incoming packets and send a small fraction of them to a centralized system, which aggregates all the traffic it receives and computes metrics to figure out which traffic is malicious, which traffic is a denial-of-service attack. From that it produces rules describing which incoming packets are attack packets, and pushes those back out to the edge. It's important to note that we need to sample packets before we drop them, because we want to keep the feedback loop going even while we're dropping, so we know whether the attack is still ongoing.

That's pretty much it. As for rules, we tend to have one rule per attack signature, and sometimes attacks have multiple signatures, so there can be hundreds of rules active at the same time, and they all need to be matched. We only let a packet through if it matches none of the rules we have deployed right now.
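To make the rule concept concrete, here is a minimal sketch of what one of these deployed rules could look like as a data structure. The field names and layout are purely illustrative assumptions, not Cloudflare's actual schema; the talk only tells us that a rule carries a classic BPF filter and, as described further below, an optional set of target data centers.

```c
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter: one classic BPF instruction */

#define MAX_CBPF_INSNS  256
#define MAX_TARGET_DCS  16

/* Hypothetical representation of one mitigation rule pushed to the edge. */
struct mitigation_rule {
    uint64_t           id;                          /* rule / attack signature identifier */

    /* Classic BPF program describing the attack signature. */
    struct sock_filter filter[MAX_CBPF_INSNS];
    uint16_t           filter_len;                  /* number of cBPF instructions */

    /* Data centers this rule is scoped to; zero entries means "everywhere". */
    uint32_t           target_dcs[MAX_TARGET_DCS];
    uint16_t           n_target_dcs;
};
```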
Our rules have two key parts. The first is a classic BPF filter. Not extended BPF: classic BPF. The nice thing about it is that it's very flexible, it lets us match on any part of the packet, and classic BPF is great for that; that's all it does, it matches packets. There are lots of tools to generate classic BPF. We have our own bpftools project on GitHub, which lets us generate classic BPF matching DNS queries, so we can say that any DNS query for foo.example.com should match. p0f is a small DSL for describing IP and TCP header options and the order they come in, which is useful because attackers are usually kind of dumb: they reuse the same software, so some of the header bits are always the same. And libpcap, for tcpdump, also generates classic BPF, which gives us even more filters. We can also run classic BPF in lots of places: in the kernel with SO_ATTACH_FILTER, in iptables with xt_bpf, and in user space, where lots of languages have interpreters or VMs for it. The second part of a rule is the target data centers. I said most of our attacks are distributed all over the world, but sometimes we do get attacks that focus on one data center, so in some cases we want the ability to scope rules to specific data centers.

Now, on to XDP. I'm sure most of you have heard of XDP before, but here's a quick recap. XDP uses eBPF, extended BPF, and one of the main features we rely on is helper functions: we can call into the kernel from BPF to have it do some work for us. The main one we use is tail calling, which lets us chain eBPF programs together, so one program can decide it's done processing this packet and hand it on to the next eBPF program. There's also the kernel verifier, which checks that our eBPF program is correct, that it doesn't make out-of-bounds packet accesses or read random kernel memory. That's great, but it comes with lots of limits. The verifier imposes strict complexity limits on the eBPF code we get to run: we're limited to 4,096 instructions right now, and you can only do 32 tail calls in one go. There are also limits on stack usage and on how many branches you can have, and these are all things to keep in mind. XDP itself is really just a mechanism to attach a single eBPF program to an interface, and that program, along with the chain of programs it tail calls into, runs for every packet we receive. The two actions we care about are XDP_PASS, where the packet continues to the normal networking stack, and XDP_DROP, where it just disappears.

So, remembering what our mitigation pipeline looks like, at first glance it seems like we want XDP to do all the dropping. That seems perfect: we have a rule, XDP drops, everything's great. The one thing to remember is that we need to sample before we drop, so a requirement of dropping packets in XDP is that we can sample them in XDP as well. Here's what we imagined it would look like (I've drawn little dots over the XDP parts): a sampler program would somehow sample packets and send them off to the centralized system, and then it would tail call into our dropping program, which would either pass the packet on up the stack or drop it. Seems simple enough.
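As a rough illustration of that chaining, here is a minimal sketch of a sampler program handing the packet to the next program in the chain through a program array. The map name, index, and overall structure are assumptions for illustration, not the production layout.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Program array used to chain XDP programs; slot 0 would hold the dropper. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} chain SEC(".maps");

SEC("xdp")
int sampler(struct xdp_md *ctx)
{
    /* ... sample a small fraction of packets here ... */

    /* Hand off to the dropper. If nothing is attached at this index the
     * tail call is a no-op and we simply fall through to XDP_PASS. */
    bpf_tail_call(ctx, &chain, 0);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```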
First, the sampling. Before tail calling into the next program, the rule-dropping program, we need to somehow copy the packet out: we need some kind of side channel to submit our sampled packets outside of the whole XDP program chain. Thankfully we only need a really low sample rate. As we saw in the graphs, attacks tend to be really big, so even tiny sample rates serve us well. It turns out there is a kernel BPF helper called bpf_perf_event_output which lets you put whatever you want into a perf event, and it turns out you can put the whole packet in there as well. So we can put the whole packet into a perf event, which ends up in a perf ring buffer, and then we can read that from user space. This has the nice property of degrading gracefully: perf is meant to be a lossy thing, so if you put in too many events they just get lost along the way, and for sampling that works great. If we get a huge attack and our sample rate is set too high, the extra samples will simply be dropped, and that's fine.
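Here is a minimal sketch of that sampling helper in use, pushing a small metadata header plus the packet bytes into a perf event array that a user-space reader drains. The map, struct, and sampling rate are assumptions; the interesting part is encoding the packet length in the upper 32 bits of the flags argument so the helper appends the packet data itself.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Perf event array the user-space sampler reads from (one ring per CPU);
 * the loader typically sizes it to the number of CPUs. */
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(__u32));
} samples SEC(".maps");

/* Small header stored in front of every sampled packet. */
struct sample_meta {
    __u16 pkt_len;
};

SEC("xdp")
int sample(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    __u64 len      = data_end - data;

    /* Sample roughly 1 in 4096 packets (illustrative rate). */
    if (bpf_get_prandom_u32() & 0xfff)
        return XDP_PASS;

    struct sample_meta meta = { .pkt_len = (__u16)len };

    /* The upper 32 bits of the flags argument tell the helper how many bytes
     * of the packet to append after `meta`. The perf ring is lossy: if user
     * space cannot keep up, the extra samples are simply dropped. */
    __u64 flags = BPF_F_CURRENT_CPU | (len << 32);
    bpf_perf_event_output(ctx, &samples, flags, &meta, sizeof(meta));

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```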
So, sampling done. How do we drop things, though? Remember that our rules have two main parts, and the first is the classic BPF filter, which is a problem: we can't just evaluate classic BPF from eBPF, there's no magic helper for anything like that. That means we need to compile our rules in. You could imagine having one generic eBPF program with the rules stored in a map that we look up, but with classic BPF that doesn't work. And since we have to build the rules in, the complexity limits start being a challenge: the more rules we have, the more complex the program, and the closer we get to the limits the kernel imposes on us. We also have the target data center thing to deal with. Compiling eBPF can be a bit faffy, because you need to generate the code and then usually compile it with clang, and we really don't want to do that across our whole edge, on thousands of servers at the same time; it's error-prone and wasteful. So we want to compile one ELF, distribute it everywhere through a key-value store, but still be able to specialize the ELF, or do something to restrict which rules are enabled in it.

This is roughly what our C template looks like once the rules have been generated into it. We have our main XDP entry point, and it's pretty simple: if the first rule matches, drop the packet; carry on like that through every rule; and if no rule matches, pass the packet, it must be fine.
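A minimal sketch of that generated template might look like the following. The rule bodies here are stubs standing in for the code emitted from each rule's classic BPF filter, and all names are illustrative.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Each rule_N body is emitted by the classic-BPF-to-C compiler; these stubs
 * only exist to show the shape of the template. */
static __always_inline int rule_0(struct xdp_md *ctx) { /* generated match code */ return 0; }
static __always_inline int rule_1(struct xdp_md *ctx) { /* generated match code */ return 0; }

SEC("xdp")
int xdp_dropper(struct xdp_md *ctx)
{
    if (rule_0(ctx))        /* rule matched: this is attack traffic */
        return XDP_DROP;
    if (rule_1(ctx))
        return XDP_DROP;
    /* ... one check per deployed rule ... */
    return XDP_PASS;        /* matched no rule: let the packet through */
}

char _license[] SEC("license") = "GPL";
```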
The first part is converting the classic BPF. At first you think this is going to be great: eBPF was kind of designed to be converted to from classic BPF, and the kernel can do it. If you use SO_ATTACH_FILTER with classic BPF and you have an eBPF JIT, the kernel will convert the classic BPF to eBPF for you and JIT it. But that only works for SO_ATTACH_FILTER, so we have to do the conversion ourselves. We ended up writing a compiler from classic BPF to C. The idea is that we can embed the output in our C template and compile it to eBPF with clang, which is also nice because clang can sometimes do optimizations across rules. Most of the instructions map one to one: the ALU operations and most of the jumps have a direct mapping between classic BPF and eBPF, which makes our life much easier.

Here's an example of an instruction that really sucks, though: packet loads. In classic BPF the packet bounds are handled for you. You just write your filter, you say byte 4 needs to match, and if the packet isn't 4 bytes long the kernel simply says your filter does not match, which is fair enough, since you want the packet to be at least that long. The problem is that this exits your program. If we want to combine multiple filters, we can't have the first filter that makes an out-of-bounds packet access return for all of them; we need to run them all, every time. Thankfully eBPF offers a much more generic way of loading memory: you have a pointer, you can dereference it at an offset, and XDP gives us a packet pointer we can use. That's great, but we need to check the bounds everywhere, and that ends up sucking a bit. A single classic BPF instruction, say loading a 32-bit word from byte 12 of the packet, ends up being at least 4 instructions in eBPF, and that doesn't even include the endianness handling: classic BPF always returns packet loads in the host's native endianness, so we also need to swap bytes if we're little-endian, which is most of the time. It's even worse for indirect loads, which load from a variable offset computed from another register; there we have to do much more faffing around and it's at least 6 instructions, which is a lot when we have our 4k limit and want to support hundreds of rules everywhere.

It turns out we can do much better than this, though. If you look at a typical classic BPF program, it loads bytes in increasing order: first it checks the IP header, then a UDP header, then a DNS ID or something. Most of them check the first byte they care about and then either fall through to the next check or return no match. So in most classic BPF programs, the only way for the packet to match is for every single load to happen. That means we can emit a single packet bounds check at the start, sized for the greatest access, and we don't actually change the semantics of the program: we may return from a different place, but we always return the same value, because the packet needs to be at least that long to match in the first place. This ends up being quite a saving.
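To make that concrete, here is a sketch of what a generated match function might look like for a hypothetical rule whose classic BPF does a 32-bit load at byte 12: one bounds check up front sized for the largest offset the filter touches, then raw loads plus a byte swap to recover classic BPF's load semantics. The comparison value is a placeholder.

```c
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

/* Roughly what classic BPF "ld [12]" (load a 32-bit word at byte 12) becomes
 * when generated as C for XDP. */
static __always_inline int rule_example(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Single up-front bounds check for the largest offset this filter
     * touches (bytes 12..15). If the packet is shorter than that, the
     * filter could never have matched anyway, so returning here keeps
     * the original classic BPF semantics. */
    if (data + 16 > data_end)
        return 0;                                  /* no match */

    /* Classic BPF returns loads in host byte order, so swap on little-endian. */
    __u32 word = bpf_ntohl(*(__u32 *)(data + 12));

    /* ... the rest of the filter's comparisons go here ... */
    return word == 0xdeadbeef;                     /* placeholder comparison */
}
```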
On this graph we've compiled some filters. The first two are tcpdump-style filters generated by pcap, and the bottom one is from our bpftools project, a generated classic BPF filter that matches a DNS query. The first column, cBPF, shows how many classic BPF instructions the filter is. The second, cbpfc, which is the name we've given to our classic-BPF-to-C compiler, shows how many eBPF instructions the generated code is after a pass through clang; it's not great, but it's not terrible. The third column shows how many x86 instructions that ends up being once it's JITed by the kernel, and the right column shows, just for comparison, what SO_ATTACH_FILTER produces when the kernel converts your classic BPF to eBPF. It's not an entirely apples-to-apples comparison, because for some reason socket filters in eBPF don't get to access the sk_buff data pointer; they can only use the old load-absolute and load-indirect instructions, which in eBPF turn into calls to the socket load helper. That means making a BPF call, which clobbers registers, and it's a lot of faffing around, which is why the kernel's version is so inefficient. We get almost a 2x improvement by doing this ourselves compared to what the kernel manages for a socket filter.

Now, compile once, run everywhere. This is really key: we really don't want to faff around with clang, compiling all of our BPF and converting everything across the edge. And remember our initial design with several rules one after the other: we want a single program with multiple rules, and we want to selectively enable or disable each and every one of them. At first you'd think we can do this with a map. BPF has maps, we can look up values in them, we could store rule IDs as keys or something and check whether the rule is enabled. That ends up being really expensive, though. A map lookup is four instructions minimum by the time you've set up the arguments for the BPF calling convention, made the call, and checked the return value (and the verifier makes sure you check the return value). It also clobbers a bunch of registers, R0 through R5, which means clang will spill a bunch of registers to the stack and fill them back afterwards. For us, on average, a single map lookup for a rule like this ended up being ten instructions, which is a lot when you only have 4,096 and you need hundreds of rules.

So instead it seems like we want to just modify the ELF, and it turns out that works. If we write code like this, the key part is that we have a single enabled variable per rule (imagine this repeated for every rule), and we load it into a register with a single 64-bit BPF load written in inline assembly. The trick is that if you put a symbol name in the inline assembly, the rule_0_enabled highlighted in yellow here, that symbol shows up in the relocation info of the ELF, and the load determines the value of that register. So if we dump the relocation info of an ELF compiled like this, we find the offset of our load instruction, and at runtime all we have to do is find all the symbols named rule_<id>_enabled and rewrite those loads to either load 0, to disable the rule, or load 1, to enable it. It gets even better: the verifier prunes constant branches like this. The verifier tracks the values held in registers, and since we do one 64-bit constant load it knows the register has a constant value, and it knows we're using it in a jump against a constant. So if the rule is disabled it prunes out the whole rule, and if the rule is enabled it prunes out just the check and the rule is always there. This has zero runtime cost, which is great.
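A minimal sketch of that pattern as I understand it from the talk follows; the macro and symbol names are assumptions, and the user-space side (finding the relocation and patching the 64-bit immediate before loading the program, done in their case with the newtools/ebpf Go library) is not shown.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* The symbol named inside the asm string leaves a relocation in the ELF, so a
 * custom loader can locate this 64-bit load immediate and rewrite its constant
 * to 0 or 1 per rule before the program is handed to the kernel. The verifier
 * then sees a constant condition and prunes the dead branch. */
#define LOAD_CONSTANT(sym, var) asm("%0 = " sym " ll" : "=r"(var))

static __always_inline int rule_0(struct xdp_md *ctx)
{
    long enabled;

    LOAD_CONSTANT("rule_0_enabled", enabled);
    if (!enabled)
        return 0;               /* rule disabled: never matches */

    /* ... generated match code for rule 0 ... */
    return 1;
}
```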
Now, on to debugging. This is all great and we're dropping packets everywhere, but how do we actually figure out what we're dropping when we have a problem? Metrics only go so far. We have great metrics, you can keep them in BPF maps, and we have per-rule counters of dropped packets, so we can tell which rules are dropping how many packets. But if we're searching for a needle in a haystack, one packet that's being dropped or disappearing or something, it can be really hard to figure out what actually happened to it. We really want some kind of tcpdump-like tool, where we write a filter to match the packet we're looking for and we get to look at the packet and see what happened to it.

It turns out we already have all the pieces we need. We just talked about libpcap and tcpdump, and those generate classic BPF; we've just talked about how we can convert classic BPF to eBPF; and we can use perf event output on matching packets. So we can take any tcpdump filter, compile it to classic BPF, compile that to extended BPF, and add a bit of extra eBPF that uses perf output, and that gives us an eBPF program that filters packets and outputs whatever we want to match. For this we updated our classic-BPF-to-eBPF compiler to generate eBPF directly instead of going through C and clang, because that was a lot of faffing around for so little.

So that's roughly what the setup looks like, and the big question is how we hook it in: we need to tail call into this extra program we've made somehow, but how? The answer is that we always tail call into it, and that's fine, because the tail call helper is really nice this way: if you tail call and nothing is attached, it's a no-op, so tail calling into nothing has pretty much zero runtime performance overhead. We can just tail call all the time. And if we do this at the end of the chain rather than at the start, we also get the final action the packet took, because until we get to the final program it's all tail calls and we don't know whether the packet is going to be XDP_DROP or XDP_PASS. The other great part of doing it at the end is for other projects: we're also working on a load balancing project where we modify packets, and there we can get a dump of the modified packet and see what it looks like.

It ends up looking something like this. After the drop ELF we tail call into two separate filter programs, the ones compiled from classic BPF to eBPF that do the perf output, and both of those programs have the action embedded in them, so we can emit it as perf metadata. A user-space daemon then reads from the perf ring buffer and writes the packets out to a pcap. This works really well, because we can even record the action that was taken as interface metadata in the pcap file, and then we have an annotated pcap with all the packets we saw and which XDP action they actually took.
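A minimal sketch of that hook, in the spirit of the xdpcap tool mentioned at the end of the talk: every exit point goes through a helper that tail calls into a program array indexed by the intended XDP action, so an attached capture program both sees the packet and learns its fate. The exact map layout and names here are assumptions.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hook map a capture tool can attach its filter programs to; indexing by the
 * XDP action means the capture program also learns what happened to the packet. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 5);              /* one slot per XDP action */
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} capture_hook SEC(".maps");

/* Instead of `return action;`, every exit point goes through this helper.
 * If a capture program is attached for that action we tail call into it and
 * it performs the final return; if not, the tail call is a no-op and we
 * return the action ourselves. */
static __always_inline int capture_exit(struct xdp_md *ctx, int action)
{
    bpf_tail_call(ctx, &capture_hook, action);
    return action;
}

SEC("xdp")
int drop_or_pass(struct xdp_md *ctx)
{
    /* ... rule matching would go here ... */
    return capture_exit(ctx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```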
This is all great, but there are some pain points, the main one being the complexity limits. We always want to support more and more rules, and we never really know how many; it's hard to place an upper bound on the number of rules we'll need at any given point in time, and different rules end up needing vastly different numbers of instructions. So we've been doing a lot of work on reducing instruction counts and supporting more rules. An early attempt involved trying to brute-force rules into ELFs. The idea was that we would put all of our rules into a single ELF, shove it through the verifier, and if the verifier complained we would assume we had too many rules or had hit some complexity limit. The problem is that there are lots of different complexity limits, they all have different error modes, and they're all really hard to check for. Then, if we hit a limit, we had some terrible heuristic to guesstimate how many instructions each rule used, so we could move some rules into different ELFs, and we would keep doing this until we ended up with a set of ELFs containing all the rules we needed, which we could then chain together. It sounds great, but it was terrible: really hard to debug and very unreproducible, because it depended a lot on which kernel you were actually compiling the rules against, since lots of verifier limits get tweaked all the time, and it was really hard to reproduce. For now we've decided to stick with just increasing the kernel complexity limits to sane values and calling that close enough.

Another thing is that clang's eBPF inline assembly kind of sucks. You can't really specify which opcodes you want to use, you have to use this C-like syntax and guess which instruction it's giving you, and if you mess it up the instructions just get silently dropped, which makes it really fun.

We've also been working on race-free rate limiting. Most of the rules we've discussed either drop or pass the packet, but in lots of situations we actually just want to rate-limit specific packets. Imagine we want to rate-limit new TCP connections: we can just rate-limit incoming SYN packets. But implementing a race-free token bucket cheaply in eBPF is actually really hard, because there are no proper atomic instructions. There's lock xadd, which adds atomically, but you don't get the previous value out; there's no compare-and-swap and no real fetch-and-add. We're working on implementing that.
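To illustrate why that is awkward, here is a hedged sketch of the kind of naive token bucket one might try under those constraints; the map, struct, and parameters are hypothetical. Because the atomic add never returns the old value, the check and the decrement cannot be fused into one atomic test-and-decrement, and the commented races are exactly the problem described above.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical per-source token bucket state. */
struct bucket {
    __u64 tokens;        /* tokens left in the current window */
    __u64 window_start;  /* ktime (ns) when the window was last refilled */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __uint(key_size, sizeof(__u32));            /* e.g. source IPv4 address */
    __uint(value_size, sizeof(struct bucket));
} buckets SEC(".maps");

static __always_inline int bucket_allow(struct bucket *b, __u64 now_ns,
                                        __u64 rate, __u64 window_ns)
{
    /* Refill when the window rolls over. RACE: two CPUs can both observe an
     * expired window and both reset it; without compare-and-swap there is no
     * cheap way to make this exact. */
    if (now_ns - b->window_start > window_ns) {
        b->tokens = rate;
        b->window_start = now_ns;
    }

    if (b->tokens == 0)
        return 0;   /* out of tokens: rate-limit (e.g. XDP_DROP) this packet */

    /* Compiles to an atomic add, but the old value is not returned, so this
     * cannot be an atomic test-and-decrement. RACE: two CPUs can both pass
     * the check above with tokens == 1 and underflow the counter. */
    __sync_fetch_and_add(&b->tokens, -1);
    return 1;       /* allow */
}
```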
And yeah, so thanks, eBPF: it runs classic BPF great. Here are some links to things. bpftools is our p0f and DNS classic BPF compiler, which we use for matching packets and triggering rules. cbpfc is our classic BPF to C, or to eBPF, compiler; that's not quite open source yet, but it should be next week, so the GitHub link will work next week. Same for xdpcap, our XDP packet capture tool that uses cbpfc; that should also be open source next week. And newtools/ebpf is the loader we use; it's entirely in Go, and it's what allows us to do this runtime ELF fudging to enable and disable rules. And that is it.

Q: Excellent talk. Since the QUIC folks are here, and they think it will replace TCP, do you think you can do something like SYN flood detection for QUIC in XDP, or do you need to decrypt the packets?

A: It seems harder. We haven't looked into it that much, but it seems there's a lot less we can do in XDP without being able to decrypt the packets. We can do rate limiting, I think, based on destination IP and things like that, but not much more.

Q: Part of the question is whether the kinds of attacks QUIC would see would be similar to TCP or not, and the minimum packet size in QUIC is not tiny like a SYN, so it's harder to send that many packets, that huge stream of small packets, to bring down the server.

Audience: Sorry, that's not entirely true, because the main bottleneck for an attacker is not the size of the packet or the bandwidth; the main cost is writing a packet to the network card. With QUIC you would just see the same SYN-flood-style attack, at about the same packet rate but with higher bandwidth. Having an MTU-sized initial packet doesn't help you at all; it actually makes things worse. For what it's worth, there is a SYN-cookie-like mechanism built into QUIC: it's called a Retry packet. If I, the server, receive a packet and decide I'm undergoing a DoS, I can simply send a Retry packet to ensure the client is where it says it is. For example, I can choose to not honor the 0-RTT handshake. Those are ultimately the mitigations for DoS in QUIC.

A: That all happens in user space, though?

Audience: Well, that's an implementation choice. If you send a Retry for every single client that comes in, then you can certainly do it in simple XDP with no encryption.

A: So the Retries aren't encrypted?

Audience: The token that's actually sent in the Retry is sent in plain text, so you can totally do this in XDP.

A: And how do you tell new connections from existing connections? What happens if you send a Retry to an existing connection?

Audience: You can tell from the packet type, and that's visible. So there are mitigations one can talk about and build for this, especially for the DoS situation; you can absolutely build something that runs much faster than the rest of the stack does.

Q: You know how you said bpf_perf_event_output was sometimes dropping samples and you didn't care? One of your slides was showing that call. You're happy that it drops when it can't keep up with you?

A: Yeah, it fails gracefully. In the case where we would be overwhelmed with too many packet samples, the extras are dropped, which is what we want, and that works out fine.

Q: But there's no backpressure; you just keep sending even though it's full?

A: We keep sending, but I think if the ring buffer is full it's pretty cheap, because it just checks that it's full and then drops the event.

Q: Do you find out in user space when it's full?

A: I don't think we can know it's full from BPF, but as soon as we call the helper, the helper will know there's no more room. And you can add your own metrics in eBPF for this pretty easily; we have our own metrics for how many packets we've sent over perf and how many we've actually received, so you can get pretty good statistics.

Q: You mentioned you do some other stuff at L7, but do you handle any HTTP drops in XDP?

A: No, we don't do any HTTP inspection in XDP.

Q: So no DPI?

A: No, only for DNS, really.

Q: If there are no more questions, I have one. You mentioned there are different semantics for reading packet data with SO_ATTACH_FILTER versus XDP; in one case it's a pointer and in the other it's a function call. Is it fixable, and do you know why?

A: For anyone who doesn't know this: if you use SO_ATTACH_BPF on a socket with an eBPF program, you're not allowed to read the data pointer from the sk_buff. It's there, it exists in the sk_buff, and if you're privileged and loading, say, a TC filter, you get the same sk_buff and you're allowed to read the data pointer, but from an unprivileged socket filter you cannot.

Q: Why do you want to read the data pointer?

A: It's just a remark that it's much cheaper. If you do SO_ATTACH_BPF right now with an eBPF program, you have to use load-absolute and load-indirect, which end up being a function call, something like ten instructions, because the call has to handle whatever isn't in the linear skb head. Whereas in XDP you have all the data in one frame, in one region of memory, instead of an arbitrary skb.

Audience: From TC you can still read the data pointer yourself as long as you're privileged; there's a check in the verifier, and if you have CAP_SYS_ADMIN you can read the data pointer yourself. I'm not sure what you want to do with the data pointer; I'm just saying you can do it from TC.

A: I don't particularly want to do this; we don't use socket filters.

Audience: I think you can read the data pointer only if the verifier can make sure you're not going to read bytes beyond the frame.

A: Yeah, the verifier limits it; I'm not sure. It seems you can do that in TC, but I haven't looked into it that much.

Alright, thanks. [Applause]
Info
Channel: netdevconf
Views: 1,684
Keywords: netdev, netdevconf, netdev 0x13, Linux networking, XDP, Ddos mitigation, ebpf
Id: 1Yw6YISaSkg
Length: 27min 33sec (1653 seconds)
Published: Mon Jun 03 2019