NUMA Optimizations in the FreeBSD Network Stack

I will discuss optimizations to keep network connections and their resources local to NUMA domains. These changes include:

- Allocating NUMA-local memory to back files sent via sendfile(9).
- Allocating NUMA-local memory for kernel TLS crypto buffers.
- Directing connections to TCP pacers and kTLS workers bound to the local domain.
- Directing incoming connections to nginx workers bound to the local domain, via modifications to SO_REUSEPORT_LB listen sockets.

I will present data from real Netflix servers showing an improvement of almost 2x on AMD EPYC (85 Gb/s to 165 Gb/s) and 1.3x on Intel Xeon (140 Gb/s to 180 Gb/s). I will also present data from the Xeon system showing a 50% reduction in cross-domain traffic.

Drew Gallatin

Drew started working on FreeBSD at Duke in the 90s, and was one of the people behind the FreeBSD/alpha port. He worked on zero-copy TCP optimizations for FreeBSD and was sending data at over 1 Gb/s before gigabit Ethernet was generally available. He spent a decade at Myricom, optimizing their drivers. After a brief hiatus at Google, he landed at Netflix, where he works on optimizing the FreeBSD kernel and network stack for content delivery. He worked on the optimizations to serve unencrypted Netflix traffic at 100 Gb/s, and then on further optimizations to send encrypted traffic at 100 Gb/s.

Captions
For those of you who weren't here for the last talk: I'm Drew Gallatin, I've been a FreeBSD committer since the 90s, and I really, really like making things go fast. The first thing I worked on in FreeBSD was the port to the DEC Alpha, with Doug Rabson, and then I kicked around doing stupid stuff in the network stack. Now I'm really lucky, because I work for Netflix and I get to play with really fast machines that serve real traffic to real people on the real Internet. I'm here to talk about what I'm calling NUMA siloing in the FreeBSD network stack, and what this is really about is serving 200 gigabits per second of TLS to Netflix customers from a single machine, using FreeBSD of course.

Why do we want to serve that much traffic? Since about 2016 we've been serving roughly 100 gigabits per second with kernel TLS from a single one of what we call our flash appliances, and we want to keep driving our costs down and consolidating, so we want to do 200 gigabits per second from a single box. To explain why that's a challenge I first need to talk a little about our workload. We run FreeBSD-CURRENT, and we're basically a web server: we use nginx, and we serve all of our video via sendfile and kernel TLS. As you heard in the last talk, we enable kernel TLS now with the TCP_TXTLS_ENABLE socket option (try to say that five times fast).

So why do we need NUMA for 200 gigabits per second, and what is NUMA anyway? I'll explain that in a little bit, but first let me talk about where we are at 100 and where we need to be for 200. For our original 100G build, in 2016 or so, we started with a Broadwell Xeon, which has about 60 gigabytes per second of memory bandwidth and about 40 lanes of PCI Express. We've since moved to the newer Intel generations, Skylake and Cascade Lake, which have about 90 gigabytes per second of memory bandwidth, which you'll notice is not quite the 100 gigabytes per second (800 gigabits) we're going to need, and a few more PCIe Gen3 lanes, but still not enough.

This diagram will look familiar if you were here for the last talk. The data flow for kernel TLS, as I mentioned before, is: sendfile pulls data in from the disks into memory; to encrypt it, the CPU has to read it; once it's encrypted, the CPU writes it back to memory; and once it's back in memory, the network card reads it in order to send it. Add up all of those 25s and it's pretty easy math: you need about 100 gigabytes per second of memory bandwidth, and from the last slide you can see that the Xeon only has 90.
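To spell out the arithmetic behind those 25s: 200 Gb/s of output is 25 GB/s of payload, and every byte passes over the memory bus four times (the disk DMA writes it into memory, the CPU reads it to encrypt it, the CPU writes the ciphertext back, and the NIC reads it out to send it), so 4 × 25 GB/s = 100 GB/s of memory bandwidth, against the roughly 90 GB/s a single Skylake or Cascade Lake socket provides.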
So how do we get that much memory bandwidth? The simplest thing to do is throw another CPU socket at it. You double everything: twice as much memory bandwidth, twice as many PCIe lanes, and two UPI links connecting the sockets (I'll go into more detail on those later). On these prototype machines we have eight really fast NVMe drives and two 100-gigabit NICs.

Then we thought, why not give AMD a chance, so we built a prototype around AMD as well. When we started this we were looking at the AMD Naples parts, and the interesting thing there is that you can do this in a single socket. Just like the Intel box it has eight NVMe drives, but the AMD box actually has four NICs. I'll get into why a little later in the presentation, but we're not running 400 gigabits; we're running four times 50, basically.

So once we doubled everything we expected a big performance boost, but the performance actually went down. On our normal workload we were getting about 85 gigs on AMD and about 130 gigs on Intel at 80 percent CPU, and crazy stuff was happening: we'd see spikes that would drive our nginx latency way up, which would cause clients to run away in terror. I should mention, in case it wasn't clear before, that most of the testing I do is with real Netflix clients (not the very earliest testing, but most of it), so if you live in San Jose or Chicago you've probably been served a video from one of my machines, and I apologize. Anyway, with no optimization, NUMA was just a non-starter: we threw more hardware at it and got either negative results or not enough positive results to matter. In fact we had avoided NUMA for a long time because of earlier experiments in 2014 and 2015 that looked very similar to this.

So now we have to understand the problem: what is this NUMA stuff? It stands for non-uniform memory architecture, or non-uniform memory access, depending on who you talk to, and it means that some resources are closer to one CPU than to another. Back in the good old days, say 15 years ago, before AMD did HyperTransport and before Intel did QPI, a multi-socket system looked roughly like this: a central I/O hub, a northbridge or whatever you want to call it, sits in the middle; all the CPUs plug into it equally, all the memory plugs in equally, all the disks plug in equally, all the network cards plug in equally, and everybody has equal access to everything. If you're on this CPU and you want to talk to that disk, great, go for it; if you want to store the data in that memory over there, it doesn't really matter. The problem is that those systems were slow, expensive, and complicated to build, so the CPU manufacturers figured out that it's better to build what is essentially a network on the motherboard. You wind up with something that looks like two separate systems tied together by what I'll call a NUMA bus. The stuff on the left side is basically its own computer, the stuff on the right side is its own computer, and each of these red circles is a locality domain, also called a NUMA domain or a NUMA node. What that really means is that if you're on this CPU and you want to read something from that disk, the data has to come across the bus, ideally into your own memory; if you're on this CPU and you want to access that memory over there, it has to cross the NUMA bus; and if you want to send something on this network card and it's stored in that memory, it has to cross the NUMA bus too.
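As a concrete aside (my illustration, not something from the talk): FreeBSD exposes the number of NUMA domains it detected through the vm.ndomains sysctl, so a quick way to see whether a box looks like the two-domain Xeon or the four-domain EPYC described here is something like:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
    int ndomains;
    size_t len = sizeof(ndomains);

    /* vm.ndomains reports how many NUMA domains the kernel is managing. */
    if (sysctlbyname("vm.ndomains", &ndomains, &len, NULL, 0) != 0)
        err(1, "sysctlbyname(vm.ndomains)");
    printf("NUMA domains: %d\n", ndomains);
    return (0);
}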
The problem is that there's only so much bandwidth on that NUMA bus. And once you get to AMD, things look even weirder, which is why we have four network cards on the AMD box, one per red circle. On AMD you have NUMA links between the four different NUMA nodes on the package, and having four NUMA nodes is kind of a disaster, which is why the AMD performance dropped so much more than the Intel.

There's a latency penalty for crossing these links. From everything I've read and seen it's around 50 nanoseconds, give or take, depending on the manufacturer and the revision. The real problem is that when you're pushing a lot of bulk data across the links, 50 nanoseconds can turn into 500 nanoseconds, and in some cases into milliseconds, and that's really bad if what you're trying to do is read kernel text that lives on the other domain, or write a global variable, or touch a vm_page, and you have to wait for some bulk data transfer to get out of the way; CPU utilization goes crazy. As for the bandwidth of the links, the vendors like to obscure it by quoting gigatransfers per fortnight or something, which makes it really hard to figure out what you should actually get, but as far as I can tell it's about 20 gigabytes per second per UPI link and about 40 gigabytes per second per Infinity Fabric link. The AMD numbers are even more complicated because they depend on the memory speed, and on the newer parts there are multiplier factors involved; it's kind of crazy.

So after playing around with lots of little optimizations, things like making the vm_page array be backed by domain-local memory, I decided I was just fiddling with the small stuff. What I really needed was a way to organize things so that the bulk data stays off the NUMA links, because bulk data congests those links and slows down everything you haven't managed to localize. Let me walk through the worst case, where you get everything as wrong as you possibly can while sending data out the network. You start reading from the disk, but whoops, it goes across the NUMA link into the other node's memory, because you weren't paying attention when you allocated the memory. Then you want to encrypt it, so you read it back across the NUMA bus, and whoops, you forgot to allocate the crypto buffer on the right node too, so you write the ciphertext into the wrong node's memory. Then you want to send it, and maybe you should be using the network card up there, but whoops, you send it out this other one instead. You end up crossing the NUMA bus four times and burning roughly 100 gigabytes per second of bandwidth. At that point the fabric saturates, you get CPU stalls, latency spikes, all kinds of crazy stuff.
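Putting numbers on that worst case: at the 200 Gb/s target each of those four crossings carries about 25 GB/s, so roughly 100 GB/s has to move over an interconnect that offers only about 2 × 20 = 40 GB/s between two Xeon sockets (and on the order of 40 GB/s per Infinity Fabric link on EPYC). The fabric is oversubscribed by better than two to one before the CPUs have done anything else, which is why it saturates.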
The best case is the one I showed you at the beginning: you read from the disk into nearby memory, the CPU reads it from nearby memory, encrypts it, writes it back into nearby memory, and then sends it out the network card that's closest to it. That's beautiful, there are no NUMA crossings, and it's how AMD and Intel would really like you to use these machines in an ideal world. So how do we get as close to that best case as we can?

The simplest idea: just pretend it's two machines. Run one VM per NUMA node and pass the hardware through. Except if you do that you double your IPv4 address consumption, and every IPv4 address is precious. At Netflix, when you press play, your client talks to Netflix services running in the Amazon cloud, and that control plane figures out which machines have the file you want, which one is closest to you, which is next closest, and so on, and hands you a list of URLs where you can find the file. If we double the number of machines, we roughly double the work we have to do in AWS; in fact, if we're running VMs we more than double it, because now there's a hypervisor to manage too. So that's a non-starter. The next idea: what if we used multiple IP addresses per box? Wait, multiple IP addresses? We don't want to do that either, for the same reasons. So the real question is: how do we get as close to the best case as possible while using lagg and LACP to combine the NICs behind a single IP address, and while keeping the catalog the same so that AWS doesn't have to do any extra work?

We need to somehow impose order on this chaos. The first idea I came up with, which was not the winner, was what I call disk-centric siloing: try to do everything on the NUMA node where the content actually lives. The other idea was network-centric siloing: try to do everything local to the network card the connection arrived on. If you don't know anything about LACP, what you need to know is that the switch or router you're talking to takes each connection, hashes it on some n-tuple, and decides for itself which of the lagged ports the traffic will arrive on; you have no control over that. So with network-centric siloing, we try to do as much of the work as we can on the NUMA node where the LACP partner decided the connection was going to live.
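To illustrate why the server has no say in the matter (this sketch is purely mine, neither the switch's hash nor lagg(4)'s code): the LACP partner computes the ingress port as a fixed function of the flow's n-tuple, something morally equivalent to:

#include <stddef.h>
#include <stdint.h>

struct flow {
    uint32_t src_ip, dst_ip;     /* the n-tuple the partner hashes on */
    uint16_t src_port, dst_port;
};

/* Toy FNV-1a-style hash; real switches use their own (often secret) mix. */
static unsigned
lacp_ingress_port(const struct flow *f, unsigned nports)
{
    const uint8_t *p = (const uint8_t *)f;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(*f); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    /* The port, and therefore the NUMA domain, falls out of the hash. */
    return (h % nports);
}

Whatever the exact hash is, the point stands: which NIC, and therefore which NUMA domain, a new connection lands on is decided entirely on the far side of the wire.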
Let's talk about the thing that didn't work first. The idea behind disk-centric siloing was to associate a disk controller, really an NVMe drive, with a NUMA node, and then propagate that NUMA affinity up through the VFS layer, so that when we look at a file we know which NUMA node it's associated with. We still have to do all the work to associate network connections with NUMA nodes, and the idea is to move the connection as close to the content as we can, so a connection that comes in on one lagg port may end up going out another. Once everything has been moved, there are zero NUMA crossings for bulk data.

The problems with this were, first, that as I said, there's no way to tell the LACP partner which port you want a connection to arrive on; you just can't do that. So while you're setting up the connection and handling the GET, before you know which content it's about, your ACKs and replies are going out one port, and as soon as you figure out where the content lives they start going out the other port. You can have traffic for the same connection going out both ports, and with TCP that can lead to reordering, which is bad news; I think Randall would be upset with me if I did that. The other problem is that, unbeknownst to me, clients reuse connections and make multiple requests over the same connection. For those of you who love or hate the newish feature where stuff just starts playing all the time on the Netflix home page: it reuses connections for all of that, so you end up with content coming from all of the NUMA nodes over the same connection. I was seeing connections moved around willy-nilly and TCP retransmits going crazy, and I decided it was a bad idea.

So I went back to the other idea, network-centric siloing, which is basically just plumbing, and that's good, because I'm just a plumber. Essentially you have to associate network connections with NUMA nodes, allocate local memory to back the media files, allocate local memory for the crypto buffers, run the TCP pacers on the local node, and choose a local NIC to send the data on.

So how do we do all of that? I'm going to go through some nitty-gritty details of what's been committed and what's in review, so if you're not a developer you may want to check your phone. To associate network connections with NUMA nodes, I added a NUMA domain field to the mbuf packet header; there was a tiny bit of room and I stole it, and that went in a few months ago. I also added a NUMA domain to the ifnet struct, also a few months ago. This is all groundwork, so try to stay awake. Once that was in place, a driver receiving a packet can tag it with the driver's NUMA node as it receives it, and that's in the tree too. I also added a NUMA domain to the inpcb struct, which is in the tree as well. The idea is that a TCP connection is born in the syncache, and at syncache expansion time you have the NUMA node right there in the mbuf that caused the connection to get established, so you can propagate it into the inpcb. The next trick is to make sure you hand that connection to the right nginx worker, and I'll detail that in a little bit.

The other trick is the one I thought was going to be the hard job: allocating local memory for sendfile to back the video files. I actually wrote a gigantic patch to plumb a NUMA node all the way from sendfile down into the VM page allocation routines, and it turns out I didn't need any of it. If you have a first-touch policy and nginx is bound to the right domain, everything just works automatically.
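A minimal userland sketch of the "bind the worker, let first-touch do the rest" recipe, assuming for illustration that each worker knows its target domain and which CPUs belong to it (nginx itself would typically be pointed at the right CPUs with worker_cpu_affinity or cpuset(1); this is not Netflix's code):

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>

#include <err.h>

/*
 * Pin the calling process to one domain's CPUs and restrict its memory
 * policy to that domain with first-touch, so the pages it touches
 * (sendfile page-ins, crypto buffers) end up domain-local.
 */
static void
bind_to_domain(int domain, int first_cpu, int ncpus)
{
    cpuset_t cpus;
    domainset_t doms;

    CPU_ZERO(&cpus);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(first_cpu + i, &cpus);
    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(cpus), &cpus) != 0)
        err(1, "cpuset_setaffinity");

    DOMAINSET_ZERO(&doms);
    DOMAINSET_SET(domain, &doms);
    if (cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(doms), &doms, DOMAINSET_POLICY_FIRSTTOUCH) != 0)
        err(1, "cpuset_setdomain");
}

int
main(void)
{
    /* Assumption for the example: domain 0 owns CPUs 0-15 on this box. */
    bind_to_domain(0, 0, 16);
    return (0);
}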
I want to thank Alan Cox and Konstantin for pointing out my stupidity and making me realize that the VM system already did everything I needed it to do. That was two weeks of my life I'll never get back.

The next trick is allocating local memory for the TLS buffers. As I mentioned in the last presentation, we run a pool of per-CPU kTLS worker threads, and connections are normally just hashed to them, using a software hash on the n-tuple, so that the same connection always goes to the same kTLS worker. What I did was add a filter on NUMA domain in front of that hash, so that connections associated with node 0 are hashed to a worker running on a CPU in node 0, and similarly for node 1. I also gave the kTLS workers a domain allocation policy so that they allocate memory local to their domain. That way we do the crypto on the same domain the connection lives on, and we do it into and out of local memory. The kTLS piece is in review currently.

How do we choose the right lagg port to send on? As I said earlier, mbufs can be tagged with a NUMA domain, so in ip_output and ip6_output we tag the outgoing mbufs. I've done a patch to lagg, which is in the tree and is enabled with the use_numa option, that builds a hierarchy similar to the kTLS one: rather than hashing directly to any lagg port in the system, you first filter by NUMA domain and then choose only among lagg ports connected to a NIC on that domain. Obviously, if there's no NIC on that domain it falls back to hashing across everything, so you can still send even if that lagg port is down. That's in the tree.
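The shape of that selection logic, for both the kTLS worker choice and the lagg egress port choice, is "filter by domain, then hash, with a global fallback". Here is a stand-alone illustration of the pattern (my paraphrase, not the actual lagg(4) or kTLS code; the structures are invented for the example):

#include <stddef.h>
#include <stdint.h>

struct port {
    int numa_domain;    /* domain of the NIC backing this port */
    int up;             /* link state */
};

static const struct port *
pick_port(const struct port *ports, size_t nports, int conn_domain,
    uint32_t flowhash)
{
    const struct port *candidates[16];
    size_t ncand = 0;

    /* Pass 1: only ports whose NIC lives on the connection's domain. */
    for (size_t i = 0; i < nports && ncand < 16; i++)
        if (ports[i].up && ports[i].numa_domain == conn_domain)
            candidates[ncand++] = &ports[i];

    /* Fallback: no usable local port, so hash across every live port. */
    if (ncand == 0)
        for (size_t i = 0; i < nports && ncand < 16; i++)
            if (ports[i].up)
                candidates[ncand++] = &ports[i];

    if (ncand == 0)
        return (NULL);
    return (candidates[flowhash % ncand]);
}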
So how do you choose the right nginx worker? This was the hard part for me. We've had the SO_REUSEPORT_LB stuff for about a year now, which lets multiple threads or processes share the same listen socket; much like lagg, new connections are hashed fairly across those listen sockets, and that's what lets a bunch of nginx workers all listen on port 80 and port 443. The obvious thing to do, and everything is obvious in hindsight, is to filter that by NUMA domain, and you end up with a new socket option. Unfortunately, because of the way nginx works (I can go into the details, sure, why not), the master process starts up, creates all the listen sockets, and then forks off its children, and at least for a mere mortal reading the nginx source there's no way to tell which listen socket is going to go to which child, and therefore to which domain. So the easiest thing for me to do was to make a new socket option that the child calls after it has inherited its listen socket and taken possession of it, and after it has bound itself to its CPUs. At that point the kernel knows you're running on this CPU, which is on this domain, and that you want your listen socket filtered there. That builds up another one of these hierarchical models: first you filter by NUMA domain to a listen socket, and then you hash among all the workers listening on that domain. And like lagg, there's a fallback: if there's nobody listening on that domain, it goes back to hashing among all the listen sockets globally. That's also in review.
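From the worker's point of view the sequence is: inherit the shared listen socket, pin yourself to your CPUs, then tell the kernel to filter that listen socket to the domain you are now running on. Something like the following; note that the option was still in review at the time of this talk, and the names below (TCP_REUSPORT_LB_NUMA, TCP_REUSPORT_LB_NUMA_CURDOM) are how I recall it later landing in FreeBSD, so treat them as assumptions and check netinet/tcp.h on your system:

#include <sys/socket.h>

#include <netinet/in.h>
#include <netinet/tcp.h>

#include <err.h>

static void
filter_listener_to_current_domain(int listen_fd)
{
    /* "Current domain" = the domain of the CPU this process is bound to. */
    int val = TCP_REUSPORT_LB_NUMA_CURDOM;

    if (setsockopt(listen_fd, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA,
        &val, sizeof(val)) != 0)
        err(1, "setsockopt(TCP_REUSPORT_LB_NUMA)");
}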
So let's go back to the diagram where I talked about the worst case. In this model the worst case is when you always get unlucky and your content is always on the wrong domain. Going back to what we talked about before: we're running on the bottom NUMA domain, on the bottom CPU, a request comes in, and we're reading data from a disk on the top node. That's one NUMA bus crossing, and we read it into local memory. Then we read it out of local memory and, yay, we're encrypting it on the right CPU, and we write it into a crypto buffer we were smart enough to allocate on the right node. Then we send it on the local NIC, because that's the domain the connection originally came in on. So in the worst case we have one NUMA crossing: 100 percent of the NVMe reads cross NUMA, which is about 25 gigabytes per second on the fabric, much less than the roughly 40 gigabytes per second of fabric bandwidth. The nicer number is the average case, which is about half a NUMA crossing, because you get lucky about half the time and unlucky about half the time. That's about 50 percent of the reads going across the fabric, or about 12.5 gigabytes per second, and in that case the CPU doesn't saturate and we got 190 gigs. For the four-node box the average case is a bit worse, because you only have a 25 percent chance of getting lucky; 75 percent of the reads cross NUMA, which is a bit more fabric traffic (call it 19 gigabytes per second), but that's still under the 40 gigabytes per second of fabric bandwidth, and we can still do better than 190 gigs.

Now for what everybody came to see. One thing I should mention before the performance results: this has been a game of moving goalposts. When I first started looking at this we were using Naples, the first EPYC generation, and Skylake; since then both of these motherboards have had their CPUs swapped for the latest and greatest from each manufacturer. The first results were from roughly fall of 2018 with the older CPUs; these new results are from just last week, with an AMD Rome CPU and an Intel Cascade Lake CPU. That's also why the Xeon "before" number is lower, which is something I don't entirely understand. The way I got the "before" numbers was to go through and intentionally torpedo all the optimizations I've done, and when I did that I was a little surprised that it came out at 105 rather than 130. I think some of that is due to work Mark and Jeff have done to make things better for NUMA; if you make things better, then turning them off makes things look worse, if that makes any sense. There's some UMA behavior we have turned on at Netflix that tries to return memory to the proper domain: if you allocate something like an mbuf on one domain and free it on the other, it puts the memory back on the domain it came from rather than mixing up the per-domain UMA zones. When you're doing things right you aren't doing many cross-domain frees and it's great; but once you're freeing a lot of things on the wrong domain, that option gets really expensive, because you're taking a lock and moving memory back to the proper domain.

I've measured the interconnect utilization with the Intel PCM tools; they give you a metric that tells you how much of the memory controller traffic was remote versus local, and it goes from 40 percent down to 13 percent. On EPYC, because of the four nodes, things start out even worse, so the improvement is even bigger: we go from 68 gigs to 194 gigs. For people who like visual representations, here is the Xeon before and after, roughly 100 to roughly 200, and here is the utilization on the QPI bus, again going from about 40 percent to about 13 percent; and here is the bandwidth on the AMD, going from 60-ish gigs to 195 gigs. For people who like green screens full of raw data, this is the output from pcm.x showing the memory controller traffic I was talking about; the QPI data traffic as a fraction of memory controller traffic is 0.4, and that's bad.

The next one is my favorite tool, because I wrote it. It's something I call nstat. I got sick of having one window for vmstat and another for netstat, and of either running netstat with an eight-second delay or converting bytes to bits in my head, so I wrote a tool that spits out all the stuff I care about; it's my tool, so I can do what I want with it, and anybody can use it. The output is in gigabits per second, and the important fields are the number of TCP connections, the percent CPU, and things like system calls, interrupts, context switches, how much memory is free in the machine, input and output, and millions of packets per second. This is the before, of course, and this is the after: you can see the 13 percent remote, which is a good number, and the 191-ish gigs with 150,000 TCP connections, at 70-ish percent CPU and about 100,000 context switches per second. Thank you, TCP pacing. And for people who like looking at internal Netflix metrics, these are our internal bandwidth graphs showing each link separately, stacking up to about 190 when the machine finishes ramping up.

Here's the same data for the AMD. I've crossed out the model number because it's not a released part; it's roughly equivalent to the model I mentioned at the beginning of the presentation, except with a lower clock speed, so the real AMD results would be somewhat better than this.
I may be doing AMD a slight disservice by showing it, but I'd imagine the CPU number would be maybe eight or ten percent lower on the real AMD part. The other big frustration with AMD is that they don't export enough counters for us to measure the fabric utilization. We've complained to them about it, and I hear the Linux folks are complaining too, because Linux doesn't have it either, so if you happen to have a good relationship with AMD, please complain about it as well. Anyway, here is the green-screen data showing 194 gigabits per second as the machine gets close to fully ramped up. And this last graph is not as pretty, because we're not used to how these NICs are numbered: they're two-port NICs, so they show up as 0, 2, 4 and 6, no other machine has that many NICs, it doesn't fit, and nobody has ever picked a color for them. This bar is the roughly 200-gig line; the axis goes up to 400 because there are four 100-gig links active in the lagg, but it's never really going to reach 400 because some of them are on PCIe Gen3 x8 links. So that's it. I've rambled on for a long time about something really simple, so if anybody has any questions, this would be the time. [Applause] [Music]

[Audience question, off-mic, about balancing connections across domains] Some of that is handled by management in a different part of the world; it's above my pay grade. In terms of worrying about, say, a million connections on one domain and none on the other: we deal in thousands, tens of thousands, hundreds of thousands of connections, and at that level it's roughly going to be fair, because lagg is hashing to the different NICs in a fair way. Obviously if one link goes down you lose half your bandwidth, but you've still got enough capacity on the NUMA node where the link is up that you'll be fine. Does that sort of answer the question? I think it would be a different story if you were CPU-constrained because a connection was doing more work than you anticipated, if one connection or a small number of connections could somehow cause an inordinate amount of CPU use, but that's not something that can really happen here. On the AMD you've got four NICs, so you have a theoretical bandwidth of 400, well, 300 actually, because it's an older motherboard with only PCIe Gen3, so they're not hooked up at full bandwidth. When I was testing earlier, if I let that machine ramp all the way up I think I got over 200. The problem is that when you do that, since lagg is hashing everything fairly, you're screwing over the people who come in on the links that are limited to 50 gigs: they're going to be bandwidth-constrained, and TCP is going to see congestion, because the NIC is going to be dropping packets on the way out.

Hey, Kirk. [Question, off-mic, presumably about what fraction of the traffic is TLS] 100 percent of it. For capacity planning purposes, and for my performance work, we do everything with 100 percent TLS. It's 60-ish percent CPU on AMD and 70-ish on Intel, and that's come down over the years.
The CPU use now for 100 percent TLS, thanks to a lot of the work that's been done in the VM system by Jeff, Mark, and Konstantin, is down in the upper 50s. The Broadwell machines I was talking about earlier are so close to the memory bandwidth limit that the performance curve looks like a hockey stick: memory bandwidth on one axis, CPU on the other, with a hard limit around 60 gigabytes per second. As you get much past 50 you start climbing up that hockey stick, and on those machines every cache line is sacred; any cache miss you can avoid moves you back down the stick and saves an inordinate amount of CPU. There was an early optimization I did that just avoided looking at the third cache line of an mbuf, and it saved two or three percent of CPU on those machines; the same optimization on a Cascade Lake would probably save almost nothing, because it has excess memory bandwidth. Does that answer your question?

[Question from Adrian] There was an approach where a single listen call would return 16 sockets, on the per-CPU PCBs, plus a call the worker could use to ask which CPU PCB a given socket was on and then place the worker there. Would an approach like that help with matching the nginx worker threads? It might. I think that was part of his RSS work, but it never made it into the tree, I think because of UDP; maybe we can talk afterward, because I'm not familiar with that piece of it.

Going once, going twice. All right, I think I'm done. Thank you. [Applause]
Info
Channel: EuroBSDcon
Views: 7,174
Rating: 5 out of 5
Id: 8NSzkYSX5nY
Length: 40min 30sec (2430 seconds)
Published: Sun Oct 27 2019