De-mystifying interrupt balancing: irqbalance

Captions
Hello everyone, welcome to day three of linux.conf.au. Today's talk will be given by Mr. Peter Waskiewicz, and the talk is titled "De-mystifying interrupt balancing: irqbalance." PJ (Peter) Waskiewicz, please.

Just to let you know, I expected a much smaller turnout — I am not Steven Rostedt giving the ftrace tutorial, so thank you for coming to this. That's one I actually wanted to see, so hopefully I'll get to see it on the video. I'm still watching, waiting for the exodus of people who were expecting ftrace in here.

This talk is something I've been thinking about giving for a number of years, because of the number of questions I get over time about IRQ affinity and how irqbalance actually works. To start with some more fundamental questions, I first want to talk about what it actually means to balance interrupts: what does that mean on a system, what kinds of systems do we try to balance interrupts on, and all of the different scenarios that can happen there. It's always good to know why you're doing something, so we'll talk about why interrupt balancing is important, and about some scenarios people run into in enterprise and on mobile and how they apply to this thing we're trying to define today. And things are not rainbows and unicorns all the time: you may have the best intentions and they can turn into nightmares to debug, so we'll look at why trying to do good interrupt balancing sometimes hurts you. How many of you in here have actually run irqbalance and just ripped your hair out trying to figure out why it did the wrong thing? So that's why you're here — hopefully when you leave you won't rip your hair out as much, and you can actually get some better performance.

Then I want to talk a little more about how policies are actually applied in Linux, specifically around interrupt balancing. I get a lot of common questions here, so this is where we try to pull the curtain back on why we do things in irqbalance and maybe not inside the kernel itself. Then we actually get to talk about irqbalance: how does it work, what data is it pulling out of the kernel, what's exposed via the kernel, and how does it collate all of that together to make the decisions it's making, whether they're good or poor decisions. What are the challenges that we have, and what are some of the improvements we've been making over the years to give a bit more flexibility with irqbalance — more knobs to tune it, things that maybe you have not seen, or you looked at the man page and said "I have really no idea how that's supposed to work," so you ignored it. Hopefully we'll clarify that a bit today, and talk about some of the future improvements we're thinking about. And after that I'd like to open it up to discussion, to see if anyone else has heartburn they want to talk about, or questions about irqbalance.

To start with, as Cody said, I'm PJ Waskiewicz. I've been working in Linux networking for over 12 years, specifically in the kernel, both in the network stack and also on a number of the high-speed network drivers for Intel — the ixgbe, i40e, and ice drivers. With that, I've been focusing a lot on performance and scalability.
When we started scaling, say, the 10-gigabit NICs back eight to ten years ago, interrupt balancing was one of the first and foremost problems we had to solve after we solved the multi-queue problem in the kernel, so this area of high-speed networking is very, very relevant to the interrupt balancing work I've been doing. To support some of that I've also worked in the interrupt core to expose some of these knobs to make irqbalance a little less dumb — and we can all argue about what level of dumbness irqbalance has achieved today. I can say that because I am a co-maintainer of irqbalance, so we are here to try to improve it.

A little more about me: I like to race motorcycles, I like driving over things in the woods in my Jeep, and I recently got married, which will kind of put those two things in jeopardy. There's me at the track, and there's me driving over a dead stump — we did not harm anything in this picture.

Okay, so when I say "balancing interrupts," what does that actually mean to you, and what are we trying to achieve? What's driving the need for this? One of the big things is this multi-core systems war — I call it the core wars, and they are alive and well. We have Intel saying "I can put more cores on a die," then AMD says "well, I can do better than that," and then we've got the arm64 systems over there saying "I've got 32 cores on a die as well." So we have all of these cores that need work to do, and when we have lots and lots of I/O devices we need to feed those cores, so we have to aim interrupts at different cores in some kind of sane fashion to try to get the best use out of all of these extra CPUs that we have. Things get even more complicated when we talk about multi-socket systems, where you have two, four, or eight sockets, all with their own PCI Express lanes for I/O devices to come in — and we're going to get into all of this with NUMA locality, which I think is actually my next bullet.

The memory subsystems have also been getting more complex, and the topologies look a little different. Back in the day you had your PCI slot connected to your I/O hub, and everything was roughly equidistant to memory. These days the PCIe controllers are on the actual CPU die — on the Intel systems they're in the uncore — so you have all of your PCI Express lanes coming out of each CPU socket, and you have your NUMA domains, your memory controllers, hanging off of each CPU socket. So you don't have an equidistant hop to memory if you happen to be physically connected to a PCI Express slot on one socket but all of your memory is in the other socket's NUMA domain. How do you balance interrupts to make that more efficient, if it's even possible? These are things we also have to take into consideration.

I/O devices have had an explosion in their ability to generate lots and lots of I/O. Network devices, back when we started doing this work in 2007 or 2008, had two queues — two receive and two transmit queues — and man, we couldn't believe what we could do with that; the parallelism was mind-boggling. Then we upgraded to sixteen queues each, and then we went whole hog and got 128 queues receive and transmit. How do you start parceling out those I/O queues? This is where virtualization came in and started getting queues dedicated to VMs, and now we have containers, and now our network devices have tens of thousands of queue pairs.
I think the current Mellanox cards out there have 16,000 transmit and 16,000 receive queues, and if they're all generating I/O and causing interrupts to fire on a lot of those queues, how do you actually spread them out to make the best use of them? Storage devices have always sort of lagged behind how network devices have evolved in terms of parallelism: SATA controllers for a long time were just single-queue in and out, then they started to become a little more parallelized, and NVMe by the spec actually defines multiple submission and completion queues, so now those have multiple interrupt sources as well. That leads me into the next point: all of this put together is made even harder to deal with because your I/O devices don't generate just one interrupt anymore. Back in the day you had your legacy interrupts; then we moved to MSI, or message signaled interrupts, which are a single interrupt source; and now I/O devices can generate thousands of interrupts in parallel with MSI-X. So the question is, how do we properly spread all of these things out to all of the CPUs so that we're not burying one CPU with all of the work? Everyone still awake? All I heard was the groans, so I'm going to assume yes.

So given all of that, why is it actually an important task? That previous slide really focused on scalability, and scalability is kind of the natural default here. If you have a 25-gigabit NIC or a 100-gigabit NIC, and you have multiple interrupts and lots and lots of CPUs, you want to fan them out — maybe use RSS, receive side scaling, on the NIC to spread your flows out, or maybe aRFS, accelerated receive flow steering — any way you can figure out how to fan out your actual DMAs. The NIC will generate interrupts that then have to be serviced by CPUs, so we want to make use of those CPUs by spreading out the load. In a lot of the tests we did over the years, the best way to get the best performance is to pair up queues: you'd have a receive and a transmit queue that get married together with a single interrupt, and you try to figure out how the network flows you're running on your CPUs would be pinned to the specific core that interrupt happens to be firing on. If you're able to do this and get this beautiful silo, you get really, really good cache locality on the CPUs, and your performance is about as good as it's going to get. And if someone asks a question at the end, "why can't we just make irqbalance do that?" — I would love to have a chat with you to figure out how to do that. As I said, this is very important for network scalability, to have this cache locality to scale, and NVMe is definitely knocking on the door now, running into some of the same scalability issues without having interrupt alignment.

The next part is a little bit different — a more special case, with cache efficiency. It's a different type of cache efficiency than locality: a custom workload that you're trying to keep within a shared cache. If you know about caching domains, you have L1, which is typically one-to-one with CPUs; your L2 caches are typically shared across a couple of different cores, depending on the model of the CPU; and as you go deeper and deeper in the cache hierarchy, many more CPUs are sharing one of the caches.
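As a rough illustration of the application side of that silo — this isn't from the talk, just a minimal sketch — a worker thread can be pinned to the core (or the handful of cores behind a shared cache) that its queue's interrupt is aimed at. The CPU number below is a made-up placeholder, not anything from the examples in the talk.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pin the calling thread to a set of CPUs. In the "silo" case this would be
 * the single core where the paired RX/TX queue's interrupt fires; in the
 * shared-cache case it would be the few cores behind one L2 or LLC. */
static int pin_to_cpus(const int *cpus, int ncpus)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int cpus[] = { 6 };   /* hypothetical: the core the queue's IRQ is aimed at */
    int rc = pin_to_cpus(cpus, 1);

    if (rc) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return EXIT_FAILURE;
    }
    printf("worker pinned; this flow's processing now stays on one core\n");
    /* ... run the flow's processing loop here ... */
    return 0;
}
```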
So if you have a workload that wants to keep its threads within two or four cores that are on the same shared cache, you can balance the interrupts to fire on any of the CPUs in that shared cache, so you maintain cache locality within your application. Again, this is a very, very special use case; I mention it because irqbalance gives you the knob to hang yourself with if you're trying to do this.

Memory efficiency is another one — I mentioned the NUMA crosstalk. If I have a network device that is physically connected through PCI Express to one CPU socket, and I have all of my memory also allocated on that socket, but all of the interrupts for that PCI Express device are aimed at the other socket's CPUs, then every single interrupt is going to cause cross-NUMA traffic. You're going to be crossing the CPU interconnect, and we have seen worst-case-scenario synthetic tests where we can bury the QPI — even on existing Skylake systems today, which are blazing fast, we can completely saturate that link if we screw this up on purpose. And irqbalance kind of screws it up and makes it pretty painful.

And the last one is power efficiency. If we want to balance interrupts, it's not always about scalability; it could be about trying to collapse interrupts down to a certain subset of cores. As long as there's an interrupt tied to a CPU that can fire on that CPU, we will never allow that CPU to drop down into deep C-states and potentially allow a whole package to go offline — we're keeping caches hot, we're keeping CPUs hot. So on laptops, or even in data centers — particularly in cloud — you want to spin things up very quickly and get all your CPUs going, and then if the workload drops, you want to migrate all your work onto one socket to potentially allow a whole bunch of other sockets to go offline. So this is another reason you would want to balance interrupts.

So the question is, how are all of these related? And the answer is: they're not really related at all. Scalability is one thing — you want to go big and wide, siloed and parallel. Cache efficiency is very, very tight, and if you have all this parallelism going on around it you're going to be thrashing the crap out of your cache. Memory efficiency is kind of related to scalability, but then power efficiency is on the other end of the spectrum. These are all the issues that irqbalance has to try to deal with, so hopefully now you have a little bit of sympathy for it — and if not, that's okay.

So let's just run through a couple of scenarios; these are kind of synthetic, pie-in-the-sky scenarios. Scenario one: your target system wants power efficiency — like a laptop — but interrupts are assigned to each and every CPU core. Worst case scenario: the interrupts keep the CPUs and caches from dropping into C-states, so you fail. Scenario two: you wanted maximum performance and scalability, but — and I bet most of you in this room will raise your hand if I ask how often you've seen this — all of your interrupts on this massively parallel system are on CPU 0. That sucks, because now you have head-of-line blocking: every single interrupt on every device is waiting for CPU 0 to re-enable an interrupt so it can interrupt again and get into its ISR, the interrupt service routine. Now you have essentially made your whole system one core waiting to hand a bunch of cores a little bit of work; you have to wait for CPU 0 to process all the interrupts, and you fail. And the third scenario is if you want a mix of performance and scalability but also want great power efficiency — as they say in Ireland, good luck.
Okay, so just to give a little more context around IRQ contexts — it's a really terrible pun — this is important to understand when we talk about the interrupts in the kernel that you're trying to balance, because there are two main contexts. We have hardware interrupts — these are the things we've been referring to from the I/O devices — and they run in hard-IRQ context in the kernel. These include legacy interrupts: if you know anything about PCI devices, or even ISA devices, you'll see the INTA, INTB, INTC, or INTD pin interrupts. These are called legacy as an homage back to the old IRQ-line days — if you've ever screwed up the jumper settings on your Sound Blaster card... don't let the non-grey hair fool you, I ran into this problem a lot, even when you had the floppy disk to reprogram the virtual jumpers for where your interrupts actually fired. Then in PCI we introduced MSI, the message signaled interrupt. This is a non-shareable interrupt, so every device actually gets its own interrupt line, and it's really just a PCI Express message. And then there's MSI-X, which is many MSI interrupts on a single device. This layout on a system is exposed via /proc/interrupts.

Then there are software interrupts, and these are generated by the kernel. They run in softirq context, and they are still considered an exception-level thing, so the same rules apply to softirqs as to interrupts — as in, you can't sleep inside of an interrupt. Well — I teach a class on how to write device drivers back home at a local university, and a student actually asked me, "can't you sleep in one?" You can sleep in an interrupt, but I don't recommend you do it, because you will have a kernel crash; every time it's different, and you'll be infuriated at what's going wrong. If you want to hear a horror story about a bug that took us three months to actually find that was exactly this, you can pull me aside outside. Anyway, software interrupts are generally used for network receive processing — that's really the bulk of where softirqs run in the kernel. This is the polling mechanism known as NAPI, if anyone has heard of that; it's known as the "new API" and it's been around for about fourteen years. These software interrupts are exposed via /proc/softirqs, and I'll talk a little more about that.

Just to clarify a bit more background: hard IRQs always run higher than softirqs, and you must have interrupts disabled on the CPU. If a hardware interrupt shows up, you'd better make sure that interrupt is disabled on that CPU, because otherwise you will typically have a different type of interrupt fire shortly thereafter — a machine check, delivered via NMI, the non-maskable interrupt — and then you have some problems.
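For reference, the NAPI pattern being described looks roughly like this on the driver side. This is a heavily trimmed, hypothetical sketch — the my_queue structure and its mask register are invented, and the napi_struct would be registered elsewhere with netif_napi_add() — but napi_schedule(), napi_complete_done(), and the hardirq/softirq split are the real mechanism: the hard-IRQ handler schedules the poll, and the NET_RX softirq then runs that poll on the same CPU.

```c
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/io.h>

/* Hypothetical per-queue context; real drivers hang this off their adapter struct. */
struct my_queue {
	struct napi_struct napi;
	void __iomem *irq_mask_reg;	/* invented register for masking this vector */
};

/* Hard-IRQ context: runs on whichever CPU the MSI-X vector is currently
 * affinitized to (/proc/irq/<n>/smp_affinity). */
static irqreturn_t my_msix_handler(int irq, void *data)
{
	struct my_queue *q = data;

	writel(0, q->irq_mask_reg);	/* mask further interrupts from this queue */
	napi_schedule(&q->napi);	/* raise the NET_RX softirq on *this* CPU */
	return IRQ_HANDLED;
}

/* Softirq context: the NAPI poll. It keeps getting run on the CPU where
 * napi_schedule() was called until it runs out of work. */
static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_queue *q = container_of(napi, struct my_queue, napi);
	int done = 0;

	/* done = clean_rx_ring(q, budget);  -- device-specific work elided */

	if (done < budget && napi_complete_done(napi, done))
		writel(1, q->irq_mask_reg);	/* no more work: re-enable the interrupt */
	return done;
}
```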
So, it's a bit of an eye chart and I apologize — and I am going to turn around and look at this. This is what a snapshot of /proc/interrupts looks like, if you have never seen it. The left-most column is the interrupt number, and every other column after that is one column per CPU. This is from an eight-thread, four-core system; I decided to take the snapshot from this and not my dual-socket Skylake with 112 cores, because then there would be 112 columns that I would have to put in two-point font, and it would still wrap about four times. What this shows, over on the right-most side, is the actual interrupt that was created with the request_irq() function call inside a driver. These are hardware interrupts, and each row is a count, per CPU, of where that interrupt has fired since the system last booted. I'll give you time later to look at this, if you have a burning desire to watch this video again. Interrupt 124 is my enp0s31f6 — that is my one-gig device, and it has one interrupt, an MSI interrupt — and you can see this very large number on CPU 6, which indicates that all of the interrupts right now are pinned to CPU 6 for that one device.

If we look at /proc/softirqs, it's a little different. On the left side are all the different softirqs the kernel currently has. If you want to learn about the RCU softirq, Paul is right here and he has a talk right after mine — I highly recommend going to that. The one I'm more interested in is NET_RX: this is that receive-side NAPI polling software interrupt, and you can see that on CPU 6 this number is a lot larger than on the other CPUs. I'll talk a little more about this in a second.

So there are some differences in how affinity works with hardware and software interrupts. As I said, hardware interrupts are generated by the endpoint devices. They come across as a PCI Express message; that puts the CPU into an exception handler, which looks it up in the APIC — the programmable interrupt controller — finds where that registered interrupt service routine is, and then calls it. If you've never seen a PCI Express message, there is a memory write in here that's going across the bus — this is from a PCI Express analyzer, and this is what's happening on the actual bus to generate an interrupt. In contrast, software interrupts are scheduled via the kernel — as I said, there's a do_softirq() function call in the kernel core — and they are run out of the ksoftirqd threads. If you do a process listing with ps you'll see these ksoftirqd/0 through whatever, where "whatever" is the number of cores you have on your system; that's the kernel thread, in brackets, that's actually running the softirqs on that particular CPU. And that is a very, very important point: software interrupts are raised on the CPU where the kernel thread that raises them is currently running. So if you have that kernel thread running on, say, CPU 4 and you raise a softirq, that softirq will fire on CPU 4.

The subtlety here is that for the hardware interrupts there is another knob that irqbalance makes use of: /proc/irq/, then the hardware IRQ number, then smp_affinity. This is only applicable to hardware interrupts. So if you're thinking, "well, I've got hardware interrupts that I'm trying to balance, and then I've got these softirqs kind of running over here, and you're telling me that my network drivers and network devices are using softirqs to do the bulk of their work, and this is the only knob that I've got — how do I affinitize those other things?" — we'll get to that.

This is just a quick listing of what that /proc/irq/149/ directory looks like — in this case 149 is an MSI-X vector on an i40e driver. There are a few hooks here, and the two important ones to look at are smp_affinity — you can see it is writable by root, and this is how we change the affinity for this particular interrupt vector — and then there's this affinity_hint hook also hanging off there, but it's read-only. That is something that's exposed by, and can be programmed by, the driver.
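To make that knob concrete, here's a small user-space sketch — not from the talk — that reads and rewrites /proc/irq/<n>/smp_affinity. The IRQ and CPU numbers are placeholders, it assumes fewer than 64 CPUs for the mask, and, like irqbalance itself, it has to run as root to write the file.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read the current affinity mask for a hardware IRQ.
 * /proc/irq/<irq>/smp_affinity holds a hex CPU bitmask. */
static int read_smp_affinity(int irq, char *buf, size_t len)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Steer a hardware IRQ at a single CPU by writing a hex mask.  This is the
 * same file irqbalance writes; it only moves the *hardware* interrupt, not
 * any softirq work that was already scheduled on another CPU. */
static int set_irq_cpu(int irq, int cpu)
{
    char path[64];
    FILE *f;
    int rc;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    rc = fprintf(f, "%llx\n", 1ULL << cpu) < 0 ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

int main(void)
{
    int irq = 124, cpu = 2;   /* placeholders: IRQ 124 from the example above, CPU 2 */
    char mask[256];

    if (read_smp_affinity(irq, mask, sizeof(mask)) == 0)
        printf("IRQ %d current smp_affinity: %s\n", irq, mask);

    if (set_irq_cpu(irq, cpu) != 0)
        perror("writing smp_affinity (are you root?)");
    return 0;
}
```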
Okay, so how can interrupt balancing go wrong? This is always fun to talk about. Device drivers, when they allocate memory, by default pretty much always allocate on the closest NUMA node. We use kmalloc; we can override that with the _node variant and force a certain NUMA node, and that's usually fraught with danger. But the proximity of that NUMA node can be affected by the PCIe root complex. As I mentioned before, we have PCIe root complexes on each CPU socket, but you really should look at what your vendor did when they built the motherboard, and where those PCI Express lanes are electrically coming out of the sockets. On a lot of the 1U pizza-box servers you may have two sockets, but only one socket's lanes are actually physically connected to the PCI Express slots, so you have a whole bunch of PCIe lanes on the other CPU that are hooked up to nothing. The memory we're allocating is the DMA buffers and the descriptor rings, and where this becomes a problem — again referring to that worst-case scenario I mentioned before — is when you have applications getting scheduled on different cores on different sockets that allocate, say for networking, user-space socket buffers: those could be on a different NUMA node than the NIC. It can be even worse if you're allocating on the NUMA node that the NIC is physically connected to through PCI Express, but your applications are on a different socket — and then hilarity ensues.

So, some diagrams, because pictures are good. This is a very simple picture — a very old picture — of what good interrupt balancing should look like. Here we have an ingress and egress RX/TX queue pair, and we have that silo I referred to, of cache locality and cache goodness, up to a CPU core. Then there's what good interrupt balancing should look like at the whole-system topology level, and this is really kind of hard to parse — I apologize if anyone is actually trying to read all these little numbers. This was taken from a patent application I had on IRQ balancing about nine years ago, so it's fine for me to use it, but it does show how each MSI-X vector gets applied to multiple queue pairs and how they move around between the CPUs in a dynamic way — this was about how to do dynamic power efficiency, but it still applies. And then there's what interrupt balancing really looks like if you just kind of go at it haphazardly, and it's pretty much like that. Yes — apparently this is a 50-lane-wide road in Beijing. I didn't even know they made them that big, but apparently they do.

Okay, so now to irqbalance, and the main flow of operation — how does it actually work? It runs in user space; that is an important distinction. It is privileged — we saw that the /proc/irq/<irq>/smp_affinity file is only writable by root, so it must run as root. And we have information exposed from the kernel to determine all of these various system topologies: we've been talking about CPU layouts, shared cache levels, NUMA nodes, and how they're connected to all of the CPUs. So there are a number of locations that irqbalance will query in the kernel on every cycle as it goes through, to see if anything changed — if you hot-plugged a CPU or pulled memory out. It gets its CPU and cache configuration and layout from /sys/devices/system/cpu, and it gets its NUMA configuration from the path there on the slide.
It also reads /proc/interrupts, which we looked at — that is the list of all of the allocated and also online interrupts. A driver may not be up and running yet — you may not have done the old-school "ifconfig eth0 up" — but the driver may be loaded and ready to go; you just may not have brought the link up. That interrupt will be online and ready to go, and it will show up here, it just won't be doing anything. And then there are the affinity hints per interrupt, if they are exposed — we're going to get to that in a slide or two. They may be populated by the driver, or they may not be; if they are, irqbalance will use them. And there is a scan loop, and right now the default is every 10 seconds. It is changeable via the command line; I don't have the man page options in here somewhere, so figure you can go read the man page — and I'll keep it at "read the man page," not the typical acronym.

irqbalance then takes all of this data, does a whole bunch of backflips and limbo and all this stuff to merge the NUMA configuration with the CPU topology — that is actually a lot of the core code when it's processing — and it builds this entire system topology map internally. It takes all of this — what the interrupt load is, where the interrupts are firing, on which CPUs, and which interrupts are busier than others — compares it with any command-line options that were passed in as to how it's supposed to handle all of this data, and then spits out an answer. And that answer is usually not correct. If it needs to go ahead and make any changes, it writes the new affinities out to that smp_affinity file — that is the thing that actually goes and pokes at the APIC, rewriting the APIC values so those interrupt numbers fire on different CPUs.

And just to reiterate: it does not move the softirqs that were scheduled from hardware interrupts. Alluding back to the NAPI thing for networking: if you have a softirq firing that was scheduled via a hardware interrupt — and remember, softirqs run on the CPU they were scheduled on — then if you move your hardware interrupt over to a different CPU, that hardware interrupt has to fire again on the new CPU to reschedule that softirq over there. This is a common issue I get in scalability questions on the irqbalance mailing list: "this thing ran, and I have this huge network card that's always under load, and I rebalanced interrupts and nothing changed." Then I ask, "well, did you change it through smp_affinity directly?" "Well, yeah, and that didn't fix it either." If you know anything about NAPI, NAPI actually has to run out of work to get back out of polling mode and go back into hardware interrupt mode, so this isn't one of those things where you magically change this value and things move right away.

Yeah, Paul? ... I like stupid questions, because then hopefully they're easy to answer. So the question is: could there be an interface between irqbalance and the NAPI interface in the kernel, to feed back, to move that? That is a great suggestion, I'd never thought of it, and I like it. There's a GitHub page for irqbalance — I'm not going to say "submit patches," because that's the easy answer — but Neil Horman is the other maintainer, he's at Red Hat, and we have an issues section to open up a bug or a tracker, and usually when they're opened we jump right on them. That's a really good suggestion, thank you, Paul.
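Stepping back to that main flow for a moment: nothing like irqbalance's real code, which weighs NUMA nodes, cache domains, banned IRQs, and hints — but a deliberately naive sketch of the shape of that scan cycle might look like this: sample /proc/interrupts, wait the default 10-second interval, diff the counts, and spread any IRQ that fired across the online CPUs by writing smp_affinity. It assumes fewer than 64 CPUs and must run as root.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_IRQS 1024

/* Sum the per-CPU counts for each numbered IRQ line in /proc/interrupts. */
static void sample(unsigned long long *count)
{
    char line[4096];
    FILE *f = fopen("/proc/interrupts", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        char *p = line;
        int irq = (int)strtol(p, &p, 10);

        if (*p != ':' || irq < 0 || irq >= MAX_IRQS)
            continue;               /* skip the header and named lines like NMI:, LOC: */
        p++;
        count[irq] = 0;
        for (;;) {
            char *end;
            unsigned long long v = strtoull(p, &end, 10);

            if (end == p)
                break;              /* hit the chip/driver name columns */
            count[irq] += v;
            p = end;
        }
    }
    fclose(f);
}

static void set_affinity(int irq, int cpu)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return;                     /* some IRQs can't be moved at all */
    fprintf(f, "%llx\n", 1ULL << cpu);
    fclose(f);
}

int main(void)
{
    static unsigned long long before[MAX_IRQS], after[MAX_IRQS];
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    int next_cpu = 0;

    for (;;) {
        sample(before);
        sleep(10);                  /* same default interval irqbalance uses */
        sample(after);

        /* Naive placement: any IRQ that fired since the last sample gets
         * round-robined onto the next online CPU. */
        for (int irq = 0; irq < MAX_IRQS; irq++) {
            if (after[irq] > before[irq]) {
                set_affinity(irq, next_cpu);
                next_cpu = (next_cpu + 1) % (int)ncpu;
            }
        }
    }
    return 0;
}
```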
[Audience] Will you be able to use the same interface for the software interrupts — the RCU softirqs — and move those around as well? — I was hoping there's this thing called kernel.org that I can submit patches against.

Okay, so now to move a little bit more to policy enforcement. A lot of the work around irqbalance in the last few years has been around how we enforce policy, but a lot of the questions I also get are, "why doesn't interrupt balancing just happen inside the kernel?" My personal take: I've actually tried to submit patches numerous times to allow the kernel, or the device drivers, to balance their own interrupts, and that gets shot down, because policy is always an entertaining conversation to have with kernel developers. I agree with the concept, at a certain level, that user space should be making decisions on policy as to how the system works, and the kernel should be responsible for providing all the information needed for that policy to be reasonably applied. And you can see from the success of irqbalance's policy that maybe we don't have enough information coming out of the kernel — and we may never have enough, because we may not get scheduler information about where applications are running.

So I just kind of said it: the policy lives in user space, and we need the knobs to be able to get the right information to make the right calls. The common question is, why doesn't the kernel just balance it if it already has all the information? It knows where the memory is being allocated, on which NUMA nodes; it knows where it allocated its queue resources; it knows where it allocates the interrupts; and it should just say "I want them right there, because that's where I allocated all of the system memory for that ring that I want to affinitize to." And the answer is: see the first bullet point — policy is always an entertaining conversation to have with kernel developers. Seriously though, a few years back Christoph Hellwig and I, at Plumbers, started talking about this as he was working on the NVMe patches and realizing, "well, crap, I can't affinitize my interrupts on this multi-queue device — that's just stupid." And then fast-rewind about eight years before that, when I submitted the same patches to do it for networking, and Christoph shot it down saying "we don't do policy."

There was a question up there — yes. So you're saying that there are exceptions to policy being applied in the kernel versus user space, and that is accurate. This is one that we are actively trying to work on to allow; we do have a little bit of wiggle room, but we still can't apply an actual affinity directly to the IRQ core from a driver yet — and I use "yet" because one of these days those patches will go in. Okay, thanks.

So, how do we bend irqbalance to our will? Everyone's solution, probably, is: kill irqbalance and manually place the interrupts. This is the typical thing I hear people do — or they don't even install it — and there's this neat little set_irq_affinity script that has been running around the internet for years, which is basically what we wrote, and it kind of went viral, I guess, is the term today. This is something that works for some things, but what happens if your system topology changes? What happens if a workload fires up and starts hammering on a CPU that you didn't expect? Now you have no flexibility to rebalance the interrupts to accommodate that. Solution two is: pray. And that usually ends up with that lovely traffic-jam picture.
Solution three gets into some of the techniques people are using today. There are command-line options to irqbalance to ban certain IRQs — this has been around for a bit — and when you ban those interrupts, you can manually place them yourself. So it's a bit more robust than solution one: it allows irqbalance to still move the other things around. The problem is that it might be moving those other things around, and it can't always make the right decisions, because you won't let it balance the banned ones away if something gets hot. So this is still a slightly static configuration, with a little bit of fluidity.

And then there's solution four. This is a relatively new thing we've added called policy scripts: a programmatic way to define the topology, where we suggest where we want certain interrupts, we allow the other interrupts to move around, and it's a very nice way to override the global settings that irqbalance currently has. So here's the hintpolicy option — something you can pair with a shell script, the policy script thing — and the hintpolicy is a global setting. And this is where — okay, now I remember what this bullet is — this is where the driver is trying to expose its suggested affinity. It's a hint, and the driver needs to call irq_set_affinity_hint(); that is that affinity_hint thing. If the driver does not expose that, then irqbalance is going to make a decision for you, or you manually place that interrupt. The policy scripts are nice because they don't require the driver to do that: you can specify "I want to hint this thing over to this other CPU," and you can define it in these key/value pairs — there's more information in the man page. You can specify the policy for those interrupts at the NUMA level or at a certain cache level — you can have it at the LLC, at the L2 level, at the L1 level (which would not be a great idea), or at the physical core level. So this gives you much finer-grained control, and we've seen some people use this and actually have some pretty good success with it. They've been happy, and I like trying to make people happy. So that's where we're at today.

So where are we trying to go next? Paul had a great suggestion to solve the softirq rebalancing issue, but of those knobs I was mentioning that we'd need, one of the things we don't have is scheduler feedback — and that's one of the things I don't think we would get — knowing where applications are starting and where they're getting scheduled, so that we don't have to taskset them and force them onto a CPU. If irqbalance could get something dynamic, we could try to move interrupts around to follow network flows that are moving around between different cores. This goes back to the policy point: allow drivers to actually have better control over placing their interrupts. Drivers know where they're allocating memory, and they know where it would be best suited for an interrupt to fire, so we should give them better control — stay tuned.

Some of the other improvements we've been working on — and this feeds into the Internet of Things theme, I had to slip it in here somewhere: irqbalance has always been focused more on PCIe and PCI devices; however, a lot of the ARM platforms have been starting to use it, and we've been finding that those devices don't get processed quite correctly by irqbalance — I know, that's a shocker. So we've been getting some patches from the community to help out, which is great.
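Going back to the hinting mechanism and the point about drivers placing their own interrupts: the interface a driver uses today is irq_set_affinity_hint(). Here's a rough kernel-side sketch — the my_vector structure and the helper names are invented; only irq_set_affinity_hint(), dev_to_node(), and the cpumask helpers are the real API — of a driver spreading its MSI-X vectors over the CPUs of its local NUMA node and publishing each choice as a hint for irqbalance (or a policy script) to pick up.

```c
#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/numa.h>

/* Hypothetical per-vector context. */
struct my_vector {
	int irq;
	int cpu;
};

/* Spread this device's MSI-X vectors over the CPUs of its local NUMA node
 * and publish each choice via the affinity hint.  The hint then shows up
 * read-only in /proc/irq/<n>/affinity_hint for irqbalance to consume. */
static void my_hint_vectors(struct pci_dev *pdev,
			    struct my_vector *vec, int nvec)
{
	int node = dev_to_node(&pdev->dev);
	const struct cpumask *local;
	int cpu, i;

	local = (node == NUMA_NO_NODE) ? cpu_online_mask : cpumask_of_node(node);
	if (cpumask_empty(local))
		local = cpu_online_mask;

	cpu = cpumask_first(local);
	for (i = 0; i < nvec; i++) {
		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(local);	/* wrap if more vectors than CPUs */
		vec[i].cpu = cpu;
		irq_set_affinity_hint(vec[i].irq, cpumask_of(cpu));
		cpu = cpumask_next(cpu, local);
	}
}

/* Hints should be cleared before the vectors are freed on teardown. */
static void my_clear_hints(struct my_vector *vec, int nvec)
{
	int i;

	for (i = 0; i < nvec; i++)
		irq_set_affinity_hint(vec[i].irq, NULL);
}
```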
I think that's where we can really, really benefit. All of these are really archaic command-line tools, and it would be nice to have some kind of — well, there is an irqbalance UI, it is in the irqbalance tree, and it is not maintained by Neil and me — but I would like to see something around this, to just make it easier for people to actually configure things, so that when they think they've configured something correctly they're not left asking "why the hell didn't that work?" I think that's an area we can really improve on.

So, to sum up — you can all wake up now. Interrupt balancing is not a simple operation. There are a lot of reasons you might want to balance interrupts, and the reason may not match what actually happens in reality. In order to do this, though, you need complete system topology visibility to balance everything correctly. And those use cases, scalability and power efficiency — they're two bookends, and understanding them is really critical to getting interrupt balancing right. I threw these out just to tease your noodle a little: how does this apply to virtual machines, virtual CPUs, and pass-through hardware? Not well. How's that? So with that, I'd like to open it up to questions for a few minutes. Yes — sorry.

[Audience] So the answer to my question is going to be either "it's too hard" or "it's very simple." My previous employer, who I won't name, made VoIP hardware; for every ISDN connection it needs to generate 60 interrupts a second. With RHEL 6 kernels, when you look at /proc/interrupts, it's smeared beautifully over all the CPUs — the driver was multi-threaded, it all worked — and that was a 2.6 kernel. In RHEL 7 it's like 66% on CPU 1 and the rest is smeared. What changed, do you know? — So, I don't know in that particular case; it can be a lot of different things. There were some irqbalance changes: by default we were using the affinity hint, and that was literally setting the affinity to all the CPUs it wanted to run on, so the kernel would effectively place it without irqbalance at all. So my answer, without looking at the code to see what changed, is that I can assure you the interrupt core did not have any hooks to allow us to do that programmatically that I was aware of, because I had to go and write the other stuff.

[Audience] You talked at the beginning, for scaling particularly large network loads, about the need to spread across cores. I've heard other people mention that if you've got a high CPU load you may actually want to dedicate a core to processing all your interrupts. What are the circumstances, or the variables, under which you would decide to go one way or the other? — Yeah, that's a good question. The very high-level, hand-wavy, simple answer is that it depends on the workload. If you have lots and lots of small packets generating lots of interrupts, NAPI doesn't really ever kick in very well, so you're going to have a much higher interrupt rate; in that case you may want to collapse them down to, say, eight CPUs. For bulk transfers it really depends on how much data you're moving and whether you want to keep your cores wide. It also has to do with the network hardware: some vendors use RSS, receive side scaling, to fan the flows out, and some of them fall over past a certain number of queues brought online.
I know on the Intel hardware — the 82599 — if you went above 16 receive queues, it became a little marginal at that point.

Yes, Paul? [Audience] One of the fun things we've had with RCU, and in some academic settings, is virtualization, which you mentioned — and that's one way you actually can sleep in an interrupt handler. Is that something you have some way of taking account of, or are looking at at all? — For irqbalance, the simple answer is no, because once we're inside the virtual machine we really don't see a difference: we see an interrupt vector, and it's at the kernel's discretion how it's going to handle sleeping. We can balance an interrupt, but it may not be a physical interrupt — if it's an SR-IOV VF, it'll be going through the x2APIC instead — and we can balance it within the virtual machine, but where it physically lands after that depends on however the translation happens.

I think that's all we have time for. Could I get you to please give a round of applause for Peter? Thank you.
Info
Channel: linux.conf.au
Keywords: lca, lca2019, linux.conf.au, linux, foss, opensource, PJ (Peter) Waskiewicz
Id: hjMWVrqrt2U
Length: 45min 33sec (2733 seconds)
Published: Wed Jan 23 2019