An Overview of Scheduling in the FreeBSD Kernel: Marshall Kirk McKusick

Captions
Good day, and welcome to my virtual lecture at the virtual BSDCan conference. Today I'm going to be talking about an overview of scheduling in the FreeBSD kernel. We're going to look at the scheduler, dive down into how it works, and in particular look at some of the more recent changes that have been made to cope with the ever larger number of cores in the systems on which we're running.

Scheduling is how we decide when, where, and for how long to run all the threads in the system. There are threads in the operating system itself and threads in the applications that are running; a particular process may have a single thread or a whole set of threads running within it. When it comes to scheduling, we are looking at all of those threads, and they get divided into five classes.

First we have the ithread class, which covers everything running in the bottom half of the kernel. For the most part those are the interrupt handlers, all the asynchronous activity: disk interrupts, network interrupts, timer interrupts, anything that delivers interrupts and requires a thread to run in order to process them.

We have the kern class, which covers all the threads running in the top half of the kernel. The top half is all the synchronous work the kernel does, mostly driven by system calls coming in from applications. You do a read or a write or an open or a close; that system call comes into the kernel, and while it is running in the top half it gets one of these kern-level priorities. Those tend to be higher than the priorities of processes running in user space, because we want to get threads through the kernel and back out quickly so they are not holding locks that could compete with other threads trying to get at the same data structures.

Between those and the timeshare class we have the real-time priorities, which are for user processes that are real-time processes; in a moment we'll dig a little deeper into how those priorities get set. User processes for the most part run in the timeshare range, and this is the set of priorities we mostly focus on when we talk about schedulers, because it's the set of priorities that the kernel is devising for each of the processes running in the system. Finally, at the very lowest set of priorities, we have what are called the idle priorities. These are for very much background tasks, something like a screensaver, some activity that should not run if there is anything else in the system that wants to go on.

The numeric priorities run from 0 to 255, and higher values imply lower levels of service; that is to say, 0 is the highest priority and 255 is the lowest. The ithread and kern classes are managed by the kernel. The real-time and idle priorities are managed by user processes: the user processes simply set them, and the kernel works with whatever they are set to. Finally, the timeshare class is where management of the priorities is shared between the kernel and user processes. User processes can influence these, for example by raising or lowering their nice value, which biases the priority up or down, but the kernel actually selects what the base priority ought to be.
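To make the class layout concrete, here is a small C sketch. The 0-to-255 span, the ordering of the classes, and the rule that lower numbers mean better service all come from the talk; the specific boundary values and the helper function below are illustrative assumptions, not the kernel's actual sys/priority.h definitions.

    /*
     * Illustrative sketch of the five FreeBSD priority classes.
     * Lower numeric value == higher priority.  The boundaries here are
     * assumed for illustration, not the real sys/priority.h values.
     */
    #define PRI_MIN            0     /* highest priority */
    #define PRI_MAX            255   /* lowest priority */

    #define PRI_MIN_ITHD       0     /* bottom-half interrupt threads */
    #define PRI_MIN_KERN       48    /* top half of the kernel (assumed boundary) */
    #define PRI_MIN_REALTIME   80    /* user real-time threads (assumed boundary) */
    #define PRI_MIN_TIMESHARE  120   /* ordinary timeshare threads (assumed boundary) */
    #define PRI_MIN_IDLE       224   /* idle-class background threads (assumed boundary) */

    enum pri_class { ITHD, KERN, REALTIME, TIMESHARE, IDLE };

    /* Map a numeric priority back to its class (sketch only). */
    static enum pri_class
    pri_to_class(int pri)
    {
        if (pri < PRI_MIN_KERN)
            return ITHD;
        if (pri < PRI_MIN_REALTIME)
            return KERN;
        if (pri < PRI_MIN_TIMESHARE)
            return REALTIME;
        if (pri < PRI_MIN_IDLE)
            return TIMESHARE;
        return IDLE;
    }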
Let's look at each of these types of priorities in turn. For the kernel priorities, the kernel is pretty much in charge, though the system administrator can fiddle with some of them if they want to. Typically the highest interrupt-thread priority is given to things that need the lowest latency and also have high frequency. Things like network packets tend to get a high priority, whereas terminals, a serial line coming in, tend to get a low priority. In the old days, though, we used to use serial lines to do networking, with a thing called Serial Line IP, and in that case characters were coming in much faster than a user would typically type them, so it was important that the serial line artificially be given a higher priority so that you wouldn't lose characters. But for the most part we don't make any changes to the kernel priorities.

That really leaves three areas we can work with. The first of these is real time. Here the processes set the specific priority and the kernel doesn't second-guess them: it doesn't raise or lower or change them, it just runs with what is set. The effect is that if you set a real-time priority and that process then goes into an infinite loop and never gives up the CPU, the system will effectively look like it's locked up, because the only things the kernel will be able to run are the interrupt threads; none of the top half of the kernel will ever run and none of the user level will ever run. If you have a shell and you're trying to fix something, you're out of luck. So you have to be very careful, if you're using real-time priorities, not to let yourself get into an infinite loop, and to give up the processor periodically so that other things can happen.
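As an aside, the way a user process hands the kernel one of these fixed real-time (or idle) priorities on FreeBSD is the rtprio(2) system call. A minimal sketch, with the priority value chosen arbitrarily for illustration:

    #include <sys/types.h>
    #include <sys/rtprio.h>
    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Request a fixed real-time priority for this process.
         * Lower prio values are more favoured; 10 is arbitrary here. */
        struct rtprio rtp = { .type = RTP_PRIO_REALTIME, .prio = 10 };

        if (rtprio(RTP_SET, 0 /* 0 == calling process */, &rtp) == -1)
            err(1, "rtprio");

        /* From here on the kernel will not adjust this priority, so the
         * process must be careful to block or yield periodically. */
        printf("now running with a fixed real-time priority\n");
        return 0;
    }

Once this call succeeds, the warning above applies in full: the kernel will not lower the priority on its own, so a busy loop here can effectively lock out everything except the interrupt threads.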
The next two classes are handled by the schedulers that are available. The primary scheduler we use today is the interactive scheduler, the so-called ULE, and then we still have the traditional scheduler, the so-called 4BSD scheduler. The 4BSD scheduler was actually written in 1978 by Bill Joy and myself as a stop-gap measure until we had time to do something right, and it lasted well into the 2000s. ULE first came in FreeBSD 5 and was originally supposed to become the default scheduler, but due to various problems it had, it wasn't until FreeBSD 8 that it could be made the full-time default scheduler for the system. 4BSD is still there; it's a very simple scheduler, and it doesn't have a lot of the abilities that ULE has, as you'll see, but it's still useful for certain embedded applications where you have a small processor and a small number of processes running and you don't need all the bells and whistles of ULE, just something simple and fast. I'm not going to talk much about it in this lecture; I'm going to focus on ULE, because it is the default scheduler and it's the one that has the capabilities we need on modern multi-core systems. ULE, as we will see, deals with things like processor affinity, and it actively calculates an interactivity score so that it can figure out which processes are batch, long-running compute processes, and which ones users are interacting with, so that it can give better priority, and better response time, to the interactive processes.

Finally, the idle scheduler: much like the real-time class, the priorities are simply set by the system administrator; the kernel doesn't mess around with them, it simply follows them. But unlike real time, if one of these goes into an infinite loop it's not a problem, because anything else in the system that wants to run will preempt it. So the idle priorities really are not going to get you into any trouble.

Now, before we start diving down into the details, I want to talk about how scheduling is really divided into two different parts: what I call the low-level scheduler and the high-level scheduler. The low-level scheduler is the scheduling that runs when we do a context switch, and on a modern processor that's busy, context switching is something that occurs tens of thousands or even hundreds of thousands of times a second. So we need a very short time to figure out what we're going to run next. If the current process blocks and says "I don't need to run now, do something else," we do not want to spend a lot of time figuring out what the next process ought to be. The low-level scheduler, as we'll see, is just a set of priority queues, and it simply finds the highest-priority thing to run and runs it. That's a very simple process: find a queue that's not empty, pick the first thing off the list, run it.

The high-level scheduler is the part that figures out where those different things should be and what their priorities ought to be. In the case of real time, those decisions are made out in user application land: the application figures out what the priorities of each of its threads ought to be, presumably reevaluates that periodically, and can change those priorities. For ULE or 4BSD, these decisions are made inside the kernel itself, and as we will see, this is done much less frequently. For example, 4BSD once a second would run through all the processes in the system and recalculate what their priorities ought to be based on what they had been doing recently. ULE goes a step beyond that: ULE wants to never have to look at every process to make decisions about priorities. As you'll see, it tracks within each process the information it needs to decide whether its priority should be raised or lowered and whether it should be considered interactive or batch, and thus figures out that priority in a way that doesn't require looking at everything all at once. And again, for the idle class, much like the real-time class, the high-level decisions are made in the user-level application, which can periodically decide whether it wants to change the priorities it has set; the kernel only deals with it at the low level, that is, where the threads sit in the queues and where they ought to be run.

This gets us to the low-level part of the scheduling. The 4BSD scheduler just has a single global set of run queues organized from highest to lowest priority. It was designed in a day when we didn't have multiprocessors, so a single set of queues for all of the processes made a certain amount of sense; even when you have a small amount of multiprocessing you don't have too much contention for the lock you need in order to go look at those queues. As you'll see with ULE, that's no longer true.
So 4BSD just had the one set of queues: it would scan the list, find the highest-priority thread, and run it. The ULE scheduler actually uses three sets of run queues, and there's a set of run queues for each CPU: each core in the system has its own set. So when a processor stops running one thing and wants to select another, it doesn't look across all the queues in the system, it just looks at its own.

In the case of ULE, each CPU has three queues from which to work. First of all we have the real-time queue, and on it are all of the kernel threads, both top and bottom half, all of the real-time processes, and those of the timeshare threads that are classified as interactive. This is organized as a set of priorities from 0 to 171, and this queue is the first one the scheduler will look at; as we'll see later, as long as there is anything in this queue, that's what it will pick to run. When this one is empty it moves on to what's called the batch queue, which has all of the timeshare threads that are classified as batch, and it's organized as a thing called a calendar queue; I'll have a slide a little later that explains how calendar queues work. And then finally there's an idle queue, which holds the processes in that bottom idle set of priorities. So at the low level we check the real-time queue; if there's nothing there we check the batch queue; and if there's nothing there, we check the idle queue. And as I said, unlike the old scheduler, where we had a single global queue that we had to lock in order to manipulate, here we just have the queues for each processor, so a processor doesn't have to lock its queues because it is the only CPU that will ever look at them. The one exception is that we may need to move threads from one CPU to another; this is called load balancing, and we'll talk about how it works in a later slide.

Here are the priority-based queues. This kind of priority-based queue is the same for the old 4BSD scheduler and for the first and third of the ULE queues. The priority queues are just organized as an array of queue heads, linked lists, and what we do, starting at the lowest-numbered one, is scan through until we find one that has something on it. So initially we find that priority 95 has an emacs on it; we take that emacs off the list and run it, and if it goes to sleep it's just gone until such time as it wakes up. If it uses up its time slice, it comes back and gets put at the end of the queue at priority 95, which in this case would be the only thing there, so it would essentially run until it went to sleep. There would then be nothing until we get down to priority 120, where we have rogue and vi: rogue runs, and if it uses up its time slice it goes to the end of that queue, so we run vi and rogue and vi and rogue until such time as both of them are sleeping. Then we get further down and start running xv and firefox, and so on and so forth. And if all of these were sleeping, so there was nothing on this set of queues, we would move on, in the case of ULE, to the batch queue.

Now, we do not want to have to scan through 10 or 20 or 30 or 40 queues to find which one has something on it; if you think about cache lines, that's a whole lot of cache lines we would have to touch, since each queue head is at least eight bytes. So to speed up finding a non-empty queue we have a bit array, called the status array: a bit is set to one if there's something in the corresponding queue and zero if there is nothing in that queue. We can just pick up that single word or two, scan through to find the first bit that's set, and that gives us the index of a queue that has something in it; we go directly to that queue, pull the first thing off it, and run it.
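That status-word trick is easy to sketch in C. The code below is a toy model of the idea rather than the kernel's actual runq code; it assumes 64 queues so that a single 64-bit word can describe them all:

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/queue.h>

    #define NQUEUES 64                      /* assumed: fits in one 64-bit status word */

    struct thread {
        int                 td_priority;
        TAILQ_ENTRY(thread) td_runq;
    };
    TAILQ_HEAD(rq_head, thread);

    struct runq {
        uint64_t       rq_status;           /* bit i set iff rq_queues[i] is non-empty */
        struct rq_head rq_queues[NQUEUES];
    };

    /* Dequeue the first thread from the highest-priority non-empty queue. */
    static struct thread *
    runq_choose(struct runq *rq)
    {
        if (rq->rq_status == 0)
            return NULL;                               /* this CPU has nothing runnable */

        int idx = __builtin_ctzll(rq->rq_status);      /* first set bit == best queue */
        struct thread *td = TAILQ_FIRST(&rq->rq_queues[idx]);

        TAILQ_REMOVE(&rq->rq_queues[idx], td, td_runq);
        if (TAILQ_EMPTY(&rq->rq_queues[idx]))
            rq->rq_status &= ~(UINT64_C(1) << idx);    /* queue drained: clear its bit */
        return td;
    }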
For ULE, when we get to the batch level we switch from the pure priority-based queues to something called a calendar queue. The idea of a calendar queue is to be fairer than we are with the priority queues: with a priority queue, a high-priority thing can essentially starve anything at a lower priority from ever getting to run. Now, if we've done our work right, nothing on those higher-priority queues is going to run for very long; if it's interactive, it's supposed to do a little bit and then go to sleep. If it starts misbehaving, in the sense of using a lot of cycles, then we reclassify it as batch and it ends up down here with the other batch processes; and if it then goes back to behaving itself, perhaps it gets to move back to the interactive queue.

So the idea of the calendar queue is that we want to be fair and let everything run at least a bit. Things that have higher priority should get to run more than things at lower priority, but we don't ever want to get to the point where something that's just running continuously at a higher priority blocks all the lower things from making progress. A calendar queue works as a circular queue: the size of the queue is NQ, the number of entries, and we have a pointer, runq, which points to the current entry on which we are operating. What happens is that once we get to runq, we run everything that's on that entry's list; in this example there's only one thing, the C compiler. It runs, and it might go to sleep, but it will probably just run until we decide it's not supposed to run anymore because it has used up its time slice, and at that point we put it back onto the calendar queue.

Where we put it back is a function of its priority. We have a second pointer, insq, which is the point at which we're inserting, and it keeps moving along as well, in a way I'll describe in a moment. When a thread has used up its time slice, we insert it at the location insq plus the difference between its priority and the base batch priority (the 171 boundary, or whatever the best batch priority is), modulo the size of the queue. So if the thread is running at that best batch priority, the difference is zero and it simply goes in right at insq. But suppose its priority is 10 worse than the best: then it goes in at insq plus 10, mod the size of the queue of course, so it comes back much further down the list, which means a bunch of other things are going to get to run before it gets to run again.
So things that have a higher batch priority get to run in almost every time slot as it comes along, and things with lower priority only get to run every third time, or every eighth time, or every twelfth time, but everything gets some opportunity: at least once each time around the queue it gets to run. Now, in some cases we get to a slot that has multiple things on its list, so the first one runs and gets reinserted somewhere ahead, then the next one runs and gets reinserted somewhere ahead, as described above. Lisp might be at a higher priority, so it goes just one or two slots ahead of where it came off, whereas troff might be at a lower priority, so it gets put way up ahead somewhere.

As for insq, we increment it every 10 milliseconds, so it just plods along. If we get stuck with a whole lot of things to do, insq may actually streak out ahead of us, and the net effect of that is to push the lower-priority things even further into the future. Generally speaking, when runq has done everything in its slot and that slot becomes empty, runq increments to the next slot, and after incrementing runq, if insq is equal to runq, we also increment insq, so insq is always at least one slot ahead of runq. In most cases it just walks along one slot ahead of runq, but if the system gets particularly busy it may get further ahead, and the idea there, again, is to push things further out into the future.

Just to summarize what I said about the ULE run priority: if the set of priority queues, which includes the real-time threads, has anything in it, we select the first thread in the highest-priority non-empty queue; we saw that a couple of slides ago. If that is completely empty, then the batch threads run from the calendar queue, starting from the first thread at the current entry. The time slice we use there is typically on the order of five or ten milliseconds, but if there are a lot of entries in the queue we divide that slice by the number of entries, though we never let it get below two or three milliseconds, because otherwise we would end up with too much churn. And finally, if there is nothing in the calendar queue, then if there's anything in the idle queue we run that. Just for accounting purposes we always have an idle thread per CPU, sitting at the very bottom of the priorities, which is there so that we don't have to have a special case in the code for "what if there's nothing in any of these queues?" You may notice when you do a ps that each CPU has an idle thread that has accumulated some huge amount of time; that's just soaking up any time when there was nothing for that particular core to be doing.
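The calendar-queue mechanics just described can be sketched as a couple of small helpers. This is a toy model of what the talk describes (NQ slots, an insq that creeps forward every 10 ms and stays at least one slot ahead of runq), not the scheduler's real code:

    #define NQ 64                    /* assumed number of calendar slots */

    struct calq {
        unsigned runq;               /* slot currently being drained */
        unsigned insq;               /* slot where expired threads are reinserted */
        /* each slot would hold a list of threads; omitted in this sketch */
    };

    /*
     * When a batch thread uses up its time slice, reinsert it prio_delta
     * slots ahead of insq, where prio_delta is its priority relative to the
     * best batch priority (0 == best).  Lower-priority threads land further
     * in the future and so wait longer for their next turn.
     */
    static unsigned
    calq_insert_slot(const struct calq *cq, unsigned prio_delta)
    {
        return (cq->insq + prio_delta) % NQ;
    }

    /* Called when the slot at runq has been drained: move to the next slot,
     * keeping insq at least one slot ahead of runq. */
    static void
    calq_advance_runq(struct calq *cq)
    {
        cq->runq = (cq->runq + 1) % NQ;
        if (cq->insq == cq->runq)
            cq->insq = (cq->insq + 1) % NQ;
    }

    /* Called every 10 ms: the insertion point creeps forward on its own,
     * pushing lower-priority work further into the future on a busy system. */
    static void
    calq_tick(struct calq *cq)
    {
        cq->insq = (cq->insq + 1) % NQ;
    }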
Now let's drill down and look at the driver of this low-level scheduling: how does it actually work? First of all, how do we choose to do a context switch? For the priority-based queues, we run all the threads in the top queue. We give each of them 10 milliseconds, but we want to make sure that every thread gets to run within 50 milliseconds, so a queue that has 10 threads on it will use a 5-millisecond time slice. We do have a lower bound on that, so if the result would be less than five milliseconds we just give it five milliseconds so that it can spread out a little bit, but generally speaking we don't hit that limit. The idea is that we don't want to churn too quickly, because there's a cost to doing the context switch: reloading caches and so on. For the calendar-based queues, we run all the threads in the current slot until it's empty, and we give every one of them a time slice of the same duration as above, the same business of 10 milliseconds but 50 milliseconds maximum for any given queue, and so on. And when a thread other than the currently running thread attains a better priority, we switch immediately. So, for example, if we're running something from the calendar queue but an interrupt comes in on the top priority queue, we immediately switch over to that interrupt thread. Actually, it's not quite immediate, as we'll see in a moment, but as soon as is practical we switch over and run it.

Scheduling a context switch: what actually causes these tens or hundreds of thousands of context switches per second to happen? The most common case is voluntarily going to sleep. Some process makes a system call that can't proceed: for example, you read from the keyboard and the user hasn't typed anything, so you go to sleep until the next keystroke is typed; or you're waiting for a packet to come in from the network, so you go to sleep until the next packet arrives; or you have issued a disk I/O and you're waiting for the data to come back from the disk. This is the most common case, and it's called a voluntary context switch: you are saying, "I don't need the CPU anymore, run something else." Another fairly common case is when a higher-priority thread becomes runnable, most commonly from interrupt level: a network packet arrives, or a timer goes off, or a disk I/O completes, and this causes the currently running lower-priority thread to stop and the higher-priority thread to run. And then we have once per time slice: the 10-millisecond timer I talked about goes off and says, okay, you've used up your time slice and now it's time for something else to run.

Now, we said that when a higher-priority thread comes in we want to switch to it. Often on a multi-core system there will be an idle CPU somewhere, so the work can just be handed to that idle CPU and run there, and it doesn't need to interrupt the particular thread running on the CPU that noticed the interrupt, because interrupts are generally targeted at a particular CPU or set of CPUs. But suppose we do want to stop running the lower-priority thread on this CPU so that the interrupt can be handled here; for example, sometimes we pin interrupts and say these interrupts have to run on this CPU, so if there's a lower-priority thread running we want to clear it out of the way. We don't necessarily want to stop that other thread exactly where it is: it might be in the middle of a critical section, and it certainly could be holding some longer-term locks, and we don't want to put it to sleep while it's holding those locks, because that just causes excess resource contention. So instead we set a flag in its thread flags saying "I need to be rescheduled," and that request is processed, for an interrupt thread, after the interrupt has returned, or otherwise at the end of the current system call. As we're getting ready to return from the kernel back to the user process, we notice that the reschedule flag has been set, and at that point we know that any locks that were held, at least any short-term locks, have been released, so it is now safe to context switch off to the higher-priority thread.
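That "leave a note and switch on the way out" pattern looks roughly like the sketch below. FreeBSD really does record the request as a flag on the thread (TDF_NEEDRESCHED); the surrounding helpers and names here are simplified assumptions for illustration:

    #define TDF_NEEDRESCHED 0x0001          /* a reschedule has been requested */

    struct thread {
        int td_flags;
    };

    /* Called from interrupt context when a higher-priority thread becomes
     * runnable: do not switch here, just leave a note for later. */
    static void
    sched_setpreempt(struct thread *td)
    {
        td->td_flags |= TDF_NEEDRESCHED;
    }

    /* Called on the way back out to user space (end of a system call, or
     * after an interrupt has returned), once short-term locks are released. */
    static void
    userret_check_resched(struct thread *td, void (*do_switch)(void))
    {
        if (td->td_flags & TDF_NEEDRESCHED) {
            td->td_flags &= ~TDF_NEEDRESCHED;
            do_switch();                    /* now it is safe to context switch */
        }
    }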
Now, if the rescheduling is involuntary, that is, we're not switching because the previous thread went to sleep but because a higher-priority thread wants to run, then the thread we're switching away from still wants to run, so we need to put it back onto the run queue. Remember what happens: we take a thread off the run queue, we run it, and if it goes to sleep we're done and we just go find the next one to run; but if it wants to continue running, because it used up its time slice or is being preempted, then we need to put it back onto the appropriate run queue so that at some point in the future it can continue.

How do we actually context switch? Under kernel control we switch from one thread to another, and there are essentially three steps. We save the context of the current thread, the registers, address space, and so on; what you need to save is defined by the hardware. We then choose the next thread that we want to run, and having chosen it, we load its context, so whatever registers and address space it had are reloaded. Remember that I said this low-level switching runs a lot: typically anywhere from one to as much as five percent of all the cycles on the machine are spent doing this, so you really, really want it to be fast. Of the three parts, the context save and load are defined by the hardware, and there's nothing you can do to make them any faster, so the only place you have any room to maneuver is picking the next thread. That's why we want the low-level scheduler to be so simple: find the bit, find the queue you want, pull that thing off the queue, load it. Typically in FreeBSD this can be done in about 20 assembly-language instructions. That's why we split the scheduler into high-level and low-level parts: so that we can make the low-level scheduler very quick, which then allows us to take much more elaborate steps in the high-level part, because it runs much less frequently, so there it's okay to spend some time really sitting down and thinking about what we want to do.

The next thing I want to look at is what ULE is trying to do, and now we're talking about the high-level part of the ULE scheduler. Our goals are to identify and give low latency to interactive threads, and to allow for brief bursts of activity. To make this work we have to differentiate the time spent waiting for the CPU from the time spent waiting for user input. If the system is really busy, you may not be running simply because you haven't been able to be scheduled yet, but we don't want to give you credit for that time waiting, because you're in there with everything else that's trying to run. What you get credit for is the amount of time that you actually do not want to run, that is, you've gone to sleep and you haven't yet taken some action that says "now I want to run again."
Another thing we would like to do, in general, is to have a given thread keep running on the same CPU over and over again rather than bouncing around from one to another. This is called processor affinity, and the reason is that you have state on a processor: the processor has memory caches and VM caches and other things, and to the extent that we can run you fairly soon on the same CPU, a lot of that state in the L1 and L2 caches will still be there, whereas if we move you to a different CPU you will have to pull all of that back into the L1 and L2 caches of the CPU we've moved you to. So generally speaking we want to keep you on the same CPU, and the fact that each CPU has its own set of scheduling queues naturally gives us that, because the only way you're going to move from one to another is if, at a higher level, we decide that your CPU is just too busy and shift you from its queue to some other CPU's queue.

When we do decide to move you, we need to think about where to move you to, and this is where NUMA comes in: different CPUs have different characteristics relative to the one you're on. As I've said, the same CPU is the ideal case, but if there are multiple cores on the same chip, they will often share at least some of their caches, so moving you from one core on the chip to another core on the same chip is not as bad as, for example, moving you to another chip on the same board. If it's a dual-socket board and we move you from a CPU in one socket to the other socket, there's no shared caching between the CPUs themselves, though there is possibly some benefit in the memory plane, the memory that's on that particular board, versus having to go to the memory on a different board; but it's still further away than something within the same chip. And then of course we may have multiple boards in the chassis, and we may have multiple chassis, so again, staying on a CPU in our own chassis is likely to be better than going somewhere else.

The other goal of ULE is to avoid anything that starts with "for every CPU" or "for every thread": we never want to have to look at every thread in order to make decisions. We do occasionally look at all or most of the CPUs, but again, that's done on the order of once a second, and we amortize the cost of doing it over the fact that it doesn't happen very often.
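User programs can also impose this kind of placement themselves: FreeBSD exposes hard affinity (discussed again in a moment) through cpuset_setaffinity(2). A minimal sketch that pins the calling thread to one CPU, with the CPU number chosen arbitrarily:

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    int
    main(void)
    {
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);      /* CPU 2 chosen arbitrarily for illustration */

        /* Restrict the calling thread (id -1 == "current") to that one CPU. */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) == -1)
            err(1, "cpuset_setaffinity");

        return 0;
    }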
So let's start with how we differentiate interactive from batch. We have a bunch of scheduling variables. The nice value is the one the user gets to set, in a range of -20 to +20, with a default of zero. Moving to positive nice values means "give me lower priority," and moving negative says "give me higher priority"; typically a user is only allowed to move to higher, positive values of nice, and only the administrator is allowed to go into the negative range. The next two variables, the run time and the sleep ticks, track recent CPU utilization and recent voluntary sleep time. As you run, we have a statistical clock that goes off, and each time a tick comes in, whatever thread is currently running on that CPU gets charged with a run-time tick. Similarly, while you're not running you accumulate sleep time; we don't do that with a clock, we just note the time you went to sleep and the time you wake up, and that interval is what gets added to your sleep ticks. We then have a priority, which is the current priority you're running at, and a user priority, which is the priority when you're running at user level. Normally these two are the same value, but when you enter the top half of the kernel to do a system call your priority may be temporarily boosted, because we want to urge you to get through the system call and back out into user space so that you aren't contending as much for kernel resources.

The way we make these numbers reflect recent CPU utilization and recent voluntary sleep time is by decaying them: we decay the run time and the sleep ticks whenever their sum exceeds five seconds. When the two of them add up to a total of five seconds, we shave 20 percent off each of them, so we take the run time and shave off 20 percent and take the sleep time and shave off 20 percent, and their sum becomes four seconds. We then recompute the priority based on these values whenever the thread either accumulates a tick or is awakened. Our decision on whether a thread is batch or interactive is simply whether the sleep ticks exceed the run time: if you're sleeping more than you're running, you're considered interactive, and if you're running more than you're sleeping, you're considered batch.
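A small sketch of that bookkeeping, following the talk's description rather than the exact formula the kernel uses; the names and the assumed tick rate are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define HZ                 1000          /* assumed tick rate for this sketch */
    #define DECAY_LIMIT_TICKS  (5 * HZ)      /* decay once run + sleep exceeds ~5 s */

    /* Per-thread bookkeeping as described in the talk; names are illustrative. */
    struct sched_stats {
        uint64_t run_ticks;                  /* recent time spent running */
        uint64_t sleep_ticks;                /* recent time spent voluntarily asleep */
    };

    /* Called when a thread is charged a tick or is awakened. */
    static void
    interact_update(struct sched_stats *ss)
    {
        /* Once the window exceeds five seconds, shave 20% off both numbers
         * so that only recent behaviour matters. */
        if (ss->run_ticks + ss->sleep_ticks > DECAY_LIMIT_TICKS) {
            ss->run_ticks   -= ss->run_ticks / 5;
            ss->sleep_ticks -= ss->sleep_ticks / 5;
        }
    }

    /* Interactive if the thread has recently slept more than it has run. */
    static bool
    is_interactive(const struct sched_stats *ss)
    {
        return ss->sleep_ticks > ss->run_ticks;
    }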
All right, so how do we decide which CPU we're going to run on? Remember, we get taken off the run queue when we go to sleep, so when we wake up we need to decide where to go, and we follow pretty much the hierarchy I've already described. First of all, threads can have a hard affinity: either the application or the kernel can say this thread must run on this CPU, or, through the user interface, on one of a particular set of CPUs; you might say, for example, that it can run on any one of the four CPUs that are on this chip. If you have an affinity to a single CPU, it's easy: we just put you on that queue and we're done. Interrupt threads being scheduled by the hardware interrupt handlers are scheduled on their current CPU, the one that handled the interrupt in the first place, if their priority is high enough to run immediately. If it's not high enough to run immediately there, because something at a higher priority is already running, then we look first at the last CPU on which that thread ran and then walk down the hierarchy: first the CPU it last ran on, otherwise one on the same chip, otherwise one on the same board, otherwise one in the same chassis, and so on. In the worst case we search the entire system for the least-loaded CPU running a lower-priority thread. If the search ends up offering a better CPU choice than the last CPU on which the thread ran, we move it, and the longer it has been sleeping, the more willing we are to move it. The longer you've been sleeping, the less of your state remains in the caches, because whatever else ran in the meantime loaded the caches with its own data. If you ran there within the last millisecond, you probably still have stuff in the caches; after five or ten milliseconds the chance that there's much useful left is pretty low, so moving you to some other CPU doesn't throw away nearly as much, because there probably isn't much state left on the CPU you ran on anyway.

The last thing is that we have to periodically rebalance threads between the CPUs. When a CPU idles, if it has nothing to do, it looks around at other CPUs to see whether there's one from which it can steal work. This can be a bad thing to do: you see work on some other CPU and you grab it, but if that was the only thing that CPU had to do, all you've done is take it away, it has now lost its caching, and that CPU just has to go find something else to do. So generally speaking we won't ever steal work from a CPU that has only one thing to do; we just leave it there. But if a CPU has several things queued, we will potentially take work from it. Conversely, if a job gets added to a CPU that has an excessive load, that CPU will look for other CPUs to which it can push work. For example, I might have a thread with an affinity that lets it run only on a particular CPU, and when it goes to run, that CPU is already loaded with a whole lot of other work; the CPU will say, look, I realize I have to run this thread, so let me shed one of these others off my run queue and send it somewhere else, so that I won't be so busy and can better serve the thread that has to run here. And then, approximately once per second, the ULE load balancer runs: it looks across all the CPUs, finds the busiest one, and, assuming there's more than one thing on that CPU, takes one of its threads, finds the least busy CPU, and puts it there. So over time, if things get out of balance, this acts as a backup that slowly migrates things around. The question is, why wait one second; why not make it half a second or two seconds? The answer is that if you get too frenetic about moving things around, you're really just stirring the pot to no great effect; on the other hand, if you wait too long, a particular CPU can get very busy and it takes a long time to get it offloaded onto other CPUs. What we've found empirically is that doing this about once a second gives you the right balance between moving things around too much and still being able to get things moved when it makes sense to do so.
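The once-per-second balancer reduces to a very small loop. This sketch only models queue lengths and moves a single thread's worth of load; the CPU count and structure names are assumptions for illustration:

    #define NCPU 8                    /* assumed number of CPUs for this sketch */

    /* Toy per-CPU load: just the number of runnable threads queued there. */
    struct cpu_queue {
        int nrunnable;
    };

    /*
     * Once-per-second rebalance, as described above: find the busiest CPU
     * and, provided it has more than one runnable thread, move one thread
     * to the least busy CPU.  (The actual thread migration is elided.)
     */
    static void
    balance_once(struct cpu_queue cpus[NCPU])
    {
        int busiest = 0, idlest = 0;

        for (int i = 1; i < NCPU; i++) {
            if (cpus[i].nrunnable > cpus[busiest].nrunnable)
                busiest = i;
            if (cpus[i].nrunnable < cpus[idlest].nrunnable)
                idlest = i;
        }

        /* Never strip a CPU down to nothing just to feed another one. */
        if (busiest != idlest && cpus[busiest].nrunnable > 1) {
            cpus[busiest].nrunnable--;
            cpus[idlest].nrunnable++;   /* stands in for migrating a thread */
        }
    }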
I'm going to finish up my talk today by looking at a very interesting paper that was done a couple of years ago by a group at the Swiss Federal Institute of Technology in Lausanne. They wanted to compare the FreeBSD ULE scheduler and the Linux Completely Fair Scheduler; at the time, they took the ULE scheduler from FreeBSD 11.1 and compared it against the Linux scheduler in kernel 4.9. Now, how are you going to do that? You can't just run benchmarks on the two operating systems, because there are so many other differences that would affect the results. They came up with what I thought was a very interesting approach: they extracted ULE out of FreeBSD and replaced CFS in Linux with the ULE scheduler. They could of course have done it the other way, taking the Linux scheduler and moving it over to FreeBSD, but as they point out in the paper, the ULE scheduler is about 2,950 lines of code and CFS is about 17,900, so it was just a lot easier to pull ULE out of FreeBSD and drop it into Linux. Also, a lot of people know Linux, so it's easier for them to understand which issues are due to other subsystems of Linux rather than to the scheduler.

There are a number of common characteristics between the two schedulers, which made it less difficult, shall we say, to do the transplant: both have per-CPU run queues; in both, the low-level scheduler context switches only from the local run queue; at wake-up, both select the CPU queue on which the thread should be run; both steal threads when a CPU goes idle; and both periodically perform load balancing. Because of all these common characteristics it was actually not all that difficult to move it over; the paper, for which I'll give a reference at the end of the talk, discusses what was involved, and it was surprisingly easy to do.

I've talked about the ULE scheduler, so let me give you a thumbnail sketch of the CFS scheduling strategy. The Completely Fair Scheduler, as its name implies, wants to be completely fair between all the threads, so all the threads are run round-robin; there's none of this business of batch versus interactive. In fact, in some of the tests they did for the paper, where they created enough interactive load to keep the system completely busy, they could totally lock out the batch processes for hundreds of seconds. In practice that's something of an anomaly; they had to work pretty hard to make it happen, and you don't tend to see it in most systems. The threads run through something that looks like a calendar, going through all of the processes and letting each one run; the way priority is expressed is that higher-priority threads get longer quantums, so if you're a low-priority thread, when your turn comes up you get a shorter quantum than a higher-priority thread, which gets to run longer. And to give some boost to interactive work, if you've been sleeping for a long time you get dropped in near the front, so you will run soon. So there is some extra boost given to things that run for short periods and sleep a lot.

An additional feature that CFS has that is not present in ULE is that it collects threads into hierarchical groups called cgroups. The idea here is that if you have two processes, one of which is single-threaded and the other of which has ten threads running in it, then under ULE all threads are treated independently of each other, so the process with ten times as many threads gets a lot more of the CPU. With cgroups, CFS lets you say that all the threads within a particular cgroup have to share that group's quantum.
So the single-threaded process and the ten-threaded process will get roughly equal amounts of CPU overall, which means the individual threads in the ten-threaded one get much smaller quantums. These groups are hierarchical: you can put all the threads within an application into one cgroup, and then put all of the processes belonging to a particular login session into another cgroup; all the threads within a cgroup share its quantum, so one person can't log in and hog a whole lot of the system. CFS also has a bunch of machinery for dealing with NUMA that is somewhat different from the way ULE handles NUMA.

So, without further ado, how did it work out? This is figure 8 from the paper. Across the bottom is a whole slew of benchmarks that they ran, all over the map, different types of benchmarks, and the graph has a line running down the middle: where a bar goes above the line, that's an instance where ULE is doing better than CFS, and where it goes below, that's an instance where CFS is doing better than ULE. The thing to notice is that most of them are more or less the same; it doesn't make much difference. But there are a few with major differences: there are at least nine or ten that go up and about three that go down, so there are many more cases where ULE significantly outpaces CFS than cases where CFS does better. The takeaways are that most benchmarks run about the same; the outliers are actually explained in the paper, and they're fascinating to read, which for me was the most interesting part; and overall the two schedulers are pretty much the same, with ULE finishing benchmarks faster than CFS by about 1.5 percent when running on a single CPU and about 2.75 percent on a multi-core system.

The takeaway for me is that Linux needs six times as many lines of code to do more or less the same thing that ULE does in one-sixth the number of lines. The way this has happened, and you see it in a lot of the subsystems of Linux, is that there is a lot of code looking for special cases and targeting them: "oh, I'm running a database application and it needs this kind of scheduling, or this kind of I/O processing, or this kind of buffering," and so it latches onto that, and in fact it makes that benchmark run considerably better. But what sometimes ends up happening is that it latches onto something it thinks is one of those cases and it isn't, and that pessimizes how the application runs because it is being mischaracterized; when you read the details, you can see that's what's happening on some of these outliers.
So with that I will go on to questions, which normally I would take from the floor, except that this time they will have to come in over the wire. Let me just make a few comments first. At the bottom of the slides you'll see references to pages in the book, particularly in the range of 114 to 126; that is The Design and Implementation of the FreeBSD Operating System, second edition, and there is a lot more detail there than I could cover in 45 minutes. The second reference is the comparison paper, "The Battle of the Schedulers," which appeared in the 2018 USENIX Annual Technical Conference. If you go to the usenix.org website, they have open publication, so from the day a conference runs all the papers are there, the abstracts are there, the slides from the talks are there, the audio is there, and the questions from the audience are there; it's absolutely great. Go to the 2018 Annual Technical Conference, go into the program, pick out "The Battle of the Schedulers," and it's all there to read and listen to.

If you want to contact me, my email address and my website are on this slide. On my website I have lots of other things: there is a FreeBSD internals course, which covers the whole book with a lecture per chapter; for those who really like C code, I have my advanced course, which is 40 hours of reading the kernel code; there is a set of "networking from the bottom up" lectures done by George Neville-Neil, a series of five lectures that drills down into the internals of the networking code; I have the CSRG archive, which is the archive of everything that happened at Berkeley up until it spun off the open-source versions that became FreeBSD, NetBSD, OpenBSD, and DragonFly BSD; and finally a four-hour history video, best enjoyed with a glass of wine in hand, covering the history of the CSRG at Berkeley and then a view of the last 25 years of the FreeBSD project. So thank you very much, and I look forward to taking the questions that will hopefully come streaming in.
Info
Channel: BSDCan
Views: 1,138
Rating: 5 out of 5
Keywords: BSDCan, BSDCan2020
Id: V6AxdJ-jdUg
Length: 56min 9sec (3369 seconds)
Published: Thu Aug 13 2020