LISA21 - 5 Years of Cgroup v2: The Future of Linux Resource Control

Captions
Hey there, good afternoon, good morning, good evening, wherever you are in the world. My name is Chris Down. I'm an engineer at Facebook on the kernel team, mostly working on Linux kernel memory management, and especially a feature called cgroups, which, as you may have guessed from the title, is what we're going to be talking about today. I'm also a maintainer of the systemd project, mostly working on resource control and other things related to cgroups and memory management.

Most of the time, I guess, my job at Facebook is thinking about how we can make Linux more reliable and more scalable. We have a million-plus machines now, and we can't just buy more RAM; we can't magically increase capacity by adding it. We need to extract the absolute maximum out of every single machine in a safe way, and that's the kind of topic I want to talk about today: how, over the last five years, we have increased Facebook's reliability and capacity using the primitives introduced in cgroup v2, and how you can apply that to your workloads and your organization in a way that makes sense for you.

I think many of us at leading-edge companies face this trade-off of capacity, performance, and reliability, and it's a really uneasy one. At Facebook, for example, we have good problems: our user base is increasing and our product range is diversifying, but with that growth come scaling problems of a nature we've not really had to deal with before, and we're really feeling a crunch for capacity. We simply can't solve these scaling problems by throwing hardware at them. We can't construct data centers fast enough, we run on clean power and we can't necessarily get clean power easily everywhere in the world, and with a million-plus machines we just cannot afford to waste capacity, because any small loss is a huge absolute loss across the fleet. Ultimately, the point is that we need to use resources more efficiently, and we need to build the kernel infrastructure that allows us to do that at scale.

Another challenge is that many huge site incidents are actually caused by lacking resource control: not being able to readily control things like CPU, memory usage, and IO has caused some of the most pervasive issues and outages for large companies like us, and we need to sustain and support an industry-wide initiative to actually solve this problem once and for all.

So how does all of that talk about capacity and reliability manifest in this cgroups thing this talk is ostensibly about? Well, cgroups are a kernel mechanism to balance and control things like memory, CPU, and IO. If you've operated containers before, you've probably seen the name floating about: every single modern container runtime, Docker, CoreOS, Kubernetes, systemd, you name it, uses cgroups under the hood; it doesn't do all that stuff by itself. There's a reason for that, which is that cgroups solve a lot of long-standing problems with classic Unix resource control, and there's no reason for each of them to reinvent the wheel.

Cgroups have existed for about 14 years now, and they've changed a lot in that time. Most notably, five years ago or so we released cgroup v2 in kernel 4.5, and around the time we released it I gave a whole talk on why we were moving to cgroup v2, why we weren't just making incremental improvements to v1.
If you're interested in a really in-depth discussion, that talk is still a really good resource; I think it represents the thinking we had at the time very well. For now, though, let's touch on the most fundamental changes that made cgroup v2 a step change in the paradigm rather than just an incremental improvement on what we already had.

The first major change is how we organize cgroups themselves, and this is more than just an aesthetic change: it allows us to do things we've never really been able to do before. Fundamentally, it's the reason we had to move to cgroup v2 instead of making a cgroup v1.1 or whatever, so it's worth going into in some detail so you can understand why we did it.

If you've had some interaction with cgroups over the past few years, you have probably been using cgroup v1 until quite recently. Version 2 has been working great, it's a game changer for resource control, but most distributions were fairly worried that some applications, Docker for example, didn't support cgroup v2 yet. Now, though, pretty much all of them do, and we'll go over a list at the end. Fedora in fact moved to cgroup v2 by default since Fedora 32, so it's clearly operating well at scale, and that was a pretty nice poke to those who still hadn't implemented cgroup v2 support in their containerization engine, even after several years, to get it there. I really appreciate Fedora doing that.

In cgroup v1, /sys/fs/cgroup contains controllers as directories at the top level, so you see resources like cpu, memory, and pids as directories inside /sys/fs/cgroup, and inside these directories are hierarchies for each resource. For example, inside the pids hierarchy you see files which manage the maximum number of processes in a cgroup, and inside there we have a hierarchy of directories: you can build up some cgroup hierarchy, and each directory contains these files for PID management. If you don't know whether you're using cgroup v1 or v2, go to /sys/fs/cgroup: if you see directories like cpu, memory, and blkio, it's a dead giveaway that you are still on the legacy v1, which is basically frozen in time and which we're not really supporting any more.

So here is a concrete look at how this all looks in cgroup v1. The cgroup filesystem is typically mounted at /sys/fs/cgroup, inside you have a directory per resource, like memory or cpu, and each resource maintains its own distinct hierarchy. You can have a single PID, a single application, in cgroup A for one resource but in cgroup B for another, and this flexibility of having different hierarchies for every resource and putting PIDs in different cgroups per resource has some pretty negative technical effects, which I'll come back to in a second. The fundamental paradigm of cgroup v1 is that cgroups only exist in the unique context of a resource; there's no one cgroup for a task on the machine, and it's totally possible, as I mentioned, for a PID to be in group A for block IO or CPU but in group B for memory.
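Going back to the dead giveaway for a second, here is a minimal sketch of checking which version a machine is on, assuming the standard /sys/fs/cgroup mount point (your distribution may differ):

    # On cgroup v2, the unified mount exposes a cgroup.controllers file at the
    # top level; on v1 the top level is per-resource directories instead.
    import os

    CGROUP_ROOT = "/sys/fs/cgroup"

    def cgroup_version(root=CGROUP_ROOT):
        if os.path.exists(os.path.join(root, "cgroup.controllers")):
            return "v2 (unified hierarchy)"
        # Per-resource directories like cpu/, memory/, blkio/ at the top level
        # are the dead giveaway that the box is still on legacy v1.
        legacy = [d for d in ("cpu", "memory", "blkio", "pids")
                  if os.path.isdir(os.path.join(root, d))]
        if legacy:
            return "v1 (legacy, per-resource hierarchies: " + ", ".join(legacy) + ")"
        return "unknown"

    print(cgroup_version())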
By contrast, in cgroup v2 we don't see the resource directories at /sys/fs/cgroup any more; at the top level we see directories for the cgroups themselves, not for resources. So how does the cgroup hierarchy know which resource it applies to? The answer is that the way this works is essentially inverted now: cgroups are not created in the context of a particular resource, but resources are enabled or disabled for a particular cgroup, and we have one hierarchy, the unified hierarchy, which is another name for cgroup v2, that rules them all. This means you explicitly opt into, for example, the CPU controller for a particular part of the cgroup hierarchy, and once you've opted in, we give you the files which allow you to control that particular resource.

So in cgroup v2 we now have this unified hierarchy, which means that cgroups are global: a PID is in exactly one cgroup on the entire machine, only one, and it shares that cgroup between all resources, memory, IO, CPU, and so on. Instead of having a cgroup per resource, you now have resources per cgroup. The idea is that you, or your container runtime or init system, opt into some subset of controllers using the cgroup.subtree_control file: you might write +memory or +cpu, and you get the files related to controlling that particular resource. This might seem like a purely aesthetic change, going from a cgroup directory in each per-resource hierarchy to one where you enable resources down the hierarchy as you want, but it has quite significant technical advantages.
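To make the opt-in model concrete, here is a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup, that you have privileges to write there, and that the example.slice name is made up for illustration:

    import os

    ROOT = "/sys/fs/cgroup"
    child = os.path.join(ROOT, "example.slice")

    # Creating a directory creates a cgroup; no resource files exist in it yet.
    os.makedirs(child, exist_ok=True)

    # Opt the children of the root cgroup into the memory and io controllers.
    # Only after this do memory.* and io.* interface files appear in the child.
    with open(os.path.join(ROOT, "cgroup.subtree_control"), "w") as f:
        f.write("+memory +io")

    # Now the child carries files like memory.max, memory.low, io.max, ...
    print(sorted(os.listdir(child)))

On a systemd-managed machine these controllers are usually already enabled for the top-level slices, so the write above is typically a no-op there; the point is just that the resource files appear only where a controller has been delegated.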
The most major of those advantages, and again the reason we couldn't just make a cgroup v1.1, comes from asking: when you write to a file in Linux, what happens? It seems like a softball question, but it's actually somewhat complicated. You don't really write to the disk; for performance, you write into the page cache and you create a dirty page, a page which must later be written back to disk. At this point your write syscall has probably returned to your application with success, and your application can continue going about whatever it wants to do, but the process isn't really over: eventually that dirty page does need to be written back to the disk. But when does it get written back, and who writes it back? The dirty pages here were made for a particular application, your application called write, but the flush to disk will be performed on its behalf by a kernel worker, a kworker, and this can happen an indefinite amount of time after the write syscall returned. In cgroup v1, these writebacks were charged to the root cgroup, essentially the "I don't really know who's responsible for this" group, which is essentially limitless. For most workloads these page cache writebacks are the vast majority of IO being performed, and they can also account for a huge amount of memory on some workloads, so not accounting for them means your accounting is missing a huge amount of information, a huge amount of stuff you would actually like to limit or throttle, and it's impossible to track or control.

Having the new unified hierarchy in v2 is what makes tracking these multi-resource actions possible. In cgroup v1 you could be in cgroup A for IO but cgroup B for memory, as I mentioned a couple of slides ago, but since the cgroup is the unit at which we charge, much like a page is the minimum unit of memory, a cgroup is the minimum unit for a charge, we can't make any assessment in cgroup v1 about the relationship of cgroup A in one resource to cgroup B in another. For these kinds of multi-domain actions, where we write a page into memory and then wait for the IO to happen, spanning multiple resources, we just drop the charges on the floor in version one.

In cgroup v2 we're actually able to track these and map the request back to the originating cgroup, so we can count them towards cgroup limits and take the appropriate action. For example, we can understand the relationship between IO and memory for a writeback: we can not only hold back some of these allocations if they are too costly, but we can also perform reasonable actions based on IO or memory pressure inside the cgroup, in the context of a single cgroup, which we couldn't do in v1. We can also do things like charge the cost of network activity to a cgroup after the fact. That might sound trivial, but it's actually quite complicated, because networking has a weird paradigm: network and CPU are two different kinds of resources, so it's another example of a multi-resource action, which is something we only really support in v2. Networking also has an even more insidious property, which is that when packets come into your card you don't really know who they're going to yet; they need to be routed, and the kernel needs to determine which application is going to get those packets, which application gets the data. So you don't really know who's responsible at the time that most of the CPU is spent, but with cgroup v2 we can go back and charge the resources spent at the point where we find out who the packets are for, and that allows much better monitoring and resource control, because that can be a really non-trivial amount of CPU for some workloads.

Linux 4.5, which is when we made cgroup v2 a public-facing API that anyone can use, was released over five years ago, so at this point you might be asking yourself: if this was all merged five years ago, what on earth has been happening since then, and why is he coming to talk to us about it now? In this talk my hope is to show you how the best intentions can often result in not-ideal outcomes, and our path to take the extremely complex realities of kernel resource management and present them in a way people can actually understand and use at scale. What's being presented here is really the state of the art in resource control. My ultimate hope is that by the end of this talk you'll have a better idea of why controlling memory or CPU or IO is so complicated, and hopefully you'll come away with a few ideas about how you might want to apply this to your own organization, workloads, and services in a way that makes sense for them.

Back in 2016 when we released it, the only real user of cgroup v2 was us. It took a lot of work and overcoming obstacles to reach where we are now, where Google and lots of other companies are using it or seeking to use it. We had a bunch of cool new primitives back in 2016 which we wanted to use, but when we actually went to use them it ended up a bit like forcing a square peg into a round hole: the primitives weren't entirely wrong, they were still on the right track, but we needed to work out in which cases we should make the operating system a little more square, or our primitives a little more round, to achieve the goal we actually wanted. Knowing which of those makes sense in any given situation requires spending a lot of time in production, gaining experience through a significant amount of testing and experimentation, and working with others to see what they're seeing on their workloads.
In the end we've really had to design new tunables for memory, CPU, and IO from the ground up to achieve what we want compared to 2016, and a lot of the paradigms we take for granted inside Facebook and other large companies now simply didn't exist five years ago in Linux.

So what does it mean to reinvent how we think about resource control from the ground up? One thing which is critical to understanding how Linux manages memory is that it has different types of memory. From the CPU's perspective there's really not much difference, it has some idea about permissions and things like that, but it doesn't treat them as semantically different, whereas Linux does. For example, anonymous memory is, as the name implies, not backed by any backing store: memory explicitly allocated by the program with malloc or similar is typically anonymous. Most people also know about caches and buffers, two sides of the same coin; they've been part of the unified page cache for many years now, so I'll just call them file cache from now on. If you ask most Linux admins, they will say that page cache and buffers are reclaimable, which is not wrong, but a lot of Linux admins have a misconception about what reclaimable really means. It doesn't mean we will definitely be able to reclaim it; it just means we might be able to reclaim it at some point, if you happen to ask nicely and we're in a good mood at that particular moment. For example, if some application is absolutely hammering some file, it's very unlikely we would choose to drop it from cache, because performance would get driven completely into the ground. So while caches can under some circumstances be trivially freed, that doesn't always have to be the case, and it's not necessarily the case for the ones you're talking about. This causes confusion when people ask "why did the OOM killer come along when I still had this free memory?", and we'll come back to some of these foibles around what people consider free but isn't really free in a little bit.

The fact that these caches and buffers can be essential is also an example of why RSS, a metric we love to measure as an industry, is really kind of misleading. RSS unduly skews a lot of attention towards a very small, select number of memory types: anonymous memory and mostly mapped file memory. We forget that many workloads simply cannot sustain any performance, or operate at all, without extensive file and buffer caches or other system-wide shared memory. The reason we measure RSS as an industry is not because it means anything really useful; it's because it's really easy to measure and really stable, and that is a really poor reason to measure something. So when somebody asks how much memory your application actually uses, the only sensible answer, unless you've compressed it until performance degrades, and we'll go over how you can do that a little later in the talk, is that you really have absolutely no idea how much memory your application uses. As an example, in one case inside Facebook we had a team that for years believed their memory footprint was roughly 100 to 150 megabytes across the fleet, and using the new metrics we'll go into in this talk, they discovered it was something more like two gigabytes. That's a huge absolute difference, a huge thing to not know for so many years, and it probably explains some of the things we've seen at scale.
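To see what the fuller picture looks like beyond RSS, here is a minimal sketch that reads memory.current and the memory.stat breakdown for a v2 cgroup; the service path is hypothetical, point it at one of your own:

    # memory.current is the total charged to the cgroup; memory.stat breaks it
    # down into anonymous memory, file cache, kernel objects, socket memory, etc.
    CG = "/sys/fs/cgroup/system.slice/example.service"

    def read_kv(path):
        with open(path) as f:
            return {k: int(v) for k, v in (line.split() for line in f)}

    stat = read_kv(CG + "/memory.stat")
    current = int(open(CG + "/memory.current").read())

    print("total charged : %8.1f MiB" % (current / 2**20))
    for key in ("anon", "file", "kernel_stack", "slab", "sock"):
        if key in stat:
            print("%-14s: %8.1f MiB" % (key, stat[key] / 2**20))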
This is one of the reasons why in modern resource control we typically limit all types of memory together, instead of just trying to limit anonymous memory or RSS or whatever: if you limit just one type of memory, say anonymous memory, and ignore the page cache, you trivially allow one application to kill the performance of all the others, because again, that memory is not free. This is why in cgroup v2 we limit and account for all types of memory together. We still have a memory.max file, but it's not based on RSS: it really limits based on everything you do inside the cgroup, everything you allocate, and that's a step change from the era of per-process limits, or cgroup v1, which only limits a subset of memory types. In cgroup v1 there are something like fifty thousand different kinds of memory limits you can set, but the problem is you can't set an absolute memory limit that bounds the cgroup.

On the surface this memory.max stuff looks like it should work reasonably well, and it does: it does exactly what it says on the tin. There's nothing wrong with it in terms of what it declared it was going to do; it sets a maximum amount of memory which is allowed to be consumed in the cgroup, and it does that fine. The problem is how you actually use that to compose a reasonable and reliable system. Let's say you have a couple of slices, a couple of different groupings of applications. You have your best-effort applications in a slice called best-effort.slice; that makes sense, and you might put things there like configuration management and metric collection, things you'd really like to have running on your machine, but if the workload is being destroyed then yes, we can live without them for a bit. The workload slice, on the other hand, consists of the applications you really do want to serve on this machine, for example HHVM on a web server or MySQL on a database server. You can have multiple workloads on a machine, of course, and we'll come back to that in a bit, but for now let's stick with one for simplicity. If this thing is not running, the machine is useless; that's what should be in the workload slice.

The typical system administrator at this point, being responsible, might have a way of thinking that goes something like this. You're scared that non-workload applications might kill things, they might invoke the OOM killer, they might degrade the performance of the workload, so you put those non-critical applications in system.slice and set a memory.max. But now you're worried that some particularly badly behaved application is in there, so you set a memory.max for that particular application so it doesn't affect all the others. But now you're worried that your main application might cause a global OOM and slow down the machine as a whole, so you set one there as well, to make sure you only ever get a cgroup OOM and never a global one. And so on and so on, until you end up with a bunch of impossible-to-reason-about limits sprinkled all over the place which you can never really re-evaluate in the future. Inevitably one of these applications is going to legitimately grow a bit, and it's going to be an absolute pain to reason about resource distribution again, what on earth those numbers mean, and how you should adjust them. And this is infinitely worse if you work somewhere like Facebook or Google, where we have thousands of services and there's no one person who understands everything that runs on a machine.
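Mechanically, each of those sprinkled limits is just a write to memory.max; a minimal sketch, with a hypothetical slice name and size:

    # Writing to memory.max makes the kernel reclaim aggressively, and
    # eventually OOM-kill inside the group, to keep usage under the cap.
    CG = "/sys/fs/cgroup/best-effort.slice"

    with open(CG + "/memory.max", "w") as f:
        f.write("2G")   # accepts bytes or K/M/G suffixes; writing "max" removes the limit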
So our ultimate goal here is just to keep whatever's in workload.slice running; that's the important thing on the machine. What if we could just encode that, instead of sprinkling memory.max everywhere and causing a headache for the system administrator? What if we could just write something like that directly? memory.low is a fundamental change to the way we've gone about controlling memory on Linux, and Unix more widely, for the past fifty years. The idea is that instead of trying to control memory by putting every application into a very tight coffin, where you're going to use exactly N megabytes and you cannot go even a page over, we should just say how much the applications in a cgroup we want to protect need in order to operate, and let the system sort the rest out.

Sounds great, but how does it actually work? All of this works based on reclaim. Reclaim is the process of trying to free pages, either in response to a global memory shortage where the system is generally under a lot of pressure, or when brushing up against limits within a cgroup, like memory.max, and trying to get usage back down again. memory.low hooks into the kernel's reclaim machinery in order to protect some memory for a cgroup. For example, if you have memory.low of 20 gigabytes, then as long as you are using below 20 gigabytes of memory you'll generally be completely exempt from reclaim, you'll generally not have pages freed from you, and the only time we'll dip into memory.low reserves is if there's an immediate global memory shortage; you can see when that happens by looking at the memory statistics we provide in the cgroup directory.

This might sound simple, but in reality it's fairly non-trivial to implement. For example, if multiple different levels of protection in the cgroup hierarchy are competing against each other, we have to decide what to do, how much protection to allocate to each one, and how to distribute protection from a parent to its children, especially when the children are competing with each other. We also need to handle the case where you're slightly over your memory protection: we don't want binary behavior where you go one page over, we reclaim everything, and then we bounce back and forth, so in a modern kernel there's a mechanism to reduce reclaim in proportion to how much of the workload is protected at any given time.

Another nice effect is that this is a lot more work-conserving. Work-conserving is just a fancy term meaning that, as long as you're not dipping into what the workload needs at any particular moment, you can use as much memory as you like; even if you're in the best-effort slice you can grow as much as you like, as long as you're not affecting the workload. This is much more tolerant of temporary spikes and changes in memory composition on the system: we're not reserving those 20 gigabytes for the workload, we're just saying that if the workload happens to need them, we're going to aggressively give them back. Inside Facebook we are now primarily reliant on memory.low protection rather than limits in production, and at the fleet level we have mostly transitioned towards using these protections instead of punitive limits.
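Here is a minimal sketch of protecting the main workload instead of capping everyone else; the slice name and the 20G figure are hypothetical:

    # As long as workload.slice stays under its protection, reclaim generally
    # leaves it alone; everything else on the box competes for what remains.
    CG = "/sys/fs/cgroup/workload.slice"

    with open(CG + "/memory.low", "w") as f:
        f.write("20G")

    # With systemd you would normally express the same thing declaratively,
    # for example MemoryLow=20G in the unit or slice file, rather than poking
    # the cgroup filesystem directly.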
That said, there are some more niche use cases where limits can be used for something useful, so let's go into one of them. One of the more common things that happens when people move to cgroup v2, and I give my usual spiel about not using RSS because it doesn't mean anything useful, is that they go and set up a limit based on memory.current. memory.current is a file that, as you might guess, shows the current memory usage for a particular cgroup, but it's quite important to understand what that means. The very fact that we're not talking about RSS any more, but about all kinds of memory used by the cgroup, means the ramifications are different and the meaning is different: it includes things like caches, buffers, socket memory, kernel objects allocated on behalf of the application, anything like that. All of this is in memory.current, and that's exactly how it should be, because that is what we account for and what we limit on.

As mentioned earlier, the reason people choose RSS to limit on is not because it's something good to measure; it's because it's relatively static, it doesn't move around a lot, it's easy enough to set a limit on, and it makes you feel pretty good at night, I guess, even if it doesn't do anything particularly useful. memory.current suffers from exactly the opposite problem: it tells the truth, and people don't know how to handle the truth. For example, if you set an 8 gigabyte memory.max and your application runs for a while with no global memory pressure, the system not being contended at the global level, what is your memory.current going to be? It's going to be 8 gigabytes. We're going to expand to fill the memory we've set as the limit for the cgroup, because there's so much slack; there's no reason for us to get rid of all this nice file cache, buffers, all that kind of stuff. It's not that your application really needs eight gigabytes; it's that Linux is passively populating the page cache and other nice-to-have things, and if there's no pressure from outside forces to shrink, we're just going to expand until we reach the limit.

So what should we do? How should we know what the real amount of memory needed is at any given time? Again, you can't get that from RSS; we just don't have this metric in Linux at this point in time. To answer this question we need the assistance of another technology which was invented at Facebook. Let me ask you a question which might seem easy, but have a think about it; it seems like a softball question but it's actually kind of complicated: how can you tell that you're running out of physical memory before it happens? If I were to ask you whether your machine is oversubscribed on memory, what metrics would you traditionally look at to determine that? One of the things that, if you're newer to Linux, you might say is that you want to look at the free memory, or free memory without caches and buffers. But that doesn't really tell us how much memory is actually required: having your memory fully in use doesn't necessarily imply being saturated, just like the expanding-to-eight-gigabytes case we mentioned a moment ago.
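To make that expansion behavior concrete, here is a minimal sketch that samples memory.current against memory.max over time; the path is hypothetical, and on an uncontended machine you will typically see usage creep towards the limit as the page cache passively fills:

    import time

    CG = "/sys/fs/cgroup/workload.slice"

    def read_int(name):
        raw = open(CG + "/" + name).read().strip()
        return float("inf") if raw == "max" else int(raw)

    for _ in range(5):
        cur, limit = read_int("memory.current"), read_int("memory.max")
        print("memory.current = %.2f GiB (limit %.2f GiB)" % (cur / 2**30, limit / 2**30))
        time.sleep(60)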
Another metric you might bring up is something like page scans, the rate at which we're trying to free pages. If your page scans are high it could mean the system is oversubscribed on memory, but it could also mean your system is operating in an ideal state; it's really hard to tell from that alone what exactly the situation is. So it's not easy to tell from these kinds of metrics whether we're about to go over the cliff or whether things are generally pretty stable. Usually, all of these metrics we come up with are just approximations of memory pressure.

So what is memory pressure? We've never had a metric like this in the kernel before. We have many related metrics, like memory usage and page cache and buffer usage, but even with all of those, and the ones we went over earlier, it's really hard to determine whether memory is ideally used or oversubscribed. The idea in PSI, the pressure stall information metrics, is to measure how long we are stuck doing memory work which we otherwise wouldn't have had to do if there were more memory on the system. "Some" here means that some threads in the cgroup were stuck doing memory work for, say, 0.21 percent of the time in the last 10 seconds. "Full" means something similar, but it means all threads in that cgroup were stuck on memory work; none of them could make forward progress because they were stuck on memory. This could be things like waiting for a kind of memory lock, being throttled, or waiting for reclaim to finish, but it can also be dominated by memory-related IO, for example refaulting file content back into the page cache, or doing swap-ins, that kind of stuff. Pressure is essentially saying: if I had more of this resource, I could probably run N percent faster; in this case, 0.21 percent faster.

This can also be really useful in developing highly reliable or highly available applications. For example, in some services we use this to know in advance when we are using too much memory and to do load shedding, backing off requests, that kind of stuff. We also use it for oomd, our userspace OOM detector, which has a really fine-grained policy engine and decides what to prioritize on the system when memory gets oversaturated. We use it for all kinds of things, and not just at Facebook: Android and many others are also using it for resource contention detection. You can't get that just by measuring memory, or resident memory, because again you don't really know which things you can reclaim, which things you can free, until you actually go and try to do it, and that's why memory pressure is so powerful.
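Here is a minimal sketch of reading the PSI memory pressure file for a cgroup; the same format is available system-wide at /proc/pressure/memory, and the cgroup path is hypothetical:

    # Each line looks like: "some avg10=0.21 avg60=0.15 avg300=0.10 total=123456"
    # "some" means at least one task in the group was stalled on memory;
    # "full" means all runnable tasks in the group were stalled at once.
    CG = "/sys/fs/cgroup/workload.slice"

    def read_pressure(path):
        out = {}
        with open(path) as f:
            for line in f:
                kind, *fields = line.split()
                out[kind] = dict(kv.split("=") for kv in fields)
        return out

    p = read_pressure(CG + "/memory.pressure")
    print("some avg10:", p["some"]["avg10"], "%")
    print("full avg10:", p["full"]["avg10"], "%")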
So how does this all relate to the case we mentioned earlier, where we have some slack in memory.current and we want to work out what the real memory usage is? Let's take an example kernel compilation job, which with no limits has a peak memory.current of just over 800 megabytes. When a memory threshold of 600 megabytes is applied, the job actually finishes in roughly the same amount of time with 25 percent less available memory. The same happens when we go down to 400 megabytes: we're now using half the memory originally used at peak by the application, with only a few seconds more wall time on the compilation. Pretty good trade-off, usually. However, if we dial it down just a little further, things never even complete: we're nine minutes in, compared to four minutes, and it's not even done; we had to control-C it. So we know the process needs somewhere between 300 and 400 megabytes of memory to operate with reasonable performance, but finding the exact cutoff where job performance goes down the toilet is a really tedious trial-and-error process. It also only works when a job does a reliably fixed amount of work every single time it runs, like this example, and most production services are not like that: they respond to user input, they respond to context. To get an accurate number for services at scale that have highly variable memory requirements, we need a better way.

Enter senpai. Senpai is a simple, self-contained tool that uses all of the technologies we've touched on so far to find out, essentially, how much memory my application actually needs to run. To do so it uses the PSI pressure metrics, and it uses memory.high to apply just enough memory pressure on a container to evict the cold memory pages which aren't necessary for nominal workload performance. It's an integral controller which dynamically adapts to memory load peaks and troughs, and it provides a memory working-set profile of an application over time, so it can be used to answer the question we want to answer: how much memory does my application really need at any given point in time? In this case, with the same kernel compilation job as earlier, we find the answer is actually something like 340 megabytes or so, but to get there we just have to keep on pushing until we see pressure, because we are lowering memory.high, which is the throttling limit, and then we keep adjusting it as necessary, backing off and being more aggressive, and so on. The script is only about 200 lines; it's not very complicated, but the technologies it uses are the thing I really want to focus on here. Earlier in the talk I mentioned a team which had for years believed their fleet host memory footprint was 150 megabytes or so, and we found out using senpai that it was more like two gigabytes, the point at which their application just fell on the floor and couldn't really do anything any more. Senpai is highly reliant on these new technologies, memory.high and PSI, which we've been working on and improving over the past five years of cgroup v2. We also use senpai internally to find regressions and leaks in a way which RSS measurements simply could never do, so it's a really important part of making sure your systems are reliable and you understand how much you're loading them at scale.
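To give a flavour of the approach, here is a very simplified sketch in the spirit of senpai, not the real tool: it slowly lowers memory.high on a cgroup while pressure stays low, and backs off when pressure rises. The path, threshold, and step sizes are made up for illustration; the real senpai is a proper integral controller with much more careful tuning:

    import time

    CG = "/sys/fs/cgroup/workload.slice"
    PRESSURE_TARGET = 0.1   # hypothetical: aim for roughly 0.1% full memory pressure
    STEP = 0.01             # adjust memory.high by 1% per iteration

    def full_avg10():
        # Parse the "full avg10=..." figure out of memory.pressure.
        with open(CG + "/memory.pressure") as f:
            for line in f:
                if line.startswith("full"):
                    return float(line.split()[1].split("=")[1])

    def current():
        return int(open(CG + "/memory.current").read())

    def set_high(nbytes):
        with open(CG + "/memory.high", "w") as f:
            f.write(str(int(nbytes)))

    high = current()
    while True:
        if full_avg10() < PRESSURE_TARGET:
            # Little pressure: squeeze a bit harder to evict cold pages.
            high = min(high, current()) * (1 - STEP)
        else:
            # We're hurting the workload: back off.
            high = high * (1 + STEP)
        set_high(high)
        time.sleep(10)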
Earlier we also touched a little on why memory and IO in particular are so interconnected: for example, writebacks go through memory first, in the page cache. Another reason is that we typically also do demand paging from disk. For example, the Chrome binary's code segment, where all the code lives, is over 130 megabytes, and there's no reason for us, on a contended system, to load all of that up front; on a system with little slack we'll likely load it in as gradually as possible, and that has ramifications for both memory and IO as we buffer it in. For this reason we really need to have control over IO when we have control over memory, otherwise memory pressure just translates into disk IO.

Probably the most jejune way to solve that is by trying to limit disk bandwidth or IOPS, but this doesn't usually work out very well in reality. Any modern storage device is typically a queued device: you can throw quite a lot of commands at it, and often when you throw multiple commands at it you discover you get more bandwidth due to internal optimizations. The mixture of IO also really matters: reads versus writes, sequential versus random, the size of the IO, and so on; this applies even to SSDs to some extent. It's really hard to determine how much load a device is actually under, or what doing something to the device will cost in terms of device load; the cost of one IO operation or one block of data is not uniform, and it really depends on the larger context of the request. It's also really punitive to take this io.max kind of approach where, even if nobody else is using the disk, you have a fixed limit you can never exceed: you can never go above N IOPS or so many bytes per second. As such, it's not really good for best-effort work on a machine with variable load, where we do want that work to succeed and make the best use of the machine's resources when possible.

So one of the first ways we tried to avoid this problem was by using latency as a metric for workload health. What we might do, for example, is apply a maximum target latency for IO completions on the main workload, and if we exceed that, we start dialing back other cgroups with looser latency requirements, back towards their own configured latency thresholds. This prevents an application from thrashing on memory so badly that it kills IO across the whole system, or from doing so much IO, or such costly IO, that it starts to bite into the main workload. This actually works pretty well; it works really well for systems where there's only one main workload, where you have a very clearly defined main workload plus best-effort stuff.
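Here is a minimal sketch of setting an io.latency target; the cgroup path, device numbers, and 10 millisecond figure are hypothetical, and the io controller needs to be enabled on the group for the file to exist:

    # The format is "MAJ:MIN target=<time>", with the target expressed in
    # microseconds; double-check the exact format against the cgroup-v2
    # admin guide for your kernel. MAJ:MIN for a disk is in /proc/partitions.
    CG = "/sys/fs/cgroup/workload.slice"
    DEVICE = "8:0"   # e.g. sda; check your own system

    with open(CG + "/io.latency", "w") as f:
        f.write(DEVICE + " target=10000")   # 10ms IO completion latency target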
But let's look at a multi-workload case as well. Here there are two high-priority workloads stacked on a single machine: one has an io.latency of 10 milliseconds, the other 30 milliseconds. The problem is that as soon as workload one gets into trouble, because it has the tightest latency limit, everything else is going to be throttled; that's how io.latency works, we throttle the cgroups with looser limits back towards their configured limits, so workload two is always going to see 30 millisecond latencies. Again, that's fine if the thing you're throttling is just best-effort, but here we have two important workloads which we both want to prioritize. So io.latency is great if you have only one workload, but it's pretty hard to prioritize multiple stacked workloads in a reasonable way, because anything that isn't the highest-priority workload may be throttled too harshly. This is more of an API design concern than a technical restriction; I mean, there are technical problems here, but it's mostly brought on by the way the API is designed.

So how can we solve this? Our solution is io.cost, which might look very similar at first, but notice the omission of the units: these are not milliseconds but a cost-based solution, where you specify some weight for the cgroup. How do we know what 40 or 100 or 60 means in this context? Well, the total of all three is 200, so in theory, if they're all demanding the disk and saturating it together, we're going to allocate best-effort.slice about 20 percent of saturation capacity, workload one should get about 50 percent, and workload two about 30 percent. And just like io.latency this is work-conserving, which means that if nobody else is using the disk, if we're not saturating the disk, anyone can use 100 percent of what's sensible for the disk.

But this is not that simple, right? How do we know when we're saturating the disk? How do we know what 30 percent of saturating the disk looks like, or 50 percent? io.cost builds a linear model of your disk over time: it sees how the disk responds to variable loads and builds a model based on things like whether it's read or write IO, whether it's random or sequential, the size of the IO, and other factors which let it estimate how expensive an IO is expected to be in the current system context. It takes this quite complex question of how much latency or throughput my application needs and boils it down to a simple weight-based prioritization system for overload, and it doesn't suffer from the same problem as io.latency: you can stack workloads with this weight-based system, since it doesn't seek to penalize others because you're not meeting your allocated targets; it just wants to balance out the disk to reach its peak potential.

io.cost has a few different mechanisms of operation. The basic on-the-fly model uses queue depth as a back-off mechanism, which works kind of okay in most cases, but for those who want more fine-grained control there's a program called resctl-bench, which is linked on the slide, which does quite detailed benchmarks of the disk and produces quality-of-service tunables for the kernel; you can copy those from that tool into io.cost.qos to tell io.cost how your disk behaves under certain kinds of load.
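Here is a minimal sketch of the weight-based setup described above; the slice names and weights are hypothetical. On kernels built with the iocost controller, io.weight is the per-cgroup knob that io.cost balances against siblings, and io.cost.qos in the root cgroup is where the parameters reported by resctl-bench can be pasted:

    ROOT = "/sys/fs/cgroup"

    # Roughly the 50/30/20 split from the example: weights are relative,
    # and only matter when the disk is actually saturated.
    weights = {
        "workload-1.slice": 100,
        "workload-2.slice": 60,
        "best-effort.slice": 40,
    }

    for slice_name, weight in weights.items():
        with open(ROOT + "/" + slice_name + "/io.weight", "w") as f:
            f.write("default %d" % weight)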
The benchmark approach is a little more accurate, since we can inject load and measure its effect on the disk in a really controlled way, instead of just doing it passively based on whatever IO you happen to be performing at any given time. But whether you use the on-the-fly model or the QoS model, the weight system in general is fairly successful in significantly simplifying the ongoing balancing of IO for users.

So I've been talking about cgroup v2 for a while now in this talk; I've also been talking about it for several years at different conferences, and historically the response I got was "this is all very nice, but Docker doesn't support it, so what am I going to do?" Nowadays we have quite a diversity of container runtimes, and I'm pleased to report it's basically supported on more or less everything relevant. So even if nothing changes on your side, just moving to cgroup v2 will allow your container manager to get significantly more reliable accounting for free, and we spent quite a while working with Docker and other folks to get things working and make sure things make sense there. I'm also very thankful to Fedora for making cgroup v2 the default since Fedora 32, as well as making things more reliable behind the scenes for users. This also got people into gear: I think it was a good signal to the industry as a whole that if you are serious about containerization and you want your product to be relevant, you had really better support cgroup v2 now; the time has definitely come for you to support it or fade into irrelevance.

The KDE and GNOME folks have also been really busy using cgroups to better manage their desktop handling. David Edmundson and Henri Chain from KDE in particular gave a talk at KDE Akademy 2020 titled "Using cgroups to make everything amazing", which I won't dispute or try to disagree with in any way if they want to call it that, and it basically goes over their use of cgroups, and cgroup v2 in particular, for resource control and interactive responsiveness on the desktop. This is definitely a developing space, but most of the major desktop environments are now investing in resource control and cgroup v2, and if you're interested in that I definitely recommend giving their talk a watch; it really goes into the challenges they have and the unique features cgroup v2 has to solve them.

Android is also using the metrics reported by the PSI project to detect and prevent memory pressure issues that affect the user experience. Latency is especially important on platforms like this, and we need things like the PSI pressure metrics to keep the device responsive and keep the user happy. It would really suck, for example, if you were about to buy something and when you clicked buy everything completely froze and you had no idea whether it bought it or not. That's one of the things PSI can tell you: whether you are approaching the limit, approaching the edge, so you can take some action to resolve it ahead of time. We've been working with engineers from Google and other companies to use these pressure metrics we created to more proactively prevent these scenarios and produce a more responsive and fluid experience for the billions of Android users worldwide.

One thing I'm personally pretty excited about is that a number of the technologies I've talked about in this talk allow us to do things as engineers that we've never been able to do before at the operating system level.
This is one of the first times we've presented these as something cohesive, a package to improve your operating system and your systems at scale, so I really hope I've been able to help you think about how these tools and technologies might help you, your organization, your projects, and your services in a way that's relevant for what you actually need. I've been Chris Down, and this has been Five Years of Cgroup v2: The Future of Linux Resource Control. Thank you very much, and have a great rest of your week.
Info
Channel: USENIX
Views: 1,394
Keywords: usenix, technology, conference, open access
Id: kPMZYoRxtmg
Length: 43min 19sec (2599 seconds)
Published: Wed Jun 09 2021