OS hacking: Profiling (and optimizing) the SerenityOS kernel with "qprofiler"

Captions
Well hello friends, welcome back to the program! Today we're going to do some performance analysis. There's something I don't like about Serenity: our JavaScript test suite keeps taking longer and longer to run. We have this test-js program — you can see it running here, with a little progress bar in the taskbar — and it just takes quite a while. There, it finished in 12 seconds. That's a cold run, of course, so nothing was cached; if we run again we should get a slightly better result just because all the libraries and everything have been paged in from disk. Yeah — 11.3 seconds. And if we run the Linux build of test-js, it's just significantly faster: it finishes in just under two seconds, 1.9-something. So there's a very large performance difference between Linux and the Serenity kernel when running this program. Of course the kernel isn't the only difference — we also use our own C library with our own malloc, and so on — but I'd really like to know what's behind that huge gap in runtime.

Usually I'd use Serenity's built-in system profiler to capture a trace — you've seen me do that many times — but today I thought we'd try something different. I've seen this thing called qprofiler, which connects to your QEMU instance and simply samples whatever the system inside QEMU is doing at that moment. I remember maybe a year ago somebody tried to get it working for Serenity, but I don't know if they ever did, so let's try to make it work today. Let's clone it... wait, that's not what I was looking for — this is what I wanted. All right, how do you use this thing? ./qprofiler... okay, you have to pass a path to a socket and some arguments. How do we get QEMU to create that socket, though? Probably this option. Let's tweak our QEMU arguments — right here will be fine — so that it creates the socket. If we run now, does it create that thing? It does. So presumably QMP is the QEMU Machine Protocol, or something like that. Now let's run the profiler... okay, it's doing something, but I probably have to change this kernel file name, because we're getting a bunch of "invalid access" lines and then we eventually fail. Right — "No such file" — so let's point it at our own kernel, and let's change the duration to one second so it at least fails faster. We're still getting a lot of these invalid access errors, and I feel like this is where somebody got stumped last time. I think the problem is that these are virtual addresses: our kernel is loaded at a very high virtual address, so we probably have to subtract the kernel base address somewhere.

Let's see how this thing is implemented. It's Python, which I'm definitely not super familiar with — I've written some, but not much. Okay: argument parsing, blah blah blah; "serv" is a QEMUMonitorProtocol object, so that's presumably what we use to speak to QEMU. Then for i in range(...) — duration divided by frequency, which is the number of samples to take — and after each sample we sleep for the frequency, which we passed as 0.05, so 50 milliseconds. Seems like a vaguely reasonable approach. What does do_command do? It just sends a command through that monitor protocol object. So the script sends "info registers" and gets back a response — rsp sounds like the stack pointer, but here I think they mean "response". Let's print it: we get this big register dump back over the socket. Then there's a regex search for "RIP=", which finds the line with the current instruction pointer — matching those bogus addresses — captured as group one. Let's print that too: yes, these are our instruction pointers. Then it sends another command, xp $ebp+8, trying to grab the current stack frame, I guess; it runs a regex on that and saves the result in rip2... which is then used for nothing. So that does nothing; let's just remove it.

Now for the offset problem. What's the kernel base? Right, this is the base we use on x86_64, so if we subtract it from the addresses, maybe everything starts working. rip is a string right now — how do we turn a string into a number in Python? "convert a hex string to int in python" — I see: if it had a "0x" prefix, int() would figure it out by itself, but since there isn't one we have to specify the base. Simple enough: let's call this rip_string, and then rip = int(rip_string, 16).
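Piecing that together, the core of each sample — regexing RIP out of QEMU's "info registers" dump and rebasing it against the kernel load address — looks roughly like this. The parse_rip helper, the regex, and the exact KERNEL_BASE constant are my own illustrative stand-ins, not necessarily what qprofiler or the SerenityOS sources actually use:

```python
import re

# Illustrative stand-in for the x86_64 kernel load address that gets
# subtracted from every sample; the real value lives in the kernel sources.
KERNEL_BASE = 0x2000000000

def parse_rip(info_registers_output):
    """Pull the instruction pointer out of QEMU's 'info registers' dump
    and rebase it, returning None for user-space samples."""
    match = re.search(r"RIP=([0-9a-f]+)", info_registers_output)
    if match is None:
        return None
    rip = int(match.group(1), 16)  # no "0x" prefix, so the base must be explicit
    if rip < KERNEL_BASE:
        return None  # sample landed in user-space code; skip it
    return rip - KERNEL_BASE

# A (heavily shortened) register dump in the shape QEMU prints:
dump = "RAX=0000000000000001 RBX=0000000000000000\nRIP=0000002000001234 RFL=00000246"
```

With the dump above, parse_rip returns the kernel-relative offset 0x1234, and a user-space RIP below the base is dropped.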
Okay, so these are numbers now — and then we subtract the base we're using here. Wait, how do I print something as hex in Python? Oh, you just call hex() — sweet, that's pretty cool. Now, how is rip used later? It's kept in a hash, which I guess remembers how many samples we had at each particular instruction pointer, and then for every remembered instruction pointer it calls addr2line. The value is passed verbatim, but I think addr2line expects a hex string, so we can probably just format it like this. Hmm — one of these is not like the others: this one does not look like a valid kernel instruction pointer. Right, of course: we're just capturing whatever RIP the CPU has at the moment of the sample. If we're running in the kernel we'll be above the base address, but if we're running user-space code we won't be. So we can just skip those: if rip is less than that number, continue. Yes! I Python'd!

So we're starting to get something that looks maybe a little bit like a profile, but these are mangled C++ symbols. The script shells out to addr2line and expects it to print something nice — and addr2line has a --demangle flag, so let's use that. Look at that: nice C++-looking names. (They're not sorted — that's fine for now.) Let's drop the rip print since it's working, and clean up the messy output: it prints "percent of time, function name at ??... has been executed", and that "has been executed" bit I just don't need. I also don't really care about percent of time — can we print sample counts instead? The hash already holds the raw sample count for each address, so yes. Starting to look kind of cool. The formatting is irritating, though — how do you do format strings in Python? "right justified number python" — "use f-string syntax"... wait, this first result looks like something I would have written in QBASIC when I was a very young man. This one looks more familiar: something like a colon with a width for eight right-justified spaces, then pass in the value. That wasn't quite right — oh, .format() — okay, my bad. The file-name part of the addr2line output is kind of useless too — can you make addr2line skip it? Well, not the end of the world. These DWARF errors on stderr are irritating, though: "subprocess check_output redirect stderr" — yeah, this sounds like what I want: just tell subprocess to redirect it for you by passing stderr. I'm probably writing very hackish Python at the moment, but it totally worked. And then I think we can just pipe through sort -n at the end, so we're always sorted by sample count.

Let's see if this thing actually works: let's capture while test-js is running. Oh, we don't get many samples when only running for a second, so let's run for five seconds and capture way more. (Also, "frequency" is not spelled correctly in this script, but let's just ignore that.)
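The tallying and reporting this ends up with can be sketched like so — a counter keyed by the rebased instruction pointer, with right-justified sample counts sorted ascending, mimicking the sort -n pipe at the end. The symbolication step (shelling out to addr2line --demangle with stderr silenced) is left out, and all names here are illustrative:

```python
from collections import Counter

def tally(samples):
    """Count how many times each rebased instruction pointer was sampled."""
    return Counter(samples)

def report(counts):
    """Render one 'count  address' line per instruction pointer, smallest
    count first (the same ordering `sort -n` would produce). In the real
    script each address is fed through addr2line for a symbol name."""
    lines = []
    for addr, n in sorted(counts.items(), key=lambda kv: kv[1]):
        lines.append("{:>8}  {:#x}".format(n, addr))
    return "\n".join(lines)
```

So three samples at one address and one at another come out as two lines, hottest function at the bottom — which is exactly where your eye lands after the terminal scrolls.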
All right, so what are we getting here? Note that we don't get stacks — we only capture the instruction pointer at each sample, so we don't get the full picture of what's going on — but we do get to sample the currently executing function, which can probably tell us something. What we see in this particular capture is that the highest number of samples is in MemoryManager::kernel_region_from_vaddr(). That's where we spend the most time — or rather, that's where we're caught red-handed the most often. Seems interesting; let's see what we can do about it.

kernel_region_from_vaddr() is used, for example, when you get a page fault: the CPU tells you there's a fault at some address, and it's up to our memory manager code to figure out what's supposed to be there and page something in. The first thing it does is find the higher-level memory Region object responsible for that address — and as you can see, it does that by iterating through all the kernel regions and checking whether each one contains the address. That's obviously O(n), linear time, which is not great. The container is an intrusive linked list, but I think we can replace it with a red-black tree — we already use a RedBlackTree for user-space memory regions. If you look at the AddressSpace class, each process's address space already manages its regions in a RedBlackTree, which gives logarithmic-time lookup. This list here is specifically for kernel regions — memory inside the kernel-only memory area — so let's switch it out too.

One little thing to know here: for user-space memory, the AddressSpace object owns the regions it contains, whereas kernel regions are owned by OwnPtrs scattered around the kernel. So this is a weakly-owning intrusive list at the moment, and we'll need a non-owning red-black tree to replace it: a RedBlackTree keyed by FlatPtr with Region* values. Now, instead of iterating through the entire set of regions, we call — there's an API for this, I always forget what it's called — m_kernel_regions.find_largest_not_above() with the virtual address. If that's not a hit, we don't have a region. (Oh, it actually returns a pointer to a pointer — the API is a little awkward, but I don't know how to make it better.) If we do get a region, we verify it actually contains the address and return it. That looks really nifty, and it's O(log n) worst case. Then there are a bunch of other places using m_kernel_regions that now have to change, since it's not an intrusive list anymore: there's no append() — is it set()? no, insert() — and insert needs a key, which is the region's virtual address, region.vaddr().get(). To remove something we also pass in the key. And then there's this function that just dumps out all the kernel regions; those uses need to switch to arrows, is what I'm trying to say, or we can do something like this — I don't know.
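I can't show AK::RedBlackTree here, but the shape of the new lookup — find the region with the largest base address not above the faulting address, then check containment — can be modeled with a sorted list and bisect. Class and method names below are illustrative, and note that a real red-black tree also gets O(log n) inserts, while insort is O(n); it's the lookup shape that matters:

```python
import bisect

class KernelRegionMap:
    """Toy model of kernel regions keyed by base address, replacing the
    linear 'scan every region' lookup with a predecessor query."""

    def __init__(self):
        self._bases = []    # sorted base addresses (the tree's keys)
        self._regions = {}  # base address -> (base, size)

    def insert(self, base, size):
        bisect.insort(self._bases, base)
        self._regions[base] = (base, size)

    def find_largest_not_above(self, addr):
        """The region with the largest base <= addr, or None."""
        i = bisect.bisect_right(self._bases, addr) - 1
        if i < 0:
            return None
        return self._regions[self._bases[i]]

    def region_containing(self, addr):
        """O(log n) replacement for the old linear containment scan."""
        region = self.find_largest_not_above(addr)
        if region is None:
            return None
        base, size = region
        return region if base <= addr < base + size else None
```

The predecessor query does all the work: at most one region can have a base at or below the address, so one logarithmic lookup plus one bounds check replaces walking the whole list.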
Maybe that's okay — fewer changes. Let's see if it works. This is one of those things I've been meaning to do forever but keep forgetting about — it's been a long time since I even thought about it — and the fact that it's the place we're caught red-handed most often in this profile is probably a sign we should do something about it. Using a better data structure is one of those wholesome things that's just obviously a good idea, but let's see if it actually improves anything — and, of course, that the code actually works. How confident am I that I got it right? We'll see. Okay, we're up — let's run test-js. Well, it's chugging along, at least not crashing... and we completed in 10 seconds cold. That already seems like a relatively large improvement — the first cold run was about 12 seconds. And look at that: hot and cached, we're down to 9.3 seconds. That's quite a substantial improvement. Super awesome. Let's commit it, because this is obviously a good thing: "Kernel: Store kernel memory regions in a RedBlackTree. We were already doing this for userspace memory regions in the AddressSpace class, so let's do it for kernel regions as well. This gives a nice speedup on test-js and probably basically everything else as well." Yes — that's super cool.

Okay, let's capture again using our qprofiler thing: start this, then run that — it's not an exact process we're doing here, just sampling a little bit — and now we wait for it to crunch through. As you can see, that function is just gone from the top. What was it called — kernel_region_from_vaddr — is it even in here? It is not. Kick-ass. The top entry is now BitmapView::find_next_range_of_unset_bits, which I think is the kernel heap allocator: the kernel heap is a first-fit malloc, and it uses this API. It's certainly possible it's doing something silly — it's a kind of overly complicated implementation of this thing — but this is not the day to write a new malloc, so let's ignore that one. What else is up here? SpinlockLocker's destructor, templated on RecursiveSpinlock — so I'm guessing RecursiveSpinlock::unlock gets inlined here, and it's all this code with atomic loads, atomic stores, and a whole bunch of gunk. There's probably some way to streamline that too, but it seems vaguely forgivable at the moment. Here's a raw atomic primitive, compare_exchange_strong — interesting that it even shows up, but that's often the case. More SpinlockLocker. Then flush_tlb_local: TLB flushes are a fact of life when you're doing a kernel, but is it doing something silly? It has a loop — the way the flush_tlb API works is that we get an address and a number of pages to flush the TLB for, starting at that address, and then we just loop through the page count. Nothing obviously wrong about this to me, unless there's a better instruction to use or something; it feels like the kind of thing that's just bound to be expensive.

And then PhysicalRegion::return_page — that's when we're deallocating a physical page because it's no longer used. Why might that be slow? Well, we have a FIXME: "find a way to avoid looping over the zones here". So it seems like my past self predicted that my current self would eventually come here and need a hint about what to do.
It's funny when you write a little note for your future self and then you actually find it — you're like, "thanks man, I appreciate that you thought of me." Past Andreas suggests we should do some math on the address to find the right zone index; the main thing in the way is non-uniform zone sizes — perhaps it would be better if all zones had the same size. Some background: this is about our physical memory allocator. It enumerates all the physical memory found at boot and chunks it into regions — right here we have a user-space physical memory region, starting at this physical address and ending at that one — and each region is chunked into what we call physical zones. We have 60 physical zones of 16 megabytes each, and then 10 physical zones of 1 megabyte each. We end up with that distribution because there wasn't enough left over for another 16-megabyte zone: we prefer to chunk into 16 MB zones, and chop the remainder into 1 MB zones. What my past self was saying in that comment is that this gets in the way of simple math: to return a physical page, we have its address and need to find the appropriate zone — and currently we iterate through all the zones, check whether each zone contains the physical address (just a pointer compare), and deallocate into the one that does. That loop is presumably why we show up in this profile at all.

So what are the options? "Perhaps it would be better if all zones had the same size" would mean ignoring those ten 1-megabyte zones — 10 megabytes going to waste because we couldn't chop them into a 16 MB zone. Completely losing 10 MB of allocatable memory seems silly, so that would be unfortunate. But here's a compromise: we could remember how many full-size zones we have; in the common case the page will land somewhere within that set, so simple address math finds the zone index, and outside it we fall back to iterating over the remaining zones. That seems like something we could do. So how are these set up? initialize_zones... man, when did I write this code? It feels recent, but it was in July — almost half a year ago. Okay: make_zones takes a pages-per-zone argument. First we make 16 MB zones with 4096 pages each, then we divide any remaining space into 1 MB zones with 256 pages each. Quite descriptive — although formulating the argument as a page count rather than a size makes it a little less obvious how it works; the comment makes up for that.

So what do I want to express down here? I want to compute a full-size zone index: take the physical address we're returning to the allocator, paddr.get(), minus the lower address of this physical region — now we have an offset into the region — and divide that by the zone size. Let's also take the opportunity to make this more descriptive: large_zone_size is 16 MB and small_zone_size is 1 MB (forgive me for saying MB instead of MiB), and pages_per_zone is zone_size divided by the page size — a little bit of math, shouldn't be too complicated. So the full-size zone index is the offset divided by large_zone_size. Then, how do we know how many large zones there are? We have the sizes here, so we can compute a large zone count. If the index is within that count, the zone at m_zones[full_size_zone_index] should indeed contain the address, and we can just deallocate the page into it without looping. Otherwise we do still have to loop — but we can start the loop later, at large_zone_count, and we should definitely find it among the small zones. I feel like I very likely got something wrong in the math, but basically: if we're within the full-size zones, simple arithmetic computes the index into the zones vector, which lets us skip looping in the common case — and that should be faster, I'd imagine. It'll be interesting to see whether return_page disappears from the profile, and whether test-js runs faster; it doesn't seem like a bad idea in general to avoid looping if possible. Seems like it runs in 9.6 seconds. Not bad.
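The common-case arithmetic described above amounts to just this — the names and the fall-back convention are illustrative, not the actual kernel code:

```python
PAGE_SIZE = 4096
LARGE_ZONE_SIZE = 16 * 1024 * 1024  # the 16 MB "full-size" zones

def zone_index_for(paddr, region_lower, large_zone_count):
    """Return the zone index for a page being handed back, or None when
    the page lies in the 1 MB remainder zones and the caller must fall
    back to looping (starting at large_zone_count, not at zero)."""
    offset = paddr - region_lower      # offset into the physical region
    index = offset // LARGE_ZONE_SIZE  # which full-size zone that lands in
    return index if index < large_zone_count else None
```

With 60 full-size zones, returning any page in the first 960 MB of the region becomes one subtraction, one division, and a bounds check instead of a scan over up to 70 zones.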
Okay, 9.2 hot — ah, nothing amazing here — but... 8.995! Ooh, look at that, numbers starting with eight. Very cool. Let's capture a profile and look for PhysicalRegion::return_page. Not a single one! That's pretty cool — obviously this isn't very precise or exact, but we went from four samples to zero, which I'd call an improvement. And doing arithmetic instead of looping is probably a good idea anyway, so I'm going to go ahead and say this is a good change and we should keep it.

Now, what's this: HPET read_main_counter, called from update_time in the scheduler path. That's the system timer — HPET stands for High Precision Event Timer — which we use to drive our scheduler; it's the timer chip we use. It's kind of weird that we're spending so much time just reading the current clock, essentially. Why are we doing that so much? Let's investigate read_main_counter. There are two paths here: if it's a 64-bit counter we call some 64-bit function, otherwise we do a 32-bit read. Let's find out which one we're actually using — this will be very spammy, but I just want to know which path we take. 64-bit, okay. So whoever calls us, we end up in read_register_safe64().

Wait, what is this? "As per 2.4.7, this reads the 64-bit value in a consistent manner using only 32-bit reads." But what if I'm on a 64-bit CPU — does this even matter? I'm assuming 2.4.7 is a section in the HPET spec: "Issues related to 64-bit timers with 32-bit CPUs. A 32-bit timer can be read directly using processors that are capable of 32-bit and 64-bit instructions. However, a 32-bit processor may not be able to directly read a 64-bit timer. A race condition comes up if a 32-bit CPU reads the 64-bit register using two separate 32-bit reads." Sure — but if we're on 64-bit, I think we don't need any of this, because we can just do a single 64-bit read. So let's make the careful path specific to i386, and otherwise... we'll need something like an anonymous union with an anonymous struct, so I can just read reg.full. Since we have a 64-bit CPU in this case, we don't need to worry about the race condition that arises when you can only read 32 bits of the timer value at a time. Look at what the 32-bit path actually does: it reads part of the timer value and then loops until it can read the second part with the first part still intact — "if new_high == high, break" — so it loops until it gets a consistent read of both halves. None of that matters on a 64-bit CPU, which is what I'm using here. I wonder if this will make HPET::read_main_counter drop to nothing as well — certainly this is a much simpler piece of code than before. We should probably validate that the generated assembly for read_register_safe64() produces a single 64-bit read — wait, how do I do that?
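To see why the 32-bit path needs that loop at all, here's a toy simulation of reading a 64-bit counter via two 32-bit halves. The real kernel code does volatile MMIO reads; everything here, including the fake counter, is illustrative. Re-reading the high half catches the moment a carry ripples from the low word into the high word between the two reads:

```python
def read_64bit_via_32bit_halves(read_high, read_low):
    """The dance described in HPET spec section 2.4.7 for 32-bit CPUs:
    retry until the high half is stable across the low-half read, so both
    halves belong to the same counter value. A 64-bit CPU can replace all
    of this with one aligned 64-bit load."""
    while True:
        high = read_high()
        low = read_low()
        if read_high() == high:  # no carry into the high word in between
            return (high << 32) | low

class TickingCounter:
    """Fake counter that advances on every register access, set up so the
    first read attempt straddles a low-word rollover."""
    def __init__(self, values):
        self._values = values
        self._i = 0
    def _next(self):
        value = self._values[min(self._i, len(self._values) - 1)]
        self._i += 1
        return value
    def read_high(self):
        return self._next() >> 32
    def read_low(self):
        return self._next() & 0xFFFFFFFF
```

Without the re-read, the first attempt below would stitch high=1 together with low=0 and report 0x100000000 — a value the counter never held.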
Or maybe it's folded into read_main_counter — yes, this is probably all inlined into read_main_counter. Wait, why is there so much going on here? Oh right, it's all the code for the case where it's not a 64-bit counter; if it is 64-bit, we test for that up here and jump down there. Either way, let's just sample it. It's not magically faster, at least: 9.82 seconds — sure, that's cold — and 8.962, okay, so we're dipping comfortably into the 8.9-ish range. That's pretty nice. Let's do a capture and see what it comes up with. The function moved up the list: we previously had four samples, now it's down to two. Again, not a very precise science, but it seems like a fairly reasonable thing to do — the change itself is super obviously good, because the race is simply not a concern on 64-bit. Although maybe we should express the condition the other way around; that might read a little more understandably. Yeah, I think that's a good change, so let's commit it: "Kernel: Simplify 64-bit HPET reads. We don't have to worry about racy 32-bit reads when we're reading the 64-bit HPET value using a 64-bit CPU." That's a nice change. And then the physical memory zone thing: "Kernel: Make PhysicalRegion::return_page() use arithmetic instead of looping. Most of the time, we will be returning/freeing physical pages within the full-sized zones, so we can do some simple math to find the right zone immediately instead of looping through the zones, checking each one. We still loop through the slack/remainder zones at the end." Doing the same for those is probably an even nicer way to solve this, but this is already a nice improvement.

I wonder if we can capture even more samples — really I just want lots of them so we get a better picture of what's going on. It would be cool to get whole stacks, but then you'd have to walk the stack, and for that to work correctly you'd also have to pause the CPU — otherwise you'd get random unrelated frames. You'd have to completely pause the CPU, capture RIP and the stack pointer, then walk the stack backwards and reconstruct the call stack that way. Probably doable, but today we'll be happy with what we have — we still found some really nice things using this technique. Oh wait, I'm doing this the wrong way... okay, now I'm capturing lots and lots of samples — hopefully every two milliseconds for five seconds. I should be able to compute that in my head, but let's use the power of tools: 2500 samples. I wonder if it can actually capture that many over the QMP protocol — I feel like there's probably some overhead, and maybe what we'll end up with is a bunch of samples in the idle loop, because it doesn't seem very fast at capturing samples somehow. I suppose we could run the workload in a loop and capture that way, in case it just falls out to idle. Now it's taking forever... what are you doing... oh wait, here it comes. Well, we didn't fall out to idle, so that's great. SpinlockLocker, sure; the allocator; TLB flush; allocate_committed_user_physical_page — so page allocation is pretty heavy. Nothing super obvious here, really, but probably all of these things are worth looking into. I've done a whole bunch now, though, and I feel like we should wrap it up.

Maybe what I should do is fork this thing and push my little changes, so we have them in case somebody wants to play with this. "Make this work for SerenityOS x86_64" — I'll just push that up. And shout-outs to Matias Vara, who created this qprofiler thing — very neat. I think we can make all kinds of improvements to it, but at least now we have something basic that works and can be built upon; the potential is strong here. git push mine master, I guess — there we go. And then you need this QEMU argument — wait, what are you mad about, freaking shellcheck? — I'll just make it so we always run with the QMP socket on; that seems fine. "This allows external connections to the QEMU monitor via QMP."

So, that's a nice little set of changes that we found in this nice little way, and I think this will be it for today's video. If you made it this far, thank you for watching and for hanging out — I hope you saw something interesting. This was a very fruitful little session; I really enjoyed what we were able to do, and we certainly improved test-js runtime quite a bit: we started at 12 seconds cold and 11 seconds hot or so, and we've been in the nines, dipping even into the eights sometimes — oh, 8.8, that's really juicy. Of course, the Linux version runs in less than two seconds, so obviously we have a long way to go and a lot of problems still to figure out. Still, you've got to be happy about the little wins, and I'm happy about this one. Thank you very much for hanging out, and I will see you all next time. Bye!
Info
Channel: Andreas Kling
Views: 7,647
Keywords: serenityos, c++, programming, osdev
Id: 5JbDDRUk8Mk
Length: 66min 7sec (3967 seconds)
Published: Sat Dec 11 2021