DevTools hacking: Whole system profiling (all processes)

Captions
well hello friends, welcome back to DevTools hacking. today we're going to keep looking at profiler stuff, and it's going to be about whole-system profiling. until now we've only been able to profile one program at a time with our profiling system — if we select FileManager here, for example, we can create a profile of the FileManager process and see what it's doing, but if we want a holistic view of what all the processes in the system are doing, there's currently no way to do that. you have to choose one process, and that seems like an unnecessary limitation. so today we're going to start building out full-system profiling, and i think the idea is pretty simple, but the devil is in the details. basically, the way profiling currently works is in our scheduler tick — the thing that gets called every couple of milliseconds by the timer in the CPU to interrupt the current task and schedule a new one. before we go and figure out a new task to schedule, we first check if the current process is being profiled, and if so we append a so-called performance event, or perf event, to the perf event buffer of the current process. so this is the basic mechanism, and we have these two syscalls, profiling_enable and profiling_disable. the way they work is: you call profiling_enable with a pid, it creates a perf event buffer for that process and marks the process as being profiled, and then whenever we run the scheduler — which is every couple of milliseconds — if we're currently executing in a profiled process we generate one of these perf events and stash it away in that buffer. later on you can extract all these events: either you pick them up in the proc file system (there's /proc/PID/perf_events), or, when a process that has perf events exits, it dumps this thing called a perfcore, which is what happens if you do something like `profile -c TextEditor`. so now we've exited the program and we can see there's a perfcore file here that has been dumped, and you can open those with the Profiler program. so that's what happens today. to profile the entire system, though, this has to work a little differently. i'm thinking we approach it in a very simple way: we'll start with the syscall, and if you don't provide a pid — say you pass -1 — we'll just profile everything. so up here: if pid is -1, then i guess we should only allow the superuser to do this, so we return EPERM in that case if you're not the superuser, and otherwise we start profiling... whole processes — or let's call it "profiling full system = true". right, okay. and we're going to need some kind of performance event buffer to stash all these events into, because now we're going to generate a lot of them — previously we would only generate a perf event whenever the profiled process was scheduled. let me just show you what a perf event looks like, actually: here is the PerformanceEvent data structure. it has a type, then a stack size (how many stack frames), then the thread id and timestamp, then we have these perf events from malloc and free, and a union of data just to not waste too much space, and then all of the stack frames, with a limit of 32 apparently.
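to make that concrete, here is a rough sketch of the two pieces being described — the event record and the scheduler-tick check. the field layout, the is_profiling()/perf_events() accessors and the append() arguments are approximations for illustration, not the exact SerenityOS source:

```cpp
// Approximate shape of a performance event, as described above.
struct PerformanceEvent {
    u8 type { 0 };        // sample, malloc, free, ...
    u8 stack_size { 0 };  // number of captured stack frames
    u32 tid { 0 };        // thread id of the sampled thread
    u64 timestamp { 0 };  // uptime in milliseconds
    union {
        struct { size_t size; } malloc;
        struct { size_t ptr; } free;
    } data;               // event-specific payload, shared to save space
    static constexpr size_t max_stack_frames = 32;
    FlatPtr stack[max_stack_frames]; // captured return addresses
};

// In the timer tick, before picking the next thread to run
// (accessor names and append() arguments are placeholders):
void record_sample_if_profiling(Thread& current_thread)
{
    auto& process = current_thread.process();
    if (process.is_profiling())
        (void)process.perf_events()->append(PERF_EVENT_SAMPLE, 0, 0);
}
```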
so this is a perf event, and there's finite space for these: currently we allocate four megabytes of memory for the performance event buffer — right here, when we create one, we actually allocate four megs. four megs is fine for a single process, you can get a fair amount of profiling that way, but for the entire system we're going to need more than that. so i think what we'll do is have a way to specify a buffer size, defaulting to four megabytes. actually, let's organize this a little, because this has a big flaw: we're doing the buffer allocation in the constructor, so we don't handle allocation failure at all — if this fails to allocate because we're out of kernel memory, we just carry on anyway. so why don't we take this opportunity to make this better. instead of having people call the constructor, let's allocate before we ever construct anything. we'll make something like PerformanceEventBuffer::try_create_with_size(buffer_size), and we can hide the constructor down here in the private area — still making it explicit. in that function we allocate the buffer: buffer is KBuffer::try_create_with_size(buffer_size), and if that fails we can very comfortably just fail here; otherwise we adopt a new PerformanceEventBuffer and move the buffer we just allocated into it — although we should release_nonnull() because the OwnPtr might be null. so the constructor will instead take a NonnullOwnPtr<KBuffer> — wait, is that how we use KBuffer? i guess that makes sense, yeah. okay, that's fine, and that's what we're storing as a member too, so we can upgrade that one to a NonnullOwnPtr as well. and we had some more arguments here, like the read/write access and allocation strategy, so let's copy the code we already have and just override the buffer size. there's an important thing to note here: the page allocation strategy for performance event buffers is AllocateNow. most of the time we just reserve physical pages for memory allocations, but the performance event buffer has to be backed right away, because the next time we access it might very well be in the scheduler, during an IRQ — and when we're handling a timer IRQ we don't want to cause a page fault, so we have to populate this thing with physical memory up front, because we're not set up to handle page faults during IRQs. oh, and i guess i have to adopt_own that thing, actually — adopt_own creates a NonnullOwnPtr, adopt creates a NonnullRefPtr; maybe it should be called adopt_ref, i don't know. anyway, adoption is where we are. very good.
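the failable-factory pattern being set up here looks roughly like this — a minimal sketch where the KBuffer::try_create_with_size arguments follow the video's description rather than the exact tree:

```cpp
class PerformanceEventBuffer {
public:
    static OwnPtr<PerformanceEventBuffer> try_create_with_size(size_t buffer_size)
    {
        // AllocateNow: back the buffer with physical pages immediately, because
        // it will be written to from the timer IRQ, where page faults are not allowed.
        auto buffer = KBuffer::try_create_with_size(buffer_size,
            Region::Access::Read | Region::Access::Write,
            "PerformanceEventBuffer", AllocationStrategy::AllocateNow);
        if (!buffer)
            return {}; // out of kernel memory; caller can report ENOMEM
        return adopt_own(*new PerformanceEventBuffer(buffer.release_nonnull()));
    }

private:
    explicit PerformanceEventBuffer(NonnullOwnPtr<KBuffer> buffer)
        : m_buffer(move(buffer))
    {
    }

    NonnullOwnPtr<KBuffer> m_buffer; // can no longer be null: one less edge case
};
```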
now that we've set this up, we can look for the places where we actually construct these. so, Process::ensure_perf_events — right, this is called by the profiling_enable syscall, so we shouldn't trust this thing to just work every time; instead we should do something like: if the process fails to create its perf events buffer if needed, then we return ENOMEM. yeah, i think that'll be better. and then, who uses this? hmm, this find-usages thing in CLion is showing me way too much stuff and i can't get a good overview here — can i somehow expand all, can you simplify this view somehow? i just want to see one line per usage. i don't know; anyway, let's not get caught up in that right now. so, ensure_perf_events — i just want to see these — profiling, right. oh yeah, i didn't mention this: there's also a system call that allows you to inject a perf event of your choosing, so a process can insert any perf event it wants into its own perf event buffer, and we already have that integrated with our malloc implementation — you can set an environment variable and that causes malloc and free to generate perf events, and those can then be parsed by the Profiler program to build a picture of what was allocated at each point in time. it's not something we've used a lot lately, but it's something we will use eventually for sure — that's what this is for. so i guess we should have a failure case here too: if create_perf_events_buffer_if_needed fails, we didn't have memory for that; otherwise, yeah. so let's go and create this thing — where is perf_event? right here. and that should probably be private, actually, because nobody from the outside should be calling this; i guess we'll just put it in Process.cpp — or where was the ensure function — yeah, it goes here. okay, so PerformanceEventBuffer::try_create_with_size — for a process we'll use four megabytes (four mebibytes, if you will), and if that fails return false, otherwise return — oh, why don't you like this? oh, it's not static, got it. right. so let's do this change first, and then we'll get into the full-system part next, because this is actually nice: we're fixing the case where you try to profile something when you're just about to run out of memory, and that's a very good thing to fix. you're complaining about line 73 here — the buffer can no longer be null because it's a NonnullOwnPtr, so that's one less edge case. very good. and yeah, the actual capacity in terms of events in a performance event buffer is the size of the underlying KBuffer divided by the size of a PerformanceEvent struct — the size we pass in is not the number of events we can handle but the number of bytes we can store — but even so, four megabytes is at least four or five seconds of 100% CPU profiling. so let's just grab a profile — profilamundo. cool, everything still works, that's good. great, so let's commit these changes, because they're nice standalone changes anyway. did i forget to remove that? yes i did. okay, and i'm supposed to commit over here, which i keep forgetting: "Kernel: Better handling of allocation failure in profiling — if we can't allocate a performance event buffer to store the profiling events (this is profiling_enable and perf_event), we now fail with ENOMEM instead of carrying on with a broken buffer." so, what are you complaining about here — typo? that's not a typo; save it to the dictionary, that's a real word. okay, so the next step is to add the global profiling flag, so let's see about that. where does that go? i guess we can just have a global variable for now, so we'll put it next to the profiling syscall — it doesn't really matter at this point.
what did we have here — we had "profiling full system"; maybe we should call it "all threads", that's slightly more correct: g_profiling_all_threads. cool. and then the big thing we need before we set that flag is a perf event buffer — but a ginormous one. should we just call this "global profiling"? you know, "profiling all threads" is more descriptive, i like that better. so, PerformanceEventBuffer::try_create_with_size — let's do, i don't know, 32 megabytes, we'll start with that. and hmm, i wish we had a place to stash these, but i guess we're just going to have to leak it for now, because we don't really want to use startup-time constructors for this type of thing. so for now: g_global_perf_events — and then if there's already one, clear it, else create it. and let's maybe put this in a critical section, because it would be kind of gnarly if the scheduler came along and interrupted us during this whole affair, so let's make sure that doesn't happen. and we'll make the corresponding change down here in profiling_disable: if the pid is -1 and you are the superuser, we simply turn this thing off, and that should be good. so then, in the scheduler, when the timer ticks: profiling all threads — i already forgot what the other one was called — g_global_perf_events, right. so if g_profiling_all_threads, then we should most certainly have a global perf event buffer pointer; this becomes an else-if case, and then g_global_perf_events, and we just do this thing right here — adding a sample to it. it needs to know what process we're in and stuff like that, but i guess the perf event buffer could figure that out — does it know what process it belongs to? no, it's passed in when we serialize it. hmm. there are a bunch of things we're going to have to fix here, because until now there's only ever been one process being profiled, so there's been no need to keep track of which process a sample is in, or what was mapped at which memory address in that process — but that kind of changes now. the performance event buffer needs to also keep track of memory mappings, i suppose. okay, you know what, let's just do one thing at a time; we'll progress forward one step at a time. so maybe we can reorganize this: we simply do this, this, and if perf events, and that. cool — a little deduplication here; essentially these two can move up here. all right.
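roughly, the pieces just wired up look like this — g_profiling_all_threads and g_global_perf_events are the globals from the video, while the syscall signature, clear(), and the append() call are paraphrased:

```cpp
bool g_profiling_all_threads { false };
PerformanceEventBuffer* g_global_perf_events { nullptr };

KResultOr<int> Process::sys$profiling_enable(pid_t pid)
{
    if (pid == -1) {
        // Whole-system profiling is superuser-only.
        if (!is_superuser())
            return EPERM;
        // Don't let the scheduler tick interrupt us while setting this up.
        ScopedCritical critical;
        if (g_global_perf_events)
            g_global_perf_events->clear();
        else
            g_global_perf_events = PerformanceEventBuffer::try_create_with_size(32 * MiB).leak_ptr();
        g_profiling_all_threads = true;
        return 0;
    }
    // ... existing single-process path: create that process's buffer, mark it profiled ...
    return 0;
}

// And in the scheduler's timer tick, conceptually:
//   if (g_profiling_all_threads)
//       g_global_perf_events->append(...);           // sample whatever is running
//   else if (current_process.is_profiling())
//       current_process.perf_events()->append(...);  // the pre-existing path
```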
so, now that we record backtraces in every running thread, we need a way to extract this stuff from the kernel. we can't really dump a perfcore anymore, because that happens on process exit and not everybody is going to exit, right? and also we don't want to produce a perfcore for every individual program that was profiled, but rather a whole-system profile all bundled up into one thing. so i think what we really want here is a proc file system entry. in ProcFS, here is /proc/PID/perf_events, which does this for an individual process: the way it works is we find the process that this perf_events inode corresponds to, prevent interrupts, and then serialize the perf event buffer for that process to JSON, passing in the pid and the executable path, which are then included in the JSON output. so i think what we'll do here is just add a new proc file. we used to have /proc/profile before it moved into /proc/PID, but i think maybe we can bring back /proc/profile — it's just that it will be the full-system profile. (we just forgot to remove this entry previously, that's why it's greyed out here; we'll just bring it back.) it will become our full-system profile thingy, so we'll put it maybe here: "profile". the proc file system is a slightly confusing place, but essentially it's based on callback functions that serialize kernel state into a buffer, and the last entries in these structs are the callback functions. so this one is going to be called profile, and we'll write it here. we get the inode identifier, but we don't need that because we know it's /proc/profile — it's not per-pid — but we do care about the buffer builder, so we have a builder, and now we're simply going to... let's see, i guess we need to get at that global perf event buffer we were talking about: extern this guy right here, g_global_perf_events. okay, so if we don't have it, we can just return false — that's how you signal that we can't read this, that there was some error reading this ProcFS file — and if things are good we return true. and now what remains is simply serializing this thing to JSON. actually, maybe that returns something, because i think that's what we're doing down here: to_json, and we pass in the builder — and we can't really pass in a pid or an executable path, because we don't have those things, so we're going to have to figure that out some other way. so, to_json on g_global_perf_events — this one takes a KBufferBuilder and is used by the full-system /proc/profile. okay, so what does the existing one do? it looks up the process from the pid, grabs the process space lock so the address space can't change while we're serializing, creates a JsonObjectSerializer, prints out the pid and the executable, and then dumps out the address space. so for the full-system profile we're going to have to dump out all of the processes that are part of the profile — we're going to have to change the JSON format of profiles, basically, because currently the pid and the executable path are top-level entries, and instead we'll need a "processes" array at the top level of the profile. i think that'll make sense, and in there we'll store pids, executable paths, and memory layouts — these regions right here. but for now let's just make it so we can serialize one of these things. down here is the actual meat and potatoes of serializing the events, and up here is just the profile-specific stuff, so maybe we'll split this into a separate function — something like to_json_impl — and it will take a serializer. let's see: because JsonObjectSerializer is a template class that can target any builder — we have two builders, StringBuilder and KBufferBuilder — you don't see it here, but it's actually templatized on the input, and that's why we need to templatize the serializer here.
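a minimal sketch of the new read callback and the templatized split, with the ProcFS plumbing heavily abbreviated and the names taken from the narration rather than the exact source:

```cpp
extern PerformanceEventBuffer* g_global_perf_events;

// Callback behind /proc/profile: returning false signals a read error.
static bool procfs$profile(InodeIdentifier, KBufferBuilder& builder)
{
    if (!g_global_perf_events)
        return false; // whole-system profiling has never been enabled
    return g_global_perf_events->to_json(builder);
}

// The shared "meat and potatoes" goes into a helper templatized on the
// serializer, so it can target either builder type:
template<typename Serializer>
bool PerformanceEventBuffer::to_json_impl(Serializer& object) const
{
    // ... serialize every recorded PerformanceEvent into `object` ...
    return true;
}

bool PerformanceEventBuffer::to_json(KBufferBuilder& builder) const
{
    JsonObjectSerializer<KBufferBuilder> object(builder);
    return to_json_impl(object);
}
```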
so we'll do serializer — call it "object", like the one down here — and then we move the aforementioned meat and potatoes up into to_json_impl(object). yeah, i think that seems okay. oh, i didn't actually add the new variant, so that is PerformanceEventBuffer::to_json — this one right here — calling to_json_impl(object). okay, so let's see if we can capture a full-system profile and then get some JSON out of /proc/profile, that would be cool. we have to be root; now we enable profiling for pid -1, so this should be capturing stacks for the whole system; we disable it, and /proc/profile is there, so let's cat it. okay, we have stuff! let's run it through jp so we can see what it looks like a little better. all right — so tid is the thread id, and it's zero in all of them. that's a little suspicious; i feel like it should be something other than zero sometimes, but maybe we're not populating that field. let's use tomnomnom's gron tool actually, and grep through it — yeah, they're all zero. so something is preventing us from setting that correctly. where do we normally populate it? PerformanceEvent, event, type, la la la — where would we be setting the tid? i have a hunch that nobody's actually setting it. also, what the heck is this? this doesn't seem related at all — oh, CrashDaemon — okay, clangd, you silly, this is not the same tid, why are you conflating those? anyway, i guess nobody's setting it, but we can set it — where do we set the timestamp? event.timestamp — here we just get the current uptime in milliseconds, sure, so we can easily smash in the tid here as well by doing Thread::current()->tid().value(). let's see if that works. we haven't actually done any multi-threaded profiling either — now we're doing multi-process profiling, but we never did multi-threaded profiling before; i guess we just never got around to profiling something that was interesting in that way, but now we have to deal with all of it. and i keep forgetting what i'm supposed to do: /proc/profile — okay, so now we have tids here. so let's run that and grep for it. cool — and you can actually see that a lot of these have tid zero; tid zero is the kernel process, which is the idle task, so we're actually sampling idle time right now as well, and that's really going to bloat up our buffer. so i think we should opt out of that for the moment and only capture samples where something was actually scheduled. i mean, it's nice that we're idle as much as we are, but in the interest of capturing interesting profiles, let's not capture the idle task for now and leave a FIXME about maybe making it an opt-in mode — you could imagine wanting that, because it can be really interesting to capture idle samples as well, but it's also more costly. so i'm just going to disable idle frames for now, and we'll probably get back to that at some point when i want to use them for something. so here, for profiling all threads: do we have a pointer to the current thread right here? tid is zero... do we have a way to look up the kernel process here somehow? i was thinking maybe some of the surrounding code was already looking at this, because what i really want to know is whether the thread is the current processor's idle thread — that's really the more correct check to be doing.
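the two fixes just discussed — recording the tid and skipping the idle thread — boil down to something like this (helper and accessor names are paraphrased; only Thread::current()->tid().value() is quoted from the narration):

```cpp
// In the scheduler's timer tick, before appending a whole-system sample:
void record_global_sample_if_needed(Thread& current_thread)
{
    // FIXME: We don't currently collect samples while idle; an opt-in mode
    //        for that would be interesting to add in the future.
    if (&current_thread == Processor::current().idle_thread())
        return;
    if (!g_profiling_all_threads || !g_global_perf_events)
        return;

    PerformanceEvent event {};
    event.type = PERF_EVENT_SAMPLE;
    event.timestamp = TimeManagement::the().uptime_ms();  // current uptime in ms
    event.tid = Thread::current()->tid().value();         // the missing tid
    // ... capture the stack frames and stash `event` in g_global_perf_events ...
}
```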
so: current thread is not... FIXME: we currently don't collect samples while idle — that will be an interesting mode to add in the future. let's see that this works, because if we captured the idle stacks we would just bloat this thing up immediately. the single-process profiling mode obviously doesn't capture idle stacks, because you'd never enable it for the idle thread — you can't do that. oh right, okay, and /proc/profile: yes, now we only have 18 events instead, so substantially fewer, and if we grep for tid we can see there are no tid zeros anymore, so very good. okay, i think i'm happy with what we have here so far, so we're going to commit this — actually, let's just verify that we didn't break the old mode first. yeah, we're still good. okay, let's commit in CLion: "Kernel: Start work on full system profiling. The superuser can now call sys$profiling_enable() with PID -1 to enable" — wait, i can't see the break marker here; this is the 72-column marker, but if you do this you can't see where it is, so we've got to expand a bit — "profiling of all running threads in the system. The perf events are collected in a global PerformanceEventBuffer, currently 32 megabytes in size. The events can be accessed via /proc/profile." also, we should make sure that file is superuser-only — actually i think it already is; yeah, only root can see that thing, which makes sense; this is none of your business if you are not root. okay, cool. anything else to note? no, that's probably pretty good. all right, so let's start looking at actually opening these things in the Profiler program. what we want to do now is change the format in such a way that the Profiler can make sense of it. let's take a look at generating a profile the way it is today and what's actually in one of these files: we now have our perfcore here, and if we look at the top of it, at the top level we have the pid, then the executable path, and then the memory regions in that program. so as i was saying earlier, we're going to want to move this into a processes array — or a threads array; processes array i guess makes sense — and then we'll need to dump the executable path and regions for every process that we've captured stacks from. also, it seems like the more pager program is not working quite right — also i cannot break out of it; well, now we made the shell quit. okay, this is not good, we need to improve our more program. anyway, let's change the format of the profile. i think we can make a change that remains compatible with single-process profiling: we just need to produce an array instead of putting these things at the top level, and we can still leave all of the sampled stacks in a top-level array, because you can go from each sample's thread id to the relevant process — you should be able to look that up from the processes array. so let's move that into its own thing. i think we're going to have to change the way the performance events are stored a little bit, maybe, because it's not obvious when we should capture the memory layout of a program:
if you capture it the first time you get a sample of a process, then it hasn't been dynamically linked yet, so you don't even have all the libraries loaded. so you need to capture it later on, once the memory layout has all the code loaded — but how do you know when that is? this is a bit tricky. i guess when a process dies it could check if it has contributed samples to the global perf event buffer, and if so it could, in its dying breath, add all of its memory regions to it — that would be one option. another option, which would be terrible, is to do it every time we capture a stack, but that's just extremely inefficient, way too much work. another option is to do it when we actually collect the profile, but then we would miss programs that have already exited, which is not good, because one of the things i want to do with this is, say, compile a package inside Serenity and then look at the whole-system profile to see what was actually going on — first we ran make, then make spawned a gcc process, which spawned a cc1 process, which spawned a linker in the end, and so on — and you want to be able to see that whole chain, right? so we'll have plenty of exited processes in our interesting profiles, and i don't think it's enough to just do this on exit, or to only do it when we serialize the perf events. one thing we could do instead: in the dynamic loader, right before it calls into main, it could give the kernel a hint — "this process has finished mapping its executable code; now would be a good time, if you are profiling, to capture the state of the address space." maybe that's what we should do. although that still doesn't perfectly cover already-running programs, or the case where you start profiling while a program is in the middle of dynamic loading but not yet finished — or actually, no, it does cover that... yeah. it's a decent option, and if something is running and already loaded, then the first time we sample it we also capture the address space. yes, i think that's good. so basically there will be a way to sort of seal the address space for purposes of profile capture. yes, we can do something like that — we'll try it, and maybe there are better approaches we can think of later, but we have to start somewhere. essentially we can take advantage of something we already have: if we look in the dynamic loader, right before we call the main function of the loaded program — right here, where we call into the entry point — one of the very last things we do is call msyscall, which tells the kernel the placement of LibSystem, the only library in the system that's allowed to call the kernel. so the very last thing we do before we call main is call msyscall, and we could just hook into that. when you call it with null, we start enforcing the syscall regions, and that's the point where we want to capture the address-space layout for profiling. so: if g... perf events, what was it called — g_global_perf_events, yeah — although now i'm getting ahead of myself, because i wanted to do something else first: i wanted to do the part where we change the format of the JSON, and then i got into this instead.
okay, let's do the JSON thing first, because otherwise — if we do this first — it'll be hard to get those things into atomic commits, and i would like to do that. let me just write a quick note-to-self: what we want to do is capture the executable address-space layout of the process at this msyscall and store it in the global perf events buffer, and for sampled processes that had already called msyscall, capture their address-space layout on first sample. yes, that's what we're going to do — but let's do the JSON thing first. so when we dump out this JSON stuff right here, to_json — oh yeah, right, we need to change the struct. it shouldn't be a PerformanceEvent but rather a separate thing, like a SampledProcess, let's say. in here we'll store a pid and an executable path, and the memory regions need to go in here too. what do they look like? a region is essentially a name and a range, right — regions have a name, a base and a size — and everything else can be worked out from that. so SampledProcess will have a struct Region with a String name (these don't need to be packed, it's not such a big deal) and a Range range — that's it, that's what goes in there — and it does not need to be seen outside, so we'll just put it in here. do we need to track the tids that belong to it? i think we do. okay, wait, hold on: SampledProcess — i think we'll store these in a HashMap. wait, can we have a HashMap keyed on a ProcessID? i don't even know if we do that kind of thing... seems like we can, okay. SampledProcess — maybe we'll store it that way: m_processes. so you look up a process, you find this thing, and in here there will also be a vector — or, you know what, let's just put a HashTable<ThreadID> threads. cool, this is all the information we need. and then when we do this thing right here, we're not going to serialize this on the fly; it'll have to be done somewhere else. so to_json — this whole thing needs to disappear, actually. in ProcFS, where we call this — which is here — i think we'll simply do this, and that information has to have already been stored in there. although, who does that? when you start profiling a program — when you do profiling_enable — i guess we'd have to do it then, here: PerformanceEventBuffer::add_process. yeah, let's do this — collect, or add_process let's say — and let's make it take a const Process&, because we're not going to mutate the process, we're just going to collect some personally identifiable information about it. let's see — first check if we already have it... or, i guess when you call this we'll always either update or replace the existing information we have about the process, although we don't want to lose any threads that may have been killed. hmm, maybe we'll ignore the killed-thread issue for now — we'll leave a FIXME about threads that have died. then we don't need this, and we can just do SampledProcess, and let's see if we can initialize it: pid is process.pid(), executable is the process executable's absolute path — i think we need Kernel::FileSystem::Custody, is that not what we need? oh, maybe i should have stored this as a ProcessID; actually it's a pointer, hmm.
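pulling those pieces together, the new bookkeeping might look roughly like this — member and helper names follow the narration, while the exact kernel accessors (executable(), for_each_thread()) are approximations:

```cpp
struct SampledProcess {
    struct Region {
        String name;
        Range range; // base address + size of the mapping
    };

    ProcessID pid;
    String executable;
    HashTable<ThreadID> threads;
    Vector<Region> regions;
};

// PerformanceEventBuffer keeps them keyed by pid:
//   HashMap<ProcessID, NonnullOwnPtr<SampledProcess>> m_processes;

void PerformanceEventBuffer::add_process(const Process& process)
{
    // FIXME: Threads that have already died won't be recorded here.
    auto sampled_process = adopt_own(*new SampledProcess);
    sampled_process->pid = process.pid();
    sampled_process->executable = process.executable() ? process.executable()->absolute_path() : String {};
    process.for_each_thread([&](auto& thread) {
        sampled_process->threads.set(thread.tid());
        return IterationDecision::Continue;
    });
    // ... walk process.space() under its lock and append a { name, range }
    //     entry per region (next step in the video) ...
    m_processes.set(process.pid(), move(sampled_process));
}
```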
so i guess we'll do something like this then — is that a complicated getter? no, it is not, okay. and then threads — we'll just fill that in momentarily: process.for_each_thread, sampled_process.threads.set, presumably thread.tid() — i think that's how it would work. should we even do it that way? no? okay, it looks better that way. and then finally: processes.set(process.pid(), move(sampled_process)) — oh, and this expects an OwnPtr, actually; no big deal, there we go. okay, so that's probably good. then we also need to capture the memory regions — i forgot to make something of that. so, regions — there we go. let's take a lock on this guy; that will be process.space().get_lock() — oh, it's a spin lock actually, a recursive spin lock, and ScopedSpinLock is what we take here. okay — and we'll be in and out pretty soon. then we just need to walk the memory regions of this thing: for each region, sampled_process.regions.append with name = region.name() and range = region.range(). cool, nice and tidy — this code actually looks really tidy; i don't know why, i just like the way it looks today. okay, so let's use that stuff now when we serialize. instead of doing this gunk, we need to start walking all of the things that we have — and let's put that up here: for each process in m_processes, sure. then we want to start an array: object.add_array — call it the "processes" array, maybe. okay, and now we add an object per process — process_object (my brain wants to turn that into "project", which in some way makes sense, but it still doesn't). wait, do we need to name this object? no, we don't want to name it, that's not a thing — just add the object, and then we want to add keys to it; it should give me a serializer, yeah. then i can add pid — process.pid — no, what's in there? oh right, it's a pointer; can we avoid that being a pointer by doing this, perhaps? oh wait, it's a HashMap, not a HashTable, so process is it.value — yeah, that's what we want. okay, so add executable, and then the regions — there we go, finish — what comes next? the regions array is process_object.add_array("regions"), that's what i want to say. i'm not super used to using these JSON serializers, but they are awesome: basically they allow us to build up a JSON tree step by step without actually building a tree in memory — they just serialize straight into a text buffer. a very awesome kind of class, written by Sergey i think — sorry if you wrote these and it wasn't you — very cool. we used to build up entire JSON object trees in the kernel, which was very time consuming, so he came up with these serializer templates that just append characters to an arbitrary builder class, so you can use them with either StringBuilder or KBufferBuilder. very cool. okay, so we want to build the regions array — for each region in process.regions... oh wait, i don't need to go through anything else, these are just the regions: regions_array.add_object. also, will these things not finish implicitly? oh yeah, they finish on destruction if not already finished, so i don't actually have to do that — that's kick-ass — but i probably do want to finish here, otherwise it won't be destroyed until after this. so here i need to do that. okay: add name, region.name; base, range.base().get() — oh, you don't like these, because it should be "add", okay. i'm really struggling to remember the API even though i literally just used it.
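a condensed sketch of the serialization just written — the add_array/add_object/add calls mirror the AK serializer API being described, while the surrounding member names are paraphrased:

```cpp
template<typename Serializer>
bool PerformanceEventBuffer::to_json_impl(Serializer& object) const
{
    {
        auto processes_array = object.add_array("processes");
        for (auto& it : m_processes) {
            auto& process = *it.value;
            auto process_object = processes_array.add_object();
            process_object.add("pid", process.pid.value());
            process_object.add("executable", process.executable);

            auto regions_array = process_object.add_array("regions");
            for (auto& region : process.regions) {
                auto region_object = regions_array.add_object();
                region_object.add("name", region.name);
                region_object.add("base", region.range.base().get());
                region_object.add("size", region.range.size());
            }
            // Each nested serializer finishes itself when it goes out of scope,
            // so the closing braces are emitted in the right order here.
        }
    }
    // ... then the existing top-level "events" array with all the sampled stacks ...
    return true;
}
```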
okay, and then we can get rid of this whole thing here. oops, excuse me: line 204, missing initializers — well, aren't you fancy. so i'm missing initializers for regions, i guess. and then what am i not doing nicely? not returning an IterationDecision here — just Continue. and then what? oh, threads as well, right, and they need to be in a different order. and now what? oh, we're actually calling to_json somewhere — would that be when you finalize it? where do we dump the perfcore? dump_perfcore, that's probably where we're doing that — yeah, when we dump the perfcore file on process exit we use this one, but we shouldn't need to do that anymore, so we should just be able to simply call to_json — although this is an alternate version of the API. wait — oh, but that guy calls that goofy API. oh okay, that's fine, we'll just do it this way: get rid of this API, and we can also get rid of that API, and now you have these two and can use whichever one you like. also, this one could probably be const if that one is const, because why not — it's weird that one is and one isn't — so let's make it const. and what are we missing now? this thing — oh, that doesn't exist; oh wait, did i just nuke that? i'm confused. maybe we shouldn't even offer that as an API — you can just make your own buffer. so we'll do this, and then the perfcore generation thingy can just make a buffer here. let's see: KBuffer::try_create_with_size — how much space do you need? how much space do we normally allocate, what's the default of this thing, is there a default? okay, KBufferBuilder uses four megs, okay, that's fine — let's just do it this way. KBufferBuilder, by the way, is kind of a dumb API, because it also has this flaw that it will allocate a buffer and has no way to tell you that something went wrong, so it's a very confusing API and we should probably get rid of it — but not at this very moment, right now we have something else we're doing. Process.cpp:456 —
something is wrong here with to_json — oh yeah, right. so at least now we should be able to capture a profile and see that the JSON looks good. let's just capture some desktop action here, just selecting a bit — and sure, it didn't like that, i should have thought of that; well, we'll do it this way instead — exit, and we do get a perfcore. so let's take a look at this magnificent thing: we now have a processes array, and here is one entry — pid 29, executable /bin/profile — and all the memory regions in that thing. looks very good, that's really cool. so let's see if we can trick it into doing it on msyscall as well — or, actually, no, let's not get ahead of ourselves, i keep trying to get ahead of myself. first let's make the Profiler program actually understand the new format. in the DevTools Profiler program we have the Profile class, which is responsible, as i recall, for load_from_perfcore_file, and it's going to fail right here — this is where we fail now, because it looks for a "pid" key at the top-level object, and also for the "executable" key. instead we now need to look for "processes": if the object doesn't have "processes", then return a String — "Invalid perfcore format (no processes)". and then we need to actually gather up all of these processes — the same concept as in the kernel, actually, so i guess we can just grab this struct from there, because we're going to soak this up. maybe we'll put it in the Profile class declaration as a private class; of course this can just be pid_t, these will just be ints, and here we'll have a base and a size instead — but other than that we're good, and we'll call him Process. okay, so now to actually populate this thing: object.get("processes") — if it's not an array... actually, let's do that separately: processes_value — if we don't have that value... yeah, then we don't have to do multiple lookups; look at us being good boys, avoiding multiple lookups. although null is a bit weird here with object.get — i mean, that's okay. if processes_value is not an array: "processes is not an array"; otherwise as_array().values(), so now we can iterate the process values. okay, so if a process value is not an object, let's fail again: "Invalid perfcore format (process value is not an object)" — this is kind of tedious, but it might help us catch something going wrong, so i'm just going to do it. process_value.as_object() — okay, at this point we're just going to make some assumptions: Vector<Process> processes; Process sampled_process; pid is process.get("pid").to_u32(), executable is process.get("executable").to_string(), threads is... regions... hmm, should that be i32 maybe? yeah, okay, we'll do that. and then we also need to get all of the threads and regions — i feel like i didn't actually export all the threads. hmm. you know what, i'm just going to skip over the thread part right now and focus on processes, and we'll have some FIXMEs about threads anyway. so we'll focus on regions instead: process.get("regions") — if it is null, then "Invalid perfcore format (no regions for process)", that doesn't look good; actually, we can even say if it's not an array, rather — yeah, let's do that first just to get that free sanity check. then for each region in regions_value.as_array().values(): sampled_process.regions.append, name is region.get("name").to_string(), base is region — oh wait, these are JsonValues, we need to turn them into objects with as_object().
okay, here we go: size is region.get("size").to_u32(). cool. and then finally: sampled_processes.append(move(sampled_process)), and we will have all of those boys collected. cool, and then we can get rid of this. and the regions are now per-process... library metadata — okay, so this has to be reorganized now: the Profile needs to keep track of processes, and each process has library metadata, so maybe we'll move this up. okay, so LibraryMetadata becomes a member of Process — you can't be there anymore; executable_path goes away, events stay, everything else kind of stays, but the processes change, right. okay, and then the constructor, i suppose, will simply have to receive not the executable path nor the library metadata but rather a Vector<Process> — yeah, this is what has to happen, and that's also what we'll keep as a member here: Vector<Process>. cool. remove the sampled processes, move the events, and library metadata goes up here somewhere. let's see — regions values — oh wait, so it wants the value, but that's okay, we can do it that way. so then we just pass it in here: library_metadata is... boom. it's kind of awkward-looking, actually — we can remove the m_ prefix. also, this thing should probably be renamed to something like ObjectMetadata — i don't know exactly what to call it — because it's not strictly for libraries only, we also have an entry for the main executable, so LibraryMetadata is a misleading name. anyway, that's cool. and then i feel like we were using that down here — yeah, so here we need to know which process we're looking at, but we can do that by looking at the thread id of the sample — or of the event, rather. so let's just collect that: we don't even grab it at the moment, but we'll start doing that right now — perf_event.get("tid").to_i32(). there we go, now we have the tid. and then what did we do with that? yeah, as i was saying, we're going to skip over the part where we have multiple threads in the same program — we'll have to come back and fix that — and right now we're just going to assume there's one thread per process, which is obviously not correct, but it will allow us to move a lot faster. so, the library metadata: first we have to look up the sampled process in the vector of sampled processes — sampled_processes.find... wait, what do we have, i'm trying to remember what kind of API we have — first_matching, that's what i'm looking for. what does that return? Optional<T>, that's good. entry.pid == event.tid; FIXME: this doesn't support multi-threaded programs. and then if process.has_value(), then library_metadata... yeah, that's really kind of gnarly — FIXME: this logic is kind of gnarly, find a way to clean it up — but still, that's okay, i guess. so we just do a linear scan through the vector of processes and find one whose pid matches the tid of the event; the main thread of every new process has the same thread id as its process id, so we're just exploiting that fact here. then let's fix this constructor. cool, and i don't know what you're complaining about, but i think this is some clang thing, so i'm just going to fix that up. okay — i should try to build in CLion more often, i just keep forgetting to do that, but it is nice to be able to click on the errors. "Profile has no member 'libraries'" — right, it doesn't have that anymore, because we removed it.
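on the userspace side, the parsing just walked through might look roughly like this — a condensed sketch using AK's JSON API, with the error reporting and the exact Profile plumbing omitted:

```cpp
struct Process {
    struct Region { String name; u32 base; u32 size; };
    pid_t pid { 0 };
    String executable;
    Vector<Region> regions;
};

static Optional<Vector<Process>> parse_processes(const JsonObject& object)
{
    auto processes_value = object.get("processes");
    if (!processes_value.is_array())
        return {}; // "Invalid perfcore format": no processes array
    Vector<Process> processes;
    for (auto& process_value : processes_value.as_array().values()) {
        if (!process_value.is_object())
            return {};
        auto& process_object = process_value.as_object();
        Process process;
        process.pid = process_object.get("pid").to_i32();
        process.executable = process_object.get("executable").to_string();
        for (auto& region_value : process_object.get("regions").as_array().values()) {
            auto& region_object = region_value.as_object();
            process.regions.append({ region_object.get("name").to_string(),
                region_object.get("base").to_u32(),
                region_object.get("size").to_u32() });
        }
        processes.append(move(process));
    }
    return processes;
}

// Later, each event's owning process is found by exploiting the fact that a
// process's main thread shares its pid:
//   auto it = processes.find_if([&](auto& p) { return p.pid == event.tid; });
//   // FIXME: doesn't support multi-threaded programs yet.
```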
so, this is the DisassemblyModel, right — instead of looking at the profile's libraries, this guy now has to look at the relevant process for the current profile node. but every profile node belongs to a specific process, so that should not be a problem. do we have the Profile class here? i feel like we don't — oh, it's below us. okay, so what we want is just to get to the process, so we can give every ProfileNode a pid — i think that would be okay; every profile node can just know which pid it belongs to, no big deal: m_pid, let's say. and when you create one of these bad boys, you've got to provide the pid — dude, you've got to provide the pid right here. okay, and then in the DisassemblyModel of fame, we can't do this, but we can get the process — actually, we should have an API for this, something like ProfileNode::process(). okay, let's yoink this and put it up here, i'll make it a lot friendlier, and i think we'll yoink the library metadata as well, just because — why not. and then here you need to provide a pid. and how dare you call it without a pid. and now we're cheating again — i'm using the tid instead of the pid; FIXME: more cheating with pid/tid mixing, intentional mixing here. yep, yep, and then this thing also needs a pid. all right, all these changes are very mechanical, so i'm not even really thinking much about what i'm doing, i'm just fixing the breakage — the consequences of moving those two classes out of the Profile class. so we should be good there. and now we have processes at the top level, so we should be able to provide an API on the ProfileNode — which is a node in the call tree, by the way. here we should be able to offer something like const Process* process() — which, of course... wait, how do you do this? can we get to the Profile from the ProfileNode? hmm, ProfileNode doesn't have a back pointer to the Profile — i thought it did. okay, well, no big deal; we'll just use a little sneaky trick you can do in C++: if you don't have a pointer to something as a member, you can always just take it as an argument. so: profile.find_process_with_pid — yes, and we'll give the Profile an API to look it up. it's all coming together. i guess we can just do m_processes.first_matching... or, no, better — okay, and what did i screw up here? first_matching is non-const, come on — let's cheat, i don't want to go and change Vector right now. okay, and why don't you like that? wait, what does first_matching return? oh, it returns by value — that's not what i want; i wanted an iterator. do we have find? we do have find, but find takes a value — i want find_if, right. okay, i'm not used to find_if — somebody else added it, and it is nice, it's just not in my brain yet. find_if returns an iterator, right, so m_processes.find_if(entry.pid == pid), then we have the iterator, and if it is end() return nullptr, otherwise return it... no, how does that work — the process is right there. okay, wow, that is really awkward-looking. does this help the situation? maybe slightly; maybe there's a much better, more idiomatic way of saying what i just said here, because this looks really dumb, i don't know. anyway, what are we doing wrong on line 312? let's find out — i'm trying to copy-construct a Process; oh wait, is this the same problem right here? i wanted find_if — oh, here i can just use it->, right. okay.
and then we have this problem right here, but this should now be fixable, because all we need to do is node.process() — pass in the profile, as we said we would — and then ->, hoping for the best. and then library_metadata is... ooh, this is kind of rickety-looking, i don't like that — FIXME: this is kind of rickety-looking with all the... okay. and we have at least opened it, although it seems like we are not comprehending what's going on in here, so not super great, but we can see the samples — they're right here, this is what they look like. so what would it look like if we profile the whole system now? (this is enable and disable, by the way, in case that wasn't obvious.) segmentation fault, null pointer dereference — well, that is just a bit smelly; let's get a crash reporter for that. we're crashing in LibraryMetadata::library_containing, so i bet we're calling this on a null pointer — probably right here, or wait, where is that — load_from_perfcore_file, actually. okay, by the way, we should probably fix this thing: the fact that this is a segmentation fault, and we crashed because of a null dereference, but that doesn't appear anywhere in the crash report. we should really display the address that caused the segmentation fault, and whether it was a read or a write, stuff like that. anyway, library_containing on line 438 is crashy, sure, and on line 317 we're calling it right here, and i guess it's because of this thing — right, because i'm stupid: we need to be null-checking this pointer before calling through it like that. that was dumb. okay, let's try again. okay! this is our full-system profile — of course, we can't see which process, pid or thread id anything is in here, so we could definitely add a column to the samples; that would be, i guess, the first thing to do with this. i think maybe we won't do the call-tree aggregation today, we'll only do the samples view, so let's go and add a thread id field here — or a process id, maybe; i'm thinking about what the most honest thing to call it is. sample index, timestamp, thread id — it is going to be the thread id, so let's put the thread id there ("TID" is fine) — event.tid, boom. i cannot wait to see that; it's going to be super interesting, so let's do something exciting like opening a browser. okay, so we actually see kernel frames here, that's neat, and yeah, you can see the thread id, so very cool — we're now capturing stacks in all threads. we can only see the kernel symbols so far, which is still very interesting, but obviously it would be even cooler if we could see the user-space symbols as well. if we flip that — it's so cool to see the whole system; even if we don't separate things by process here, it's still really cool to see what the most common thing is to happen in the kernel across all processes. and it seems like, in this browser-launch test i just made, that is the select syscall, making a vector of ints. we can definitely improve that situation — there's no need for select to have a dynamically allocated vector, since select is bounded by the FD_SETSIZE constant. so where is that fds vector — in fact, wait, this vector already has inline capacity — oh wait, it's a Vector<int>, wait a minute. okay, i'm just going to allow myself to go down this sneaky little investigation: oh, look at this, this vector here doesn't have inline capacity, so we're actually hitting the kernel heap in select quite a bit — it's like 3.4% of this.
i mean, admittedly a very random thing i just did — just launching the browser — but even so, that's kind of interesting. we have string allocations in directory traversal looking pretty heavy — interesting, very interesting. oh, /proc/all, that's cool — that's reading /proc/all, which is read by this guy up here and this guy; they both read it separately, so we could definitely do some consolidation there — having a single process read it for both of them, for example, would be a start and would reduce this. anyway — ah, dude, all right, we're going to have so much fun with this feature, but let's improve it a little first. we're still going to need to see what's going on in user space here, right — kernel symbols are super interesting, but we need to see the whole picture. so what's going on with user space — why can we not see what's loaded? i guess because of add_process, right — we never actually populate that thing. so let's do that msyscall trick we talked about; finally, now is the time for that. so when msyscall is called here, we'll do the hack: if we're doing a global profile at the moment, add this process to the global perf event buffer. we do it here because at this point the executable parts of the address space should be final — mostly final — because almost no programs actually add more executable memory once they've started; we don't have any program that does this right now, though in the future we might. so what we want to do here is just check if we need that thing: his name is g_profiling_all_threads, his friend is mr. g_global_perf_events, and we'll put this in a ScopedCritical here, then add_process(*this). okay, that's cool. and then we also need to do it according to what i wrote down: if somebody has already called msyscall, then we need to capture his address space when we first sample him — i think we can do that without causing page faults. so, in the scheduler — where does that go... hmm, okay, let's put something in here, maybe — no, wait, here we don't know if we're global or not, so we'll do it here: if the current thread's process space enforces syscall regions — if that is already true — then we need to perf_events->add_process(current_thread->process()). but this would do it on every tick, and we only want to do it once — this is kind of awkward. okay, so what if we just do it every time for now, just to test it?
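the msyscall-time hook just described might look roughly like this — a sketch, with the syscall signature and the surrounding enforcement logic paraphrased rather than quoted from the tree:

```cpp
KResultOr<int> Process::sys$msyscall(void* address)
{
    if (address == nullptr) {
        // The dynamic loader is done mapping executable code, so the address
        // space layout is now (mostly) final — snapshot it if we're profiling.
        if (g_profiling_all_threads && g_global_perf_events) {
            ScopedCritical critical;
            g_global_perf_events->add_process(*this);
        }
        // ... then start enforcing the syscall regions, as before ...
    }
    // ... handle non-null addresses (registering LibSystem's text region) ...
    return 0;
}
```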
what am i doing wrong — Map.cpp:5? oh wait, i don't see that class declaration. okay, how will this go? okay, looking somewhat promising — look at all these user-space things. oh dude, this is looking very spicy, and even the call tree here actually totally works; it's just that we don't know which process is which, because it doesn't aggregate per process, it doesn't keep track of that — but we can see it for the individual samples, which is just so cool. this is so neat. (why is this thing so wide? i think it could be much less wide.) so what are we doing in user space? i was thinking about what would be an interesting thing to do — let's make a new profile and do something like loading google: so, do that, open google.com, and we stop the profile. here we should be able to see all kinds of interesting stuff — like the browser launching, me clicking google, and then talking to the web server and everything; so many different things are involved in this. and now we can see that LibCrypto is the biggest user-space CPU hog, clocking in at 13.66% — and of course it's actually in the ProtocolServer process, which we don't normally profile that much, because i'm always focusing on the browser; but now that we can profile the whole system at the same time, we can actually see into all of these helper processes. yeah, you see here ProtocolServer is the program doing this. ah, dude, i love this — this is so cool, man, i'm just so happy about it, i don't know what to say; this is my favorite new feature by far, and i'm going to use the heck out of it. there's the HTML parser, of course, not looking terribly bad. there's probably a lot we can do in the crypto library to make it faster, because it's the big-int implementation that's soaking up the time here, and i think we've barely scratched the surface of optimized big-int implementations — we have some naive allocation-avoidance optimizations, but i'm sure there are a lot of tricks we can do to make this a lot faster. and, you know, are these two actually calling each other, or what? divide — divide calls both, yeah. so i guess we can flip this to get a top view instead, to see — oh, we crashed the heck out of it; well, i can't have everything. so, a crash because of reasons — and this guy sure has mapped everything, because, yeah, we just map everything so that we can disassemble it. but look at this guy here, he's mapping the same library multiple times, which is kind of silly — you can certainly avoid that. anyway, i got a little carried away just messing with it there, but how cool is that, seriously. i guess we can bring it back right away, so neat. okay, this thing where i click on something that's wide and it scrolls the view is so irritating, i don't know why i never go and fix it — you click on something that's not a wide piece of text and it's fine, but if you click on something long, this very jarring horizontal scroll happens; i don't know what's up with that. but very cool, very very cool. maybe we should also show the executable name here in this view, not just the thread id — that would be cool, so let's do that: SamplesModel — thread id, executable name, let's say. oh wait, where is that even? we only have the event — oh yeah, right, but we can cheat: we can get the profile's process for the event's tid, which is again more abuse of the pid/tid relationship — FIXME: more abuse of the pid/tid relationship. so we'll try this, and if we don't find it we'll just return nothing, the empty string. this should be very interesting — and let's also verify that profiling a single program still works, i don't want to break that... i definitely broke that. okay — SamplesModel line 65, i forgot to implement column_name for the executable column; that's a fair, easy fix. oops. yeah, so this still works, and now you can see the executable right here. very, very cool.
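as a small illustration, the lookup behind that new column boils down to something like this — find_process and the member names are paraphrases of the profiler-side API being added, and the helper itself is hypothetical:

```cpp
// Hypothetical helper mirroring the SamplesModel "Executable" column logic.
String executable_name_for_event(Profile const& profile, Profile::Event const& event)
{
    // FIXME: More abuse of the pid/tid relationship (a process's main thread
    //        shares its pid), so this only works for the main thread.
    if (auto const* process = profile.find_process(event.tid))
        return process->executable;
    return {}; // unknown process: just show an empty string
}
```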
Now let's grab a big boy profile. What do we want to profile this time? Maybe something that we rarely profile, like the Fire demo; I wonder what it's doing. And when you're running something like this, you also kind of wonder not just what the Fire demo is doing, but what the WindowServer is doing compositing all of this stuff, and what kind of time we're spending on that. Previously we haven't been able to see that, but this should give us some ability to see it as well. So here we can see the page flipping actually happening, triggered by the WindowServer, of course, quite aggressively, because it's drawing at as many frames per second as we allow it to, which is 60. And the blit is heavy. Oh wait, that's cool: /proc/all is very high up here, so these guys are kind of busy. Interesting. And we can see the executable here; this is just so cool, man.

Okay, I need to stop fawning over the code and make some commits. So what do we even have here? We converted the format of the thing, right, to have it as a separate thing, so let's do that first. Yes, yes, yes, yes, yes, and yes. Right, this one is kind of questionable: when we tick, we now always update the process in the perf events buffer, so it's a little bit weird, but we'll live with it for now. But this is very, very good. Yes, looks good. Yup, yup, you too, yep, even you. I'm only leaving out the msyscall hack; we'll put that in a separate commit so I can explain it.

Kernel + Profiler: capture metadata about all profiled processes. The perfcore file format was previously limited to a single process, since the pid and the executable regions data were top-level in the JSON. This patch moves the process-specific data into a top-level array named "processes", and we now add an entry for each process that has been sampled during the profile run. This makes it possible to see stacks from samples from multiple processes when viewing a perfcore file with Profiler. This is extremely cool; I'm even going to add an exclamation point to that one.
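To make the new layout concrete, a perfcore file would now look roughly like this. Illustrative only: the top-level "processes" array is the part named in the commit message above, while the individual field names and values here are made up.

```json
{
    "events": [
        { "type": "sample", "tid": 14, "timestamp": 123456, "stack": ["0x0804a2c0", "0x08049f10"] }
    ],
    "processes": [
        {
            "pid": 14,
            "executable": "/bin/Browser",
            "regions": [
                { "base": "0x08048000", "size": 212992, "name": "/bin/Browser" },
                { "base": "0x40000000", "size": 778240, "name": "/usr/lib/libcrypto.so" }
            ]
        }
    ]
}
```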
And here is the hack. Hmm, actually, do we even need this hack at the moment? Right now we're capturing the process information on every sample anyway, so we could live without it for now, and I think that's actually the better option. Then we can think more about it, because it wasn't debilitatingly slow to do it this way, which I thought it was going to be. So I think we can live with this for a moment while we get a chance to think about other ways to approach it, and then we don't need to introduce this awkward hack and do stuff in msyscall that doesn't really have anything to do with what msyscall is supposed to be. I think that's generally a good idea.

So let's do something else fun, like open a Markdown file and have it load. Very cool. And disable profiling... and that is very, very interesting: here again LibCrypto shows up, because that Markdown file had a screenshot from the README, which it was fetching from GitHub since it's a GitHub HTTPS URL. We can see that showing up here now, which is very neat. And where is the Markdown parsing itself, does it even show up? I guess it's just not significant. Dude. Oh yeah, and for the multi-process browsing stuff, it's so much better to have this multi-process profile view, because now you can actually see all of the moving parts and what they do. And if you watched the video a week ago or so when I was profiling startup to figure out what was going on and try to improve startup performance, there was this part where I couldn't see what was happening, because in the profile it looked like nothing was happening for a while, since I was only profiling the text editor starting up. But in this new world, if nothing is happening in one program, we should be able to see what's happening in the other programs at the same time, so those kinds of mysteries will no longer get in our way. That's very cool. memcpy... there are just infinitely many things to investigate here. So cool.

Anyway, let's not get carried away. Let's add a big fat FIXME here: FIXME: this is very nasty, we dump the current process's address space layout every time it's sampled; we should figure out a way to do this less often. There might be some elegant solution to this. For example, you could have a version number on the address space object and increment it whenever you mutate the address space, and then you would just keep track of which version of the address space you've already captured, or something like that. There are many different approaches we could take; I'm just not going to worry about them right now, because we've been sitting here for a while and I need a nice little break. So we're just going to amend that to the previous commit, and maybe enjoy it one last time for today.
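For the record, the version-number idea from that FIXME could take a shape roughly like this. It's a sketch of the idea, not what the code currently does; every name here, like bump_version() and add_process_if_changed(), is made up for illustration.

```cpp
// Sketch: the address space object carries a version counter that is bumped
// on every mutation, and the perf event buffer only re-captures a process's
// regions when that counter has changed since the last capture.
class Space {
public:
    u64 version() const { return m_version; }
    void bump_version() { ++m_version; } // call from every mutating path (mmap, munmap, mprotect, ...)

private:
    u64 m_version { 0 };
};

void PerformanceEventBuffer::add_process_if_changed(Process& process)
{
    u64 current_version = process.space().version();
    auto last_seen = m_last_seen_space_version.get(process.pid().value());
    if (last_seen.has_value() && *last_seen == current_version)
        return; // Address space unchanged since we last dumped it; skip the expensive capture.

    add_process(process);
    m_last_seen_space_version.set(process.pid().value(), current_version);
}

// Assumed extra member on PerformanceEventBuffer:
//     HashMap<pid_t, u64> m_last_seen_space_version;
```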
Obviously there are still various things to do with this, like making the tree view more process-aware, doing multi-threading support, and adding more UI to take advantage of all this new information and find interesting ways to visualize it. You could imagine doing a timeline: you could split it up per thread, for example, so you would see every thread as its own timeline, and if you're running in multi-processor mode you could do one timeline per processor. There's so much stuff we can do.

So let's profile the JavaScript test suite and see what the system is actually doing while we're running it. Probably not much other than the JavaScript, but we can see which programs are running. One thing you could do here is allow filtering by process: you could have almost like a checkbox for every process somewhere, and then you could filter which processes you want to see, whether that's all processes, a specific set of processes, or the processes in a specific process group. Okay, so here we can see that Terminal comes in, WindowServer too; so other stuff does get scheduled while we're running test-js, since obviously we have to repaint the screen and everything. Menu applets come in, and this is stuff that also affects the runtime of the JavaScript test suite, so it's really cool to see that. And here we have the Taskbar, because of course, if you didn't notice, when you run the test suite and look at the taskbar entry for this terminal, we have a progress bar that shows the progress of the test suite, a feature that I am very fond of.

Anyway, I think this is going to be it for today. I'm obviously super happy with this feature and I'm going to keep iterating on it, but man, we are going to have fun with this. Very cool stuff. So if you made it this far, thank you for watching and for hanging out; I hope you saw something interesting here today, and I think this was really cool. If you have ideas for interesting visualizations we can do with this new data, please let me know, or if you profile the system and find something interesting, please let me know as well. Anyway, I'll see you all next time. Thanks for hanging out. Bye!
Info
Channel: Andreas Kling
Views: 6,448
Id: AK9H6_pR0BM
Length: 134min 48sec (8088 seconds)
Published: Tue Mar 02 2021